Databases: Indexes

Premature optimization is the root of all evil.

Introduction

Premature optimization is the root of all evil. (Donald Knuth)

In this one, we're taking a look at another important database pattern: database indexing. We won't discuss the implementation of any relational DBMS, just the theory. In 99% of cases, that's all you need.

When you're querying records at scale, this should be something you're familiar with to improve performance. When you're writing records at scale, this should be something you're familiar with to avoid premature optimizations and performance degradation.

In less words: indexing speeds up reads from the database

SELECT * FROM my_table

and slows down writes to the database.

INSERT INTO (x) VALUES (...)

If this doesn't make sense yet, don't fret.

Think of this as doing work up front. If you're writing a program at work, or in an interview, sometimes your code might call for some sort of pre-processing of the input data to sustain an efficient algorithm. It's the same concept with databases.

But, to get your program to run faster, you usually have to give something up. We'll talk about this in a bit.

Theory

We'll start by thinking about what happens when you don't have indexes. If you've ever run something like EXPLAIN [query] in your SQL console, you'll notice that some of the time is taken up by a row-scan.

Each time you run a read query, this row-scan occurs. If you don't have any indexes, this will typically do a linear scan on your entire table.

In small tables, this is okay. In large tables, this can be awful.

If you have a table with millions of records and many columns, your read queries will be incredibly slow without proper indexing. If your database has to run a linear row-scan through millions of records, all while dealing with writes and locks... good luck.

Database indexes are essentially pointers in your database, typically mapped in a B-Tree. Your database can then sort candidate records by this key (index), and search rows more efficiently.

Instead of doing a row-scan on the origin table, the database will go through a list of pointers based on your search conditions and fetch them directly.

INDEX COLUMN | POINTER
-------------------------
indexed_id_1 | pointer_1
indexed_id_2 | pointer_2
-------------------------

Implementation

In most database management systems (eg: PostgreSQL), primary keys are indexed by default. This also applies to foreign key constraints you add to a table.

Let's say we have a table called person:

CREATE TABLE person (
	person_id serial PRIMARY KEY,
	username VARCHAR ( 50 ) UNIQUE NOT NULL,
	password VARCHAR ( 50 ) NOT NULL,
	email VARCHAR ( 255 ) UNIQUE NOT NULL,
	created_on TIMESTAMP NOT NULL,
 last_login TIMESTAMP
);

We already index on person_id, since it's the primary key, and we currently have no foreign key constraints. But, for our application, we might want to query person(s) effectively based on their email. If we include email as a search parameter frequently (and/or with large data loads), it's probably best to add an index.

CREATE INDEX idx_person_email
ON person(email);

If we had run EXPLAIN SELECT on our query, the planner would have given us an operation: Seq Scan, which is a linear row scan, including/excluding all rows in the table by email. If we have millions or billions of people in our table, this would be awful.

QUERY PLAN
----------
Seq Scan on person (..)

After indexing on the email column, if we were to run EXPLAIN SELECT again, we would get a new operation.

QUERY PLAN
----------
Index Scan using idx_person_email (...)

Much better for our larger tables, but there's more to consider.

Considerations

The flip side of this performance boost is writing to your database. Writes on large (and frequently used) tables can already be expensive. Now, if we have a large table with indexed columns, we need to update this index data structure for each insertion or update. For every column that is indexed.

This is why indexes should still be used sparingly. Many an engineer discover indexing for the first time and go wild. Then their table gets bigger: it's still blazing fast to query, but it's slow to write and update records.

Get rid of those unnecessary indexes. You probably only use a few of them: this is a prime and common case of premature optimization.

Conclusion

Indexes are an essential tool for any database usage at scale. They should be applied judiciously, only when needed for faster reads, and avoided when not absolutely required.

If you want some further reading on implementations, take a look at the resources below.

I hope you enjoy; until next time.