Ariel Rodriguez Romero

Learning DB optimizations

It all started with the following query:

SELECT * FROM "downloader_jobs"
WHERE (
  "downloader_jobs"."created_at" > '2017-02-05T00:00:00+00:00'::timestamptz AND
  "downloader_jobs"."no_retry" = false AND
  "downloader_jobs"."retried" = false AND
  "downloader_jobs"."success" = false
)
ORDER BY "downloader_jobs"."id" ASC
LIMIT 2001

We found it in the logs, where it was taking more than 2 seconds to run. At the time of writing, there were 4,376,662 rows in the downloader_jobs table, which in the era of big data doesn’t seem like much.

Taking a look at the query plan with EXPLAIN ANALYZE, we get:

 Seq Scan on public.downloader_jobs  (cost=0.00..149446.10 rows=857608 width=177) (actual time=716.124..716.124 rows=0 loops=1)
   Output: id, downloader_task, accession_code, ...
   Filter: ((NOT downloader_jobs.success) AND (NOT downloader_jobs.no_retry) AND (NOT downloader_jobs.retried) AND (downloader_jobs.created_at > '2017-02-05 00:00:00'::timestamp without time zone))
   Rows Removed by Filter: 4376662
 Planning time: 0.082 ms
 Execution time: 716.151 ms
(6 rows)
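For reference, a plan like the one above can be obtained by prefixing the query with EXPLAIN. This is the modified query without the ORDER BY clause (the full column list in the Output line is elided):

```sql
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM "downloader_jobs"
WHERE (
  "downloader_jobs"."created_at" > '2017-02-05T00:00:00+00:00'::timestamptz AND
  "downloader_jobs"."no_retry" = false AND
  "downloader_jobs"."retried" = false AND
  "downloader_jobs"."success" = false
);
```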

That is the plan for a slightly modified query without the ORDER BY clause, but it already gives some insight into the problem. A sequential scan on a large table is often a bad sign: every single row has to be loaded just to filter out the ones that don’t match the conditions.

In Django, creating indexes is not complicated; the real question is what to index. It’s been a while since I left school, and I haven’t done much database work since, beyond the occasional discussion of the theory. I started reading this SQL indexing and tuning book to gain a better understanding of the whole thing, and I learned a lot.

It turns out that each index is a B-tree (which I think I knew) that sorts whatever keys you specify into a tree structure. The main advantage is that searching on those keys takes logarithmic time. The leaf nodes are connected in a doubly linked list, so the B-tree can be used to find the range of rows that might fulfill a given query. Databases have a query planner module that decides how to execute each query. The gist, as far as indexes go, is that the conditions of a query that match the index keys can be answered by searching the B-tree, which is logarithmic. The filters that can’t be used this way have to be applied sequentially to ALL the rows the B-tree finds, and this is the big problem, because all of those rows potentially have to be loaded into memory to apply the remaining conditions.
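As a sketch of how this plays out with our query (the index name here is invented): with an index on created_at alone, the planner can descend the B-tree to the first entry past the cutoff date in logarithmic time, but the boolean conditions are not part of the index keys, so they still have to be checked row by row on everything the index scan returns.

```sql
-- Hypothetical single-column index:
CREATE INDEX downloader_jobs_created_at ON downloader_jobs (created_at);

-- The B-tree answers "created_at > cutoff" directly; the three boolean
-- filters are applied one row at a time to every match it finds:
SELECT * FROM downloader_jobs
WHERE created_at > '2017-02-05T00:00:00+00:00'::timestamptz
  AND no_retry = false AND retried = false AND success = false;
```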

There’s still the question of which fields to add to the index. For this, it’s very important to know the data and how it’s queried. A quick search for usages of the downloader_jobs table reveals that the following sets of parameters are used together, each set in a different query:

  • success, retried, no_retry, created_at
  • success, retried, no_retry, created_at, start_time, end_time
  • success
  • id

id is the primary key and already has an index that PostgreSQL added automatically when the table was created. For the others, it seems like a single index on the fields success, retried, no_retry, created_at, in that order, would suffice.
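In raw SQL, that candidate index would look something like this (the index name is my own; the three boolean equality columns come before the created_at range column, for reasons that come up again below):

```sql
CREATE INDEX downloader_jobs_status_created_at
ON downloader_jobs (success, retried, no_retry, created_at);
```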

However, upon further investigation, it’s not a good idea to create an index on a boolean column, because “it is cheaper to sequentially read the whole table than to use random I/O on the index and the table if a high percentage of the table has to be retrieved”. So it seems like the ideal solution is a partial index on created_at, with the conditions on success, retried, and no_retry moved into the index’s WHERE clause.
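In SQL, a partial index along those lines would look something like this (the index name is invented):

```sql
-- A partial index only contains the rows matching its WHERE clause, so it
-- stays small: jobs that succeeded, were retried, or were marked no_retry
-- never appear in the index at all.
CREATE INDEX downloader_jobs_pending_created_at
ON downloader_jobs (created_at)
WHERE success = false AND retried = false AND no_retry = false;
```

The planner can only use a partial index for queries whose conditions imply the index’s WHERE clause, which is exactly the case for the slow query at the top of this post.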

I was also tempted to put three fields in the index, created_at, start_time, and end_time, but created_at is always used as a range: we select all jobs that were created after a given date. Adding more columns after it would only have been a good idea if many rows shared the same created_at, or if the queries filtered on it with an equality instead of a range. Since it’s very unlikely that two jobs were created at exactly the same time, sorting rows by start_time within each created_at value is irrelevant, and those extra columns wouldn’t help narrow the search.
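To make the column-ordering argument concrete (the index name and the start_time condition here are hypothetical, just for illustration):

```sql
-- Entries are sorted by created_at first, and by start_time only among
-- entries with equal created_at values.
CREATE INDEX downloader_jobs_created_start
ON downloader_jobs (created_at, start_time);

-- Because created_at is a range condition, the B-tree can only narrow the
-- scan by created_at; every entry after the cutoff still has to be visited
-- and its start_time checked individually:
SELECT * FROM downloader_jobs
WHERE created_at > '2017-02-05T00:00:00+00:00'::timestamptz
  AND start_time IS NOT NULL;
```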

Random key points I saw while reading

  • It’s important not to add too many indexes because it degrades the performance of INSERT, UPDATE and DELETE operations.

  • Rule of thumb: Index for equality first—then for ranges.

  • Leaf node traversal is the big problem.

Conclusion

I submitted everything in a pull request to refine.bio. The change itself is fairly simple after all the reading, but I think it was important to build a good background in order to be more confident in the decision.

I just discovered these tools, and it already feels like a superpower. I’m going to try to be more mindful when writing queries in the future, and think about how they might be executed in the DB.
