Share this page: Follow me to be notified of new articles:

PostgreSQL: Search dates on large tables with pg_IdFromDate()

Introduction

pg_IdFromDate() is a small set of PostgreSQL stored procedures allowing you to search a row based on a date or a timestamp in a few milliseconds, even on very large tables. It uses dichotomy to progressively eliminate ranges of rows until a small number of rows is found then a regular query allows it to find the ID of the row the closest to a specific date.
This process is very different from how the database traditionally works and allows you to find the row corresponding to a date in less than 10 ms for a table with > 100 million lines.

Try it yourself

If you would like to try the examples yourself, please create the following table:

Source code

You can find the source code on Github:
Watch Star Fork Download

How does it works?

pg_IdFromDate() uses a simple mathematical concept called dichotomy. Simply explained, we separate the table into 2 groups, A and B. We will search the middle between A and B and find if the date is below or above this middle row. If the date is below (it's within group A), we will subdivide group A into 2 sub-groups, A' and A'' and repeat the same operation until the final group has less than 50 rows.
When that happens we will run a simple query to find the matching row within that range.
pg_IdFromDate() takes 3 input: the table name, the name of the "date" column and the timestamp to search.

Examples

Here are a few examples showing how pg_IdFromDate() works:
-- Search a timestamp
gab@gab # SELECT pg_IdFromDate('test', 'date', '2015-08-09 16:53:00');
 pg_idfromdate 
---------------
          5298
(1 row)
Time: 3.710 ms

-- Search a date
SELECT pg_IdFromDate('test', 'date', '2015-09-03');
 pg_idfromdate 
---------------
          5881
(1 row)
Time: 4.754 ms

-- Search a dynamic timestamp
SELECT pg_IdFromDate('test', 'date', (NOW() - interval '1 week')::timestamp);
 pg_idfromdate 
---------------
          2134
(1 row)
Time: 6.868 ms

-- Search a date with no matching result
SELECT pg_IdFromDate('test', 'date', '2016-08-08 23:32:45');
 pg_idfromdate 
---------------
        {NULL}
(1 row)
Time: 3.710 ms

Benchmark

Here is a benchmark comparing the performance of pg_IdFromDate() versus a regular SQL query:
Method "...WHERE date = '[DATE]';" pg_IdFromDate() Result
Querying a timestamp (no index) 25,225.50 ms 7.07 ms pg_IdFromDate() is 3,568x faster
Querying a date (no index) 41,174.18 ms 6.56 ms pg_IdFromDate() is 6,276x faster
Querying a timestamp (+ index) 12.44 ms 7.11 ms pg_IdFromDate() is 2x faster
Querying a date (+ index) 43,361.46 ms 7.27 ms pg_IdFromDate() is 5,964x faster

The detail of the queries used for the benchmark are available on the Github project.

Limitations

pg_IdFromDate() is a project that I hope will evolve soon. Here are the current limitations:

Conclusion

This tool was developed to easily find rows by dates in tables with hundred of million of rows. It's methodology allows to search rows efficiently even on very large tables.
You can view and download the project on Github.
Last update of this page: March 27, 2017
Please do not copy any of the content of this website without asking me first.