How to do ORDER BY RANDOM() on large tables?

Published on Mar. 27, 2017 by Gabriel Bordeaux

Introduction

Getting a random row from a PostgreSQL table has numerous use cases. To process an instruction like "ORDER BY RANDOM()", PostgreSQL has to fetch all rows and then pick one randomly. It's a fast process on small tables with up to a few thousand rows but it becomes very slow on large tables. This article will present examples and a tentative solution.

Try it yourself

If you would like to try the examples yourself, please create this table and use the following query to insert 100 million random rows:

-- Create table -big_data-
CREATE TABLE big_data (
   id serial unique,
   some_data text
);

-- Create an index on the ID
CREATE INDEX ON big_data (id);

-- Insert 100 million test rows
-- You can adapt the number of rows to reproduce the test with a smaller dataset.
INSERT INTO big_data (some_data)
SELECT md5(random()::text) AS some_data
FROM generate_series(1,100000000);

Using "ORDER BY RANDOM() LIMIT 1"

Let's run a basic query to fetch a random row from the table:

gab=# SELECT * FROM big_data ORDER BY RANDOM() LIMIT 1;
    id    |            some_data             
----------+----------------------------------
 63303598 | 791f7773fbdecd71176543baa163b6eb
(1 row)

Time: 52294.049 ms

The query took over 52 seconds. That's because PostgreSQL had to fetch all rows from the table to then select one as you can see below:

gab=# EXPLAIN ANALYZE SELECT * FROM big_data ORDER BY RANDOM() LIMIT 1;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2583335.40..2583335.40 rows=1 width=37) (actual time=65955.155..65955.156 rows=1 loops=1)
   ->  Sort  (cost=2583335.40..2833335.60 rows=100000080 width=37) (actual time=65955.154..65955.154 rows=1 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on big_data  (cost=0.00..2083335.00 rows=100000080 width=37) (actual time=0.940..31292.882 rows=100000000 loops=1)
 Planning time: 0.106 ms
 Execution time: 65955.191 ms
(7 rows)

Time: 65955.640 ms

A more efficient solution

An efficient solution is to fetch the maximum ID from the table (which is very quick because PostgreSQL keeps the highest ID in cache in order to retrieve the next one for the next insert) and multiply it with RANDOM() which generates a random value between 0 and 1. As this calculation will return a number with decimals, we can get the nearest integer with ROUND(). Finally, we will use a CASE condition to manage the very edge case where RANDOM() would be '0'. In this case we will take '1' as the selected ID.

Here are example queries to illustrate the process:

gab=# SELECT MAX(id) FROM big_data;
    max    
-----------
 100000000
(1 row)Time: 0.837 ms

gab=# SELECT RANDOM() * (SELECT MAX(id) FROM big_data) as id;
        id        
------------------
 43463763.5014951
(1 row)
Time: 0.615 ms

gab=# SELECT ROUND(RANDOM() * (SELECT MAX(id) FROM big_data)) as id;
    id    
----------
 52114526
(1 row)
Time: 0.752 ms

SELECT CASE WHEN id = 0 THEN 1 ELSE id END
FROM (SELECT ROUND(RANDOM() * (SELECT MAX(id) FROM big_data)) as id) as r;
    id    
----------
 76692652
(1 row)
Time: 0.797 ms

The first idea that comes into our mind is to use the above query in a WHERE clause. However, as this example illustrates, it's about as slow as our initial "ORDER BY RANDOM() LIMIT 1":

gab=# SELECT *
FROM big_data
WHERE id = (
             SELECT CASE WHEN id = 0 THEN 1 ELSE id END
             FROM (SELECT ROUND(RANDOM() * (SELECT MAX(id) FROM big_data)) as id) as r
           );
    id    |            some_data             
----------+----------------------------------
 68848571 | 9cd22661cf8db1fbb7883f4569b14137
(1 row)
Time: 27502.304 ms

gab=# EXPLAIN ANALYZE SELECT *
FROM big_data
WHERE id = (
             SELECT CASE WHEN id = 0 THEN 1 ELSE id END
             FROM (SELECT ROUND(RANDOM() * (SELECT MAX(id) FROM big_data)) as id) as r
           );
                                                                                             QUERY PLAN                                                                                              
------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on big_data  (cost=0.65..2333335.85 rows=500000 width=37) (actual time=24565.030..30117.638 rows=1 loops=1)
   Filter: ((id)::double precision = $2)
   Rows Removed by Filter: 99999996
   InitPlan 3 (returns $2)
     ->  Subquery Scan on r  (cost=0.61..0.65 rows=1 width=8) (actual time=0.078..0.078 rows=1 loops=1)
           ->  Result  (cost=0.61..0.63 rows=1 width=0) (actual time=0.073..0.073 rows=1 loops=1)
                 InitPlan 2 (returns $1)
                   ->  Result  (cost=0.60..0.61 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)
                         InitPlan 1 (returns $0)
                           ->  Limit  (cost=0.57..0.60 rows=1 width=4) (actual time=0.058..0.058 rows=1 loops=1)
                                 ->  Index Only Scan Backward using big_data_id_idx on big_data big_data_1
                                     (cost=0.57..3680090.97 rows=100000080 width=4) (actual time=0.056..0.056 rows=1 loops=1)
                                       Index Cond: (id IS NOT NULL)
                                       Heap Fetches: 1
 Planning time: 0.304 ms
 Execution time: 30117.698 ms
(15 rows)
Time: 30118.620 ms

As we can see above, the database engine tried to match every row from the table and had to remove almost all of them (see "Rows Removed by Filter: 99999996").

The most efficient solution is to use 2 queries (the first one to calculate the ID and the second one to retrieve the corresponding row):

gab=# SELECT CASE WHEN id = 0 THEN 1 ELSE id END
FROM (SELECT ROUND(RANDOM() * (SELECT MAX(id) FROM big_data)) as id) as r;
    id    
----------
 45125146
(1 row)
Time: 0.511 ms

SELECT * FROM big_data WHERE id = 45125146;
    id    |            some_data             
----------+----------------------------------
 45125146 | 5589c2f8f711ce9c149b5d2e05b99afb
(1 row)
Time: 2.115 ms

There is no way to keep PostgreSQL from scanning more than one row in a single query (neither Common Table Expressions or JOIN will solve this issue). Using 2 queries is acceptable, however, this solution to the problem has a major flaw: if any row was created then deleted, it might calculate an ID that is no longer in the table.

How to manage missing IDs?

If some rows were deleted from our "big_data" table, the randomized ID generation might generate an ID missing from the table. A code solution could be to loop this query until a valid ID is found. To manage this aspect from the database side, we can create a mapping table with sequential IDs (from 1 to Nth rows in the "big_data" table) and its matching ID is "big_data":

-- Delete a few IDs from "big_data":
gab=# DELETE FROM big_data WHERE id IN (2, 4, 6);
DELETE 3
Time: 2.779 ms

-- Create a mapping table:
gab=# CREATE TABLE big_data_mapper (id serial, big_data_id int);
CREATE TABLE
Time: 4.628 ms

-- Adding an index on "id":
gab=# CREATE INDEX ON big_data_mapper (id);
CREATE INDEX
Time: 1.878 ms

-- Adding an index on "big_data_id":
gab=# CREATE INDEX ON big_data_mapper (big_data_id);
CREATE INDEX
Time: 1.434 ms

-- Fill the mapping table:
-- Note: you can also insert in the mapping table only a subset of the "big_data" table if you want to obtain a random ID from a subset of "big_data"
gab=# INSERT INTO big_data_mapper (big_data_id) SELECT id FROM big_data ORDER BY id;
INSERT 0 99999997
Time: 1246586.848 ms

Missing IDs from "big_data" are now matched with a sequential ID from "big_data_mapper":

gab=# SELECT * FROM big_data_mapper LIMIT 10;
 id | big_data_id 
----+-------------
  1 |           1
  2 |           3
  3 |           5
  4 |           7
  5 |           8
  6 |           9
  7 |          10
  8 |          11
  9 |          12
 10 |          13
(10 rows)
Time: 1.936 ms

We can now find a random ID from "big_data" with an initial query in the mapping table, a second query to retrieve the corresponding ID from "big_data" and a third one to retrieve the corresponding row in "big_data":

gab=# SELECT CASE WHEN id = 0 THEN 1 ELSE id END
FROM (SELECT ROUND(RANDOM() * (SELECT MAX(id) FROM big_data)) as id) as r;
    id    
----------
 69554584
(1 row)

Time: 0.840 ms

gab=# SELECT big_data_id FROM big_data_mapper WHERE id = 69554584;
 big_data_id 
-------------
    69554587
(1 row)
Time: 0.317 ms

gab=# SELECT * FROM big_data WHERE id = 69554587;
    id    |            some_data             
----------+----------------------------------
 69554587 | 54bf03a14ae39facc30a9ffdbfdf40cb
(1 row)
Time: 0.443 ms

The full process was done in 1.6 millisecond. Not bad for retrieving a random row from a table with 99,999,997 rows!

How to maintain the mapping table?

The problem of missing IDs will exist again if new rows are deleted from "big_data", if IDs are updated (this should not be done anyway!) or if new rows are added (they won't exist in the mapping table).

There are two solutions:

Re-generating the mapping table when needed (I won't detail that one as it's a very trivial process)
Creating triggers on the "big_data" table to update the mapping

Here are 3 function + triggers to maintain "big_data_mapper":

-- Insert a new row in big_data_mapper for each new row in big_data
CREATE OR REPLACE FUNCTION big_data_insert_map()
    RETURNS trigger AS
$BODY$
BEGIN
    INSERT INTO big_data_mapper (big_data_id) VALUES (NEW.id);
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER big_data_insert
    AFTER INSERT
    ON big_data
    FOR EACH ROW
EXECUTE PROCEDURE big_data_insert_map();

-- Update IDs in big_data_mapper for each ID updated in big_data
CREATE OR REPLACE FUNCTION big_data_update_map()
    RETURNS trigger AS
$BODY$
BEGIN
    IF NEW.id <> OLD.id THEN
        UPDATE big_data_mapper SET big_data_id = NEW.id WHERE big_data_id = OLD.id;
    END IF;
    RETURN OLD;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER big_data_update
    BEFORE UPDATE
    ON big_data
    FOR EACH ROW
EXECUTE PROCEDURE big_data_update_map();

-- Delete a matching row in big_data_mapper for each row deleted in big_data and shift big_data_mapper IDs
CREATE OR REPLACE FUNCTION big_data_delete_map()
    RETURNS trigger AS
$BODY$
BEGIN
    DELETE FROM big_data_mapper WHERE big_data_id = OLD.id;
    UPDATE big_data_mapper SET id = id - 1 WHERE big_data_id > OLD.id;
    RETURN OLD;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER big_data_delete
    BEFORE DELETE
    ON big_data
    FOR EACH ROW
EXECUTE PROCEDURE big_data_delete_map();

The major limitation is that the trigger used while deleting a row might be very slow to execute as it needs to shift all IDs above the one deleted.

Conclusion

The best method to find a random row from a huge table in a few milliseconds is:

Create a mapping table to be able to manage missing IDs from your big table
Maintain the mapping table with three triggers to manage INSERT, DELETE and UPDATE in your big table
Use 3 consecutive queries to fetch a random ID