PostgreSQL's subqueries: multiple columns, better and faster results

Published on Mar. 27, 2017 by Gabriel Bordeaux

Introduction

Subqueries are used every day by most developers and DBAs. We will present below several solutions to write the same subquery and get the same output using more or less resources from the database engine.

Try it yourself

If you would like to try the examples yourself, please create these 2 tables with some random content:

-- Create table "t1"
CREATE TABLE t1 (
   id serial unique,
   some_data text
);

-- Create an index on the ID
-- (note: the UNIQUE constraint above already creates an index on "id",
-- so this index is redundant; the same applies to t2's index on "id" below)
CREATE INDEX ON t1 (id);

-- Insert 10,000 test rows in t1
INSERT INTO t1 (some_data)
SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000);

-- Create table "t2"
CREATE TABLE t2 (
   id serial unique,
   t1_id int REFERENCES t1 (id),
   date date default null,
   some_data text
);

-- Create an index on the ID
CREATE INDEX ON t2 (id);

-- Create an index on the Foreign ID as well
CREATE INDEX ON t2 (t1_id);

-- Insert 2,000 test rows in t2
INSERT INTO t2 (t1_id, date, some_data)
SELECT (floor(random() * (SELECT MAX(id) FROM t1)) + 1)::int, -- Random ID from t1 (never 0, which would violate the foreign key)
       (SELECT date(now() - trunc(random() * generate_series) * '1 day'::interval)), -- Random date
       md5(random()::text) AS some_data
FROM generate_series(1,2000);

Sample data

For our examples, we just created 2 basic tables:

  • Table "t1" contains an ID and another column called "some_data" with a random hash
  • Table "t2" contains an ID, a foreign ID from "t1" called "t1_id" (you can read more about foreign keys here), a random date called "date" and "some_data" with a random hash

Here is a sample of the data from each table:

[perso] gab@gab # SELECT * FROM t1 LIMIT 5;
 id |            some_data             
----+----------------------------------
  1 | 3c18d658a44e3a98ebe8cae59571a5ff
  2 | 18ce0a04bf6e42a3200afca254511233
  3 | e21f8a35589ba9bd36e12343c60e164d
  4 | 0f048ab9343531cc29fae8d2cd609d77
  5 | 5ba44f912e387ce4ab3f889e7030a8cd
(5 rows)
Time: 1.610 ms

[perso] gab@gab # SELECT * FROM t2 LIMIT 5;     
 id | t1_id |    date    |            some_data             
----+-------+------------+----------------------------------
  1 |  8545 | 2015-03-28 | dbe7d97388bcea4918f114721f5d6a7c
  2 |  8900 | 2015-03-28 | e9c223c20a4a8cb8e20b722930313b3f
  3 |  4757 | 2015-03-26 | 013ac690bd5a0cf24e00920fbd8f9e11
  4 |  3022 | 2015-03-28 | 128238862f4bf797de61ab9429ffb1fe
  5 |  1505 | 2015-03-28 | f152a0a670a36c012d17258023ac4d99
(5 rows)
Time: 0.499 ms

We will have 2 goals for the following demonstration:

  • Find the number of "t1_id" from "t1" in "t2"
  • Find the most recent date in "t2" corresponding to a row in "t1"

These are some very basic examples written to help you understand the different ways to write subqueries. You can easily adapt them to your own situations.
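
The "Time: x ms" lines shown after each query below come from psql (an assumption on our side, but it is the standard way to get them); if you are following along, you can display them too with the \timing meta-command:

-- In psql, toggle the display of query execution time
\timing on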

Basic subqueries

The most basic solution is to write 2 subqueries within the SELECT clause, one for each piece of data we need (the number of rows in "t2" where "t1_id" matches "t1.id", and the most recent corresponding date in "t2"):

[perso] gab@gab # SELECT t1.id, 
       (SELECT MAX(date) FROM t2 WHERE t1_id = t1.id) as max_date,
       (SELECT count(*) FROM t2 WHERE t1_id = t1.id) as total
FROM t1
ORDER BY total DESC, t1.id DESC
LIMIT 10;
  id  |  max_date  | total 
------+------------+-------
 2253 | 2015-02-20 |     5
 6572 | 2014-09-16 |     3
 6359 | 2013-06-29 |     3
 5863 | 2015-02-19 |     3
 5579 | 2015-02-22 |     3
 4327 | 2015-01-26 |     3
 4321 | 2014-09-16 |     3
 3293 | 2014-09-17 |     3
 3051 | 2015-01-28 |     3
 3017 | 2014-12-03 |     3
(10 rows)
Time: 67.188 ms

As you can see below, this is not a very efficient way to proceed: the engine first scans all rows from "t1" (see "Seq Scan on t1 ([...]rows=10000)"), then tries to match rows from "t2" twice (once for each subquery), and has to loop 10,000 times to do it.
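
To reproduce the plan below, prefix the query with EXPLAIN ANALYZE (your timings and row counts will differ, since the data is random):

EXPLAIN ANALYZE
SELECT t1.id,
       (SELECT MAX(date) FROM t2 WHERE t1_id = t1.id) as max_date,
       (SELECT count(*) FROM t2 WHERE t1_id = t1.id) as total
FROM t1
ORDER BY total DESC, t1.id DESC
LIMIT 10;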

                                                                       QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=166550.10..166550.12 rows=10 width=4) (actual time=68.870..68.873 rows=10 loops=1)
   ->  Sort  (cost=166550.10..166575.10 rows=10000 width=4) (actual time=68.869..68.871 rows=10 loops=1)
         Sort Key: ((SubPlan 2)), t1.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on t1  (cost=0.00..166334.00 rows=10000 width=4) (actual time=0.049..65.128 rows=10000 loops=1)
               SubPlan 1
                 ->  Aggregate  (cost=8.30..8.31 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=10000)
                       ->  Index Scan using t2_t1_id_idx on t2  (cost=0.28..8.29 rows=1 width=4)
                                                                         (actual time=0.002..0.002 rows=0 loops=10000)
                             Index Cond: (t1_id = t1.id)
               SubPlan 2
                 ->  Aggregate  (cost=8.30..8.31 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=10000)
                       ->  Index Only Scan using t2_t1_id_idx on t2 t2_1  (cost=0.28..8.29 rows=1 width=0)
                                                                         (actual time=0.002..0.002 rows=0 loops=10000)
                             Index Cond: (t1_id = t1.id)
                             Heap Fetches: 2000
 Planning time: 0.403 ms
 Execution time: 68.957 ms
(16 rows)

Better solutions

WITH Queries (Common Table Expressions)

Common Table Expressions, aka "WITH Queries", are my go-to in PostgreSQL when I need to fetch a subset of data from different tables. For the problem we are trying to solve, they offer a very simple and convenient approach:

  • We will first fetch data from the subquery
  • Then we will join this subset of data to the main query

Here is an example, starting with the subquery on its own:
[perso] gab@gab # SELECT t1_id, MAX(date) as max_date, count(*) as total
                                 FROM t2
                                 GROUP BY t1_id;
 t1_id |  max_date  | total 
-------+------------+-------
  1074 | 2013-08-27 |     1
  7494 | 2014-07-16 |     1
  8908 | 2014-12-04 |     1
  [...]
  6755 | 2014-12-09 |     2
Time: 6.361 ms

Now that we fetched all data from "t2" grouped by "t1_id", we can use this subset of data in a "WITH Query":

[perso] gab@gab # WITH td2_sub AS (
                  SELECT t1_id, MAX(date) as max_date, count(*) as total
                  FROM t2
                  GROUP BY t1_id
)
SELECT t1.id, td2_sub.max_date, td2_sub.total
FROM t1
JOIN td2_sub ON td2_sub.t1_id = t1.id
ORDER BY td2_sub.total DESC, t1.id DESC
LIMIT 10;
  id  |  max_date  | total 
------+------------+-------
 2253 | 2015-02-20 |     5
 6572 | 2014-09-16 |     3
 6359 | 2013-06-29 |     3
 5863 | 2015-02-19 |     3
 5579 | 2015-02-22 |     3
 4327 | 2015-01-26 |     3
 4321 | 2014-09-16 |     3
 3293 | 2014-09-17 |     3
 3051 | 2015-01-28 |     3
 3017 | 2014-12-03 |     3
(10 rows)
Time: 11.163 ms

You can see that the output is the same, but the query ran ~6 times faster.

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=501.05..501.08 rows=10 width=16) (actual time=12.398..12.401 rows=10 loops=1)
   CTE td2_sub
     ->  HashAggregate  (cost=54.00..72.02 rows=1802 width=8) (actual time=2.606..3.308 rows=1802 loops=1)
           Group Key: t2.t1_id
           ->  Seq Scan on t2  (cost=0.00..39.00 rows=2000 width=8) (actual time=0.011..0.483 rows=2000 loops=1)
   ->  Sort  (cost=429.03..433.54 rows=1802 width=16) (actual time=12.396..12.397 rows=10 loops=1)
         Sort Key: td2_sub.total, t1.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Hash Join  (cost=309.00..390.09 rows=1802 width=16) (actual time=8.375..11.823 rows=1802 loops=1)
               Hash Cond: (td2_sub.t1_id = t1.id)
               ->  CTE Scan on td2_sub  (cost=0.00..36.04 rows=1802 width=16) (actual time=2.612..4.240 rows=1802 loops=1)
               ->  Hash  (cost=184.00..184.00 rows=10000 width=4) (actual time=5.736..5.736 rows=10000 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 352kB
                     ->  Seq Scan on t1  (cost=0.00..184.00 rows=10000 width=4) (actual time=0.009..2.577 rows=10000 loops=1)
 Planning time: 0.391 ms
 Execution time: 12.535 ms
(16 rows)

The initial WITH Query from "t2" generated 1,802 rows. We were able to match them very efficiently to "t1" without any loops.
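
A side note on CTE behavior: on the PostgreSQL versions this article was written against (pre-12), a WITH Query is always materialized, i.e. computed once and then joined. Since PostgreSQL 12 the planner may inline a CTE into the main query; if you are on 12 or later, you can make the original behavior explicit:

-- PostgreSQL 12+ only: MATERIALIZED forces the CTE to be computed once
-- (NOT MATERIALIZED would allow the planner to inline it instead)
WITH td2_sub AS MATERIALIZED (
   SELECT t1_id, MAX(date) as max_date, count(*) as total
   FROM t2
   GROUP BY t1_id
)
SELECT t1.id, td2_sub.max_date, td2_sub.total
FROM t1
JOIN td2_sub ON td2_sub.t1_id = t1.id
ORDER BY td2_sub.total DESC, t1.id DESC
LIMIT 10;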

Moving the subquery to the FROM clause

The other valid solution is to move the subquery to the FROM clause. I personally prefer WITH Queries, but it happens quite often that I have to work with an RDBMS that doesn't support them (who said MySQL?).

It is pretty easy to rewrite this query by moving the subquery to the FROM clause: basically, what was inside WITH xxx AS (...) moves between JOIN and ON:

[perso] gab@gab # SELECT t1.id, td2_sub.max_date, td2_sub.total
FROM t1
JOIN  (
            SELECT t1_id, MAX(date) as max_date, count(*) as total
            FROM t2
            GROUP BY t1_id
      ) td2_sub ON td2_sub.t1_id = t1.id
ORDER BY td2_sub.total DESC, t1.id DESC
LIMIT 10;
  id  |  max_date  | total 
------+------------+-------
 2253 | 2015-02-20 |     5
 6572 | 2014-09-16 |     3
 6359 | 2013-06-29 |     3
 5863 | 2015-02-19 |     3
 5579 | 2015-02-22 |     3
 4327 | 2015-01-26 |     3
 4321 | 2014-09-16 |     3
 3293 | 2014-09-17 |     3
 3051 | 2015-01-28 |     3
 3017 | 2014-12-03 |     3
(10 rows)
Time: 14.332 ms

As you can see, the processing time is about the same as with the WITH Query (11.163 ms vs 14.332 ms). That's because PostgreSQL uses an essentially equivalent execution plan for both queries.
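
You can verify this yourself by running EXPLAIN ANALYZE on this version too; you should see a very similar plan to the CTE one above (a HashAggregate over "t2" hash-joined to "t1"), with costs and timings that vary with your random data:

EXPLAIN ANALYZE
SELECT t1.id, td2_sub.max_date, td2_sub.total
FROM t1
JOIN (
       SELECT t1_id, MAX(date) as max_date, count(*) as total
       FROM t2
       GROUP BY t1_id
     ) td2_sub ON td2_sub.t1_id = t1.id
ORDER BY td2_sub.total DESC, t1.id DESC
LIMIT 10;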

Conclusion

We analyzed 3 methods of writing subqueries and compared their performance. Each example gives the exact same output, but the last two, in which the subqueries were moved out of the SELECT clause, are about 6 times faster than the regular subqueries:
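
 Method                          |    Time
---------------------------------+-----------
 Subqueries in the SELECT clause | 67.188 ms
 WITH Query (CTE)                | 11.163 ms
 Subquery in the FROM clause     | 14.332 ms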