Average ignoring extremes (outliers) with PostgreSQL

Published on Mar. 27, 2017 by Gabriel Bordeaux

Introduction

In some cases, you may want a "trimmed average" instead of a regular average. For example, if you try to calculate the average net worth in your appartment building and Bill Gates is living in the penthouse, it might throw off the data a little bit. In this specific case, it might make more sense to say that the average net worth in your apartment building excluding Bill Gates is $587,333 and that Bill Gates has a net worth of 79,100,000,000.

Average size banana

Average size banana

Try it yourself

If you'd like to try the above example for yourself, just create this table:

CREATE TEMP TABLE building (
   id serial,
   name text,
   unit_number text,
   net_worth bigint
);

INSERT INTO building (name, unit_number, net_worth)
VALUES ('Gabriel', '1a', 326000),
       ('Tom', '1b', 885000),
       ('Jason', '1c', 563000),
       ('Bob', '2a', 145000),
       ('Roger', '2b', 517000),
       ('Lucas', '2c', 937000),
       ('Jerome', '3a', 986000),
       ('Andrew', '3b', 705000),
       ('John', '3c', 179000),
       ('Paul', '4a', 863000),
       ('Richard', '4b', 467000),
       ('Joseph', '4c', 475000),
       ('Bill', 'PH', 79100000000);

Regular average

Here is the average net worth in the building as well as the lowest and maximum net worth:

gab@test> SELECT round(AVG(net_worth)) as avg, MIN(net_worth), MAX(net_worth) FROM building;
    avg     |  min   |     max     
------------+--------+-------------
 6085157538 | 145000 | 79100000000
(1 row)

Time: 0.476 ms

Average without the outliers

As you can see above, the average net worth in my apartment building is $608,515,7538 which does not really make sense because 1 tenant is way above that amount and all the other tenants under. We can use standard deviation to remove outliers and get a "trimmed average". In PostgreSQL, the function is called stddev_samp().

The "sample standard deviation" of the net worths is:

gab@test> SELECT STDDEV_SAMP(net_worth) FROM building;
 stddev_samp 
-------------
 21938229866
(1 row)

Time: 0.401 ms

You can, for example, keep only the 95% data closest to the average with this formula: (Average - Standard Deviation * 2) < DATA WE KEEP < (Average + Standard Deviation * 2)

Here is this calculation for our data:

gab@test> SELECT (AVG(net_worth) - STDDEV_SAMP(net_worth) * 2) as lower_bound,
           (AVG(net_worth) + STDDEV_SAMP(net_worth) * 2) as upper_bound
    FROM building;

      lower_bound      |     upper_bound      
-----------------------+----------------------
 -37791302193.53846154 | 49961617270.46153846
(1 row)

Time: 0.546 ms

This means that for our trimmed average, we would consider only the tenants with a net worth between $(37,791,302,193) and $49,961,617,270 which would remove Bill Gates.

Here is the trimmed average:

gab@test> WITH bounds AS (
    SELECT (AVG(net_worth) - STDDEV_SAMP(net_worth) * 2) as lower_bound,
           (AVG(net_worth) + STDDEV_SAMP(net_worth) * 2) as upper_bound
    FROM building
)
SELECT round(AVG(net_worth)) as average
FROM building
WHERE net_worth BETWEEN (SELECT lower_bound FROM bounds) AND (SELECT upper_bound FROM bounds);

 average 
---------
  587333
(1 row)

Time: 0.761 ms

The average net worth of our tenants excluding Bill Gates is $587,333.

In this example, we used "Standard Deviation * 2" to keep only the 95% data closest to the average. You can change this value to remove more or less outliers:

Percentage included in the set Percentage excluded from the set Value
50% 50% 0.674
68% 32% 0.994
68.268% 31.731% 1
80% 20% 1.281
90% 10% 1.644
95% 5% 1.959
95.449% 4.550% 2
99.730% 0.269% 3
99.999% 0.0000% 6

Read more about the rules for normally distributed data.