Weighted ntile with PostgreSQL

Published on Mar. 27, 2017 by Gabriel Bordeaux

Introduction

The PostgreSQL window function ntile is a ranking function dedicated to split data into a specified number of groups. It can allow you to calculate easily quartiles or deciles for example. In some specific situations, you might want to give a weight to values in order to rank data from high to low. In the example below we will rank our customers from "very big" to "very small".

As shown: invisible ntile on a scale

As shown: invisible ntile on a scale

Try it yourself

If you'd like to try the above example for yourself, just create this table:

CREATE TEMP TABLE customers (name text, number_of_orders int);

INSERT INTO customers (name, number_of_orders)
VALUES ('Adobe Systems Inc', 1), ('ADT Corp', 1), ('Allstate Corp', 1),
       ('Amazon.com Inc', 2), ('American Express Co', 2), ('AT&T Inc', 2),
       ('Bed Bath & Beyond', 2), ('Boeing Company', 2), ('Charles Schwab Corporation', 2),
       ('Delta Air Lines', 7), ('Dollar Tree', 8), ('eBay Inc.', 9), 
       ('Family Dollar Stores', 10), ('FedEx Corporation', 11), ('Ford Motor', 10),
       ('Garmin Ltd.', 10), ('General Electric', 10), ('General Mills', 10),
       ('Harley-Davidson', 10), ('Johnson & Johnson', 17), ('Keurig Green Mountain', 18),
       ('Macy''s Inc.', 19), ('McDonald''s Corp.', 20), ('MetLife Inc.', 240),
       ('Morgan Stanley', 24000);

Using ntile

Our goal is to rank our customers from 1 to 5 :

  • 1 = very small
  • 2 = small
  • 3 = normal
  • 4 = big
  • 5 = very big We can use the PostgreSQL window function ntile to split data:
gab@test> SELECT name, number_of_orders, ntile(5) OVER(ORDER BY number_of_orders)
FROM customers
ORDER BY number_of_orders;

            name            | number_of_orders | ntile 
----------------------------+------------------+-------
 Adobe Systems Inc          |                1 |     1
 ADT Corp                   |                1 |     1
 Allstate Corp              |                1 |     1
 Amazon.com Inc             |                2 |     1
 American Express Co        |                2 |     1
 AT&T Inc                   |                2 |     2
 Bed Bath & Beyond          |                2 |     2
 Boeing Company             |                2 |     2
 Charles Schwab Corporation |                2 |     2
 Delta Air Lines            |                7 |     2
 Dollar Tree                |                8 |     3
 eBay Inc.                  |                9 |     3
 Family Dollar Stores       |               10 |     3
 Ford Motor                 |               10 |     3
 Garmin Ltd.                |               10 |     3
 General Electric           |               10 |     4
 General Mills              |               10 |     4
 Harley-Davidson            |               10 |     4
 FedEx Corporation          |               11 |     4
 Johnson & Johnson          |               17 |     4
 Keurig Green Mountain      |               18 |     5
 Macy's Inc.                |               19 |     5
 McDonald's Corp.           |               20 |     5
 MetLife Inc.               |              240 |     5
 Morgan Stanley             |            24000 |     5
(25 rows)

ntile() is a great function, however, in this case it does not really help us. As you can see above, some customers with the same number of order are rated "very small" (1) and "small" (2) as well as "normal" and "big". It's normal, ntile's goal is to distribute an even number of rows in each group. In our example, each group as 5 customers. We will use the function weighted_ntile() to better answer our need.

Using weighted_ntile

You first need to create the function weighted_ntile():

CREATE OR REPLACE FUNCTION weighted_ntile(numeric, numeric[], int)
RETURNS integer AS $$
DECLARE
  current_value ALIAS FOR $1;
  array_of_values ALIAS FOR $2;
  number_of_groups ALIAS FOR $3;
  result integer;
BEGIN
         WITH array_unnest AS (
              SELECT unnest(array_of_values) as a
         ), stddev_values as (
              SELECT (AVG(a) - STDDEV_SAMP(a) * 2) as low_stddev,
                     (SELECT (AVG(a) + STDDEV_SAMP(a) * 2) FROM array_unnest) as high_stddev
              FROM array_unnest
         ), bounds AS (
              SELECT MIN(a) as lower_bound,
                     (SELECT MAX(a) FROM array_unnest WHERE a <= (SELECT high_stddev FROM stddev_values) AND a <> 0) as upper_bound
              FROM array_unnest
              WHERE a >= (SELECT low_stddev FROM stddev_values)
         ), indexes AS (
              SELECT a,
                     CASE WHEN a < (SELECT low_stddev FROM stddev_values) THEN
                               1.00
                               * (SELECT low_stddev FROM stddev_values)
                               / (SELECT abs(upper_bound) FROM bounds)
                               * 100
                          WHEN a <= (SELECT high_stddev FROM stddev_values) THEN
                               1.00
                               * a
                               / (SELECT abs(upper_bound) FROM bounds)
                               * 100
                          ELSE 100
                     END as index
              FROM array_unnest
         ), tiles AS (
              SELECT DISTINCT(round(index)) as base, NTILE(number_of_groups) OVER (ORDER BY round(index)) as tile
              FROM indexes
              GROUP BY round(index)
         )
         SELECT tile into result
         FROM tiles
         WHERE base = (SELECT round(index) FROM indexes WHERE base = round(index) AND a = current_value LIMIT 1);
         
         RETURN result;
END;
$$ LANGUAGE plpgsql;

The function works with the following arguments:

SELECT weighted_ntile(CURRENT_VALUE, ARRAY_OF_ALL_THE_VALUES, NUMBER_OF_GROUPS);

If we want to know the rank (out of 5) of a customer with 4 orders, we would use the following query:

gab@test> SELECT weighted_ntile(19, (SELECT array_agg(number_of_orders) FROM customers), 5);
 weighted_ntile 
----------------
              4
(1 row)

To find out the rank of every customer, we would use:

gab@test> SELECT name, number_of_orders, ntile(5) OVER(ORDER BY number_of_orders), 
       weighted_ntile(number_of_orders, (SELECT array_agg(number_of_orders) FROM customers), 5)
FROM customers
ORDER BY number_of_orders;

            name            | number_of_orders | ntile | weighted_ntile 
----------------------------+------------------+-------+----------------
 Adobe Systems Inc          |                1 |     1 |              1
 ADT Corp                   |                1 |     1 |              1
 Allstate Corp              |                1 |     1 |              1
 Amazon.com Inc             |                2 |     1 |              1
 American Express Co        |                2 |     1 |              1
 AT&T Inc                   |                2 |     2 |              1
 Bed Bath & Beyond          |                2 |     2 |              1
 Boeing Company             |                2 |     2 |              1
 Charles Schwab Corporation |                2 |     2 |              1
 Delta Air Lines            |                7 |     2 |              2
 Dollar Tree                |                8 |     3 |              2
 eBay Inc.                  |                9 |     3 |              2
 Family Dollar Stores       |               10 |     3 |              2
 Ford Motor                 |               10 |     3 |              2
 Garmin Ltd.                |               10 |     3 |              2
 General Electric           |               10 |     4 |              2
 General Mills              |               10 |     4 |              2
 Harley-Davidson            |               10 |     4 |              2
 FedEx Corporation          |               11 |     4 |              3
 Johnson & Johnson          |               17 |     4 |              3
 Keurig Green Mountain      |               18 |     5 |              4
 Macy's Inc.                |               19 |     5 |              4
 McDonald's Corp.           |               20 |     5 |              4
 MetLife Inc.               |              240 |     5 |              5
 Morgan Stanley             |            24000 |     5 |              5

(25 rows)

You can see below the value of "ntile" just to allow you a quick comparison. The new ranking is in weighted_ntile. As you can see, all of our customers with 2 orders or less are rated "very small" (weighted_ntile = 1), the customers between 7 and 10 orders are rated "small" (weighted_ntile = 2) etc.

What can I do with the source code of the weighted_ntile() function?

What ever you want. (learn more) Would you like to improve it? If you want, send me the updated version and if you want, I'll publish it here. (contact me)