Window functions are one of another SQL features that we'll probably discover during the work with data-oriented application. They can be also used in more classical programs though.
Data Engineering Design Patterns
Looking for a book that defines and solves most common data engineering problems? I'm currently writing
one on that topic and the first chapters are already available in π
Early Release on the O'Reilly platform
I also help solve your data engineering problems π contact@waitingforcode.com π©
This post describes the feature of window functions in PostgreSQL database. The first section explains the general idea behind them. The second part describes the example used in our tests. The next sections list one-by-one the methods related to the window functions and shows how they can be used against some real-world-like examples.
Window functions defined
The window functions can be explained to a programmer very easily thanks to the similarities between maps data structures (Python dictionaries) and for loops. Translated to a pseudo-code it could look like:
rows = Map(KeyType, ValueType) for key, values in rows: for value in values: // β¦
That is, the window function groups the rows by some keys and iterates over them by retaining rows separate identities. As you'll discover in the 3rd section, the rows are linked through window-specific functions allowing us to access either previous iterated value, the first iterated value or even the distance from the beginning. We tell that the iterated values are scoped to the window frame.
The use of window function consists on 2 keywords: OVER and PARTITION BY. For instance, if we'd like to iterate over the values grouped by the zip_code column and compute an average age of its residents, we'd use the following instruction: ...AVG(age) OVER (PARTITION BY zip_code) FROM residents_list. As you can see, there is no magic and the definition is pretty straightforward.
But what happens if we'd like to use the same window in more than 1 place ? In order to not repeat ourselves we can use the feature called named window functions. Thanks to it we can define one window at the end of the SQL statement and reference it in different places before the definition. Such window is specified with WINDOW $name instruction. The parts using this window can still reference it with OVER keyword. The difference is that PARTITION BY is replaced by the value associated to $name variable. For instance, if we'd like to reuse a window function for our average age problem, we'd write the following: ...AVG(age) OVER zip_code_window ...WINDOW zip_code AS (PARTITION BY zip_code).
Code used in the examples
Below you can find the query creating the data used in the tests. We'll use the same Docker's PostgreSQL image as in the case of the post about correlated subqueries:
CREATE TYPE user_actions AS ENUM ('left_click', 'hover', 'select', 'scroll', 'inactive', 'right_click'); CREATE TABLE actions ( user_id INT NOT NULL, action user_actions, page VARCHAR(100) NOT NULL, action_time TIMESTAMP NOT NULL, PRIMARY KEY(user_id, action, action_time) ); INSERT INTO actions (user_id, action, page, action_time) VALUES (1, 'inactive', 'index.html', TIMESTAMP '2018-01-01 15:00:00'), (1, 'hover', 'index.html', TIMESTAMP '2018-01-01 15:00:45'), (1, 'left_click', 'index.html', TIMESTAMP '2018-01-01 15:01:15'), (1, 'select', 'index.html', TIMESTAMP '2018-01-01 15:01:20'), (1, 'right_click', 'index.html', TIMESTAMP '2018-01-01 15:01:30'), (1, 'scroll', 'index.html', TIMESTAMP '2018-01-01 15:02:15'), (1, 'left_click', 'index.html', TIMESTAMP '2018-01-01 15:02:30'), (1, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:15'), (1, 'scroll', 'about_us.html', TIMESTAMP '2018-01-01 15:05:00'), (2, 'right_click', 'about_us.html', TIMESTAMP '2018-01-01 15:03:00'), (2, 'right_click', 'about_us.html', TIMESTAMP '2018-01-01 15:03:20'), (2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:30'), (2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:40'), (2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:03:50'), (2, 'inactive', 'about_us.html', TIMESTAMP '2018-01-01 15:04:00'), (2, 'left_click', 'about_us.html', TIMESTAMP '2018-01-01 15:04:05'), (2, 'left_click', 'about_us.html', TIMESTAMP '2018-01-01 15:04:15');
As you can see, in the next section's examples we'll try to write some sessionization-like queries making an insight on user activity on our website.
Window function examples
Because of the readability reasons, the examples are defined in the following list where each bullet represents one of available functions:
- row_number() - represents the number of given row within its partition (group). For instance, if we'd assign a number for all actions made by given user, we could write:
correlated_subqueries=# SELECT row_number() OVER (PARTITION BY user_id ORDER BY action_time ASC) AS action_number, user_id, action, page FROM actions;
And the result would be:
action_number | user_id | action | page ---------------+---------+-------------+--------------- 1 | 1 | inactive | index.html 2 | 1 | hover | index.html 3 | 1 | left_click | index.html 4 | 1 | select | index.html 5 | 1 | right_click | index.html 6 | 1 | scroll | index.html 7 | 1 | left_click | index.html 8 | 1 | inactive | about_us.html 9 | 1 | scroll | about_us.html 1 | 2 | right_click | about_us.html 2 | 2 | right_click | about_us.html 3 | 2 | inactive | about_us.html 4 | 2 | inactive | about_us.html 5 | 2 | inactive | about_us.html 6 | 2 | inactive | about_us.html 7 | 2 | left_click | about_us.html 8 | 2 | left_click | about_us.html (17 rows)
- lag(column, offset), lead(column, offset) - returns the offset column's value from the current row. For instance, if the current row is the 3rd row within partition and the offset is defined to 2, then lag will return the 1st row of given group.
The opposite function to the lag is lead. As you can deduce it gets the offset row after the current one.
In our sessionization example we'd use both in order to see the difference between 2 actions:correlated_subqueries=# SELECT action_time, lead(action_time, 1) OVER session_window - action_time diff_with_next, correlated_subqueries-# action_time - lag(action_time, 1) OVER session_window AS diff_with_previous, correlated_subqueries-# user_id, action, page FROM actions WINDOW session_window AS (PARTITION BY user_id ORDER BY action_time ASC); action_time | diff_with_next | diff_with_previous | user_id | action | page ---------------------+----------------+--------------------+---------+-------------+--------------- 2018-01-01 15:00:00 | 00:00:45 | | 1 | inactive | index.html 2018-01-01 15:00:45 | 00:00:30 | 00:00:45 | 1 | hover | index.html 2018-01-01 15:01:15 | 00:00:05 | 00:00:30 | 1 | left_click | index.html 2018-01-01 15:01:20 | 00:00:10 | 00:00:05 | 1 | select | index.html 2018-01-01 15:01:30 | 00:00:45 | 00:00:10 | 1 | right_click | index.html 2018-01-01 15:02:15 | 00:00:15 | 00:00:45 | 1 | scroll | index.html 2018-01-01 15:02:30 | 00:00:45 | 00:00:15 | 1 | left_click | index.html 2018-01-01 15:03:15 | 00:01:45 | 00:00:45 | 1 | inactive | about_us.html 2018-01-01 15:05:00 | | 00:01:45 | 1 | scroll | about_us.html 2018-01-01 15:03:00 | 00:00:20 | | 2 | right_click | about_us.html 2018-01-01 15:03:20 | 00:00:10 | 00:00:20 | 2 | right_click | about_us.html 2018-01-01 15:03:30 | 00:00:10 | 00:00:10 | 2 | inactive | about_us.html 2018-01-01 15:03:40 | 00:00:10 | 00:00:10 | 2 | inactive | about_us.html 2018-01-01 15:03:50 | 00:00:10 | 00:00:10 | 2 | inactive | about_us.html 2018-01-01 15:04:00 | 00:00:05 | 00:00:10 | 2 | inactive | about_us.html 2018-01-01 15:04:05 | 00:00:10 | 00:00:05 | 2 | left_click | about_us.html 2018-01-01 15:04:15 | | 00:00:10 | 2 | left_click | about_us.html (17 rows)
-
first_value() and last_value() - returns, correspondingly, the first value within given window frame and the last seen value within this frame. The seen is quite important here because, unlike we could think at first glance, the last_value() won't return the last value of the partition.
The following code shows this subtility:
correlated_subqueries=# SELECT action_time, LAST_VALUE(action_time) OVER session_window, first_value(action_time) OVER session_window AS first_value, user_id, action, page FROM actions correlated_subqueries-# WINDOW session_window AS (PARTITION BY user_id ORDER BY action_time ASC); action_time | last_value | first_value | user_id | action | page ---------------------+---------------------+---------------------+---------+-------------+--------------- 2018-01-01 15:00:00 | 2018-01-01 15:00:00 | 2018-01-01 15:00:00 | 1 | inactive | index.html 2018-01-01 15:00:45 | 2018-01-01 15:00:45 | 2018-01-01 15:00:00 | 1 | hover | index.html 2018-01-01 15:01:15 | 2018-01-01 15:01:15 | 2018-01-01 15:00:00 | 1 | left_click | index.html 2018-01-01 15:01:20 | 2018-01-01 15:01:20 | 2018-01-01 15:00:00 | 1 | select | index.html 2018-01-01 15:01:30 | 2018-01-01 15:01:30 | 2018-01-01 15:00:00 | 1 | right_click | index.html 2018-01-01 15:02:15 | 2018-01-01 15:02:15 | 2018-01-01 15:00:00 | 1 | scroll | index.html 2018-01-01 15:02:30 | 2018-01-01 15:02:30 | 2018-01-01 15:00:00 | 1 | left_click | index.html 2018-01-01 15:03:15 | 2018-01-01 15:03:15 | 2018-01-01 15:00:00 | 1 | inactive | about_us.html 2018-01-01 15:05:00 | 2018-01-01 15:05:00 | 2018-01-01 15:00:00 | 1 | scroll | about_us.html 2018-01-01 15:03:00 | 2018-01-01 15:03:00 | 2018-01-01 15:03:00 | 2 | right_click | about_us.html 2018-01-01 15:03:20 | 2018-01-01 15:03:20 | 2018-01-01 15:03:00 | 2 | right_click | about_us.html 2018-01-01 15:03:30 | 2018-01-01 15:03:30 | 2018-01-01 15:03:00 | 2 | inactive | about_us.html 2018-01-01 15:03:40 | 2018-01-01 15:03:40 | 2018-01-01 15:03:00 | 2 | inactive | about_us.html 2018-01-01 15:03:50 | 2018-01-01 15:03:50 | 2018-01-01 15:03:00 | 2 | inactive | about_us.html 2018-01-01 15:04:00 | 2018-01-01 15:04:00 | 2018-01-01 15:03:00 | 2 | inactive | about_us.html 2018-01-01 15:04:05 | 2018-01-01 15:04:05 | 2018-01-01 15:03:00 | 2 | left_click | about_us.html 2018-01-01 15:04:15 | 2018-01-01 15:04:15 | 2018-01-01 15:03:00 | 2 | left_click | about_us.html (17 rows)
-
nth_value(column, row_number) - it's similar to the lag and lead except the fact that it always returns the same row those number is passed in row_number parameter. An example of the use can be:
correlated_subqueries=# SELECT nth_value(action, 2) OVER session_window AS second_action, user_id, action, page FROM actions correlated_subqueries-# WINDOW session_window AS (PARTITION BY user_id ORDER BY action_time ASC); second_action | user_id | action | page ---------------+---------+-------------+--------------- | 1 | inactive | index.html hover | 1 | hover | index.html hover | 1 | left_click | index.html hover | 1 | select | index.html hover | 1 | right_click | index.html hover | 1 | scroll | index.html hover | 1 | left_click | index.html hover | 1 | inactive | about_us.html hover | 1 | scroll | about_us.html | 2 | right_click | about_us.html right_click | 2 | right_click | about_us.html right_click | 2 | inactive | about_us.html right_click | 2 | inactive | about_us.html right_click | 2 | inactive | about_us.html right_click | 2 | inactive | about_us.html right_click | 2 | left_click | about_us.html right_click | 2 | left_click | about_us.html (17 rows)
-
rank() and dense_rank() - both attribute the row numbers to rows within partition. The difference is that in rank() the "holes" can be produced. For instance, if 2 rows have the same partition key, then both will have the same rank. In consequence the next (3rd) row will be $lastRank + 2 . In the other side, dense_rank() doesn't leave the "holes" and guarantees the sequence increasing every time by 1. The difference is shown in the following query:
correlated_subqueries=# SELECT dense_rank() OVER session_window AS dense_rank_example, rank() OVER session_window AS rank_example, user_id, action, page FROM actions WINDOW session_window AS (PARTITION BY page ORDER BY user_id ASC); dense_rank_example | rank_example | user_id | action | page --------------------+--------------+---------+-------------+--------------- 1 | 1 | 1 | inactive | about_us.html 1 | 1 | 1 | scroll | about_us.html 2 | 3 | 2 | left_click | about_us.html 2 | 3 | 2 | right_click | about_us.html 2 | 3 | 2 | right_click | about_us.html 2 | 3 | 2 | inactive | about_us.html 2 | 3 | 2 | inactive | about_us.html 2 | 3 | 2 | inactive | about_us.html 2 | 3 | 2 | inactive | about_us.html 2 | 3 | 2 | left_click | about_us.html 1 | 1 | 1 | select | index.html 1 | 1 | 1 | right_click | index.html 1 | 1 | 1 | scroll | index.html 1 | 1 | 1 | left_click | index.html 1 | 1 | 1 | left_click | index.html 1 | 1 | 1 | hover | index.html 1 | 1 | 1 | inactive | index.html (17 rows)
-
ntile(number) - this function tries to split the data within given partition in even manner. For instance, if the partition has 10 rows and we decide to split it in 2 different subsets (=number=2), then ntile will take 5 first and last rows and assign them either to 1 or 2 subset. You can take a look at this example:
correlated_subqueries=# SELECT ntile(2) OVER (PARTITION BY user_id ORDER BY action_time ASC) AS subset, user_id, action, page FROM actions; subset | user_id | action | page --------+---------+-------------+--------------- 1 | 1 | inactive | index.html 1 | 1 | hover | index.html 1 | 1 | left_click | index.html 1 | 1 | select | index.html 1 | 1 | right_click | index.html 2 | 1 | scroll | index.html 2 | 1 | left_click | index.html 2 | 1 | inactive | about_us.html 2 | 1 | scroll | about_us.html 1 | 2 | right_click | about_us.html 1 | 2 | right_click | about_us.html 1 | 2 | inactive | about_us.html 1 | 2 | inactive | about_us.html 2 | 2 | inactive | about_us.html 2 | 2 | inactive | about_us.html 2 | 2 | left_click | about_us.html 2 | 2 | left_click | about_us.html (17 rows)
The window functions are quite powerful SQL feature that can be used in all kind of tasks needing some ordered sequencing. The first part shown how to define them in SQL. As we could see, the operators like OVER and PARTITION BY are the keys of the definition. We also discovered that one window can be reused thanks to the named window functions. Next, after discovering the tested dataset in the 2nd part, the 3rd one shown some of methods provided with window functions. We could learn there that window functions made possible the reading of previous and next values relative to currently read row.