04 February, 2023
Fast pagination with filtering and ordering in Postgres
As part of our payables product at Nanumo we have a need for pagination, sorting and filtering data throughout the product. There are already many articles that cover keyset pagination with Postgres but they're all quite light on implementation details. It may be the case that the implementation is trivial for most but it took me a few hours to nail it and hopefully my contribution to this problem helps someone else.
The basics of pagination are covered at length elsewhere but the fundamental issue is that offset pagination implemented thusly:
select * from transactions order by created_at desc limit <page_size> offset 100
incurs a performance cost proportional to the size of the offset which can be considerable for large offsets. This is because Postgres has to construct the rowset up to and including the desired page which can be large.
One way around this is to use keyset pagination; instead of indexing pages based on an offset, if the rowset is ordered on a unique column we can use values from that column to index a page. As long as there's a sortable index on the column in question, given a value from the column we can access the corresponding page in constant time.
We do lose out on the ability to jump to page $p$ of the results but for most applications this isn't an issue. Generally users want the first page and I would guess that in general user desire to access page $p$ decreases quickly as $p$ increases.
Keyset pagination works by generating a total ordering over the table by sorting over the column(s) of interest and a
unique column. In Postgres the
order by statements isn't stable and this means that we need to guarantee a total
ordering the result set to ensure users see a stable ordering. We do this by including a unique column as the lowest-priority
column in all
order by statements. The primary key makes a good choice here.
To fetch a page of results we do something like:
select * from transactions where created_at, id >= '<cursor_created_at>', '<cursor_id>' order by created_at, id limit <page_size>
And most guides that I've seen finish here but this leaves open the question of how retrieve cursors for the next and previous pages (in order to power the forward and back buttons on the UI) and how multi-column sorting can be implemented.
It is possible that for most of you these are trivial considerations. At least for me this was not the case and I thought I'd detail the missing pieces in case they're useful for you too. Specifically I wanted to figure out:
1.A fast way to calculate the current page, the cursor position for the next page and the cursor position for the previous page, all in one query 1.Whether I could easily extend the query to support arbitrary sorting and filtering
The basic idea is to fetch the current page and cursor in one CTE (which are adjacent in the rowset), the previous page cursor in another CTE, union the results, label them accordingly (previous_cursor, current_page, next_cursor) and return them, all as part of one query.
The first step is to filter the data in the table. We do not materialize the
filtered_rows expression which forces
the expression to be merged with dependent queries
(see the documentation) for more on this. In doing so we
incur the additional cost of re-calculating the
filtered_rows query for every dependent query but gain the ability to
make use of the indicies on the table. This greatly speeds up the coming operations.
To fetch the cursor we select the
id column along with any columns involved in the sort. In this example we're
created_at but it's trivial to sort on multiple columns if required (more on this later). Note that we
sort by the
id column as lowest priority.
cursor as (select created_at, id from rows where id = ? order by created_at, id limit 1)
Fetching the current page and cursor for the next page is straightforward. The key thing to note is the where clause that selects only the rows that succeed the cursor row in the total ordering and the order and limit clauses which combine to select only the rows immediately subsequent to the cursor, forming the page and the next cursor:
current_page_and_next_cursor as (select * from rows where (created_at, id) >= (select * from cursor) order by created_at, id limit <page_size + 1> )
Fetching the previous page is similarly straightforward. The main things to notice are the reversal of the equality
condition in the where clause (as we want to select the rows that precede the cursor) and the reversal of the
clause to ensure that we select the $n$ rows that precede the cursor (instead of the first $n$ rows in the result set).
We'll select just the first row from this set when we combine the results to yield the cursor for the previous page.
previous_page as (select * from rows where (created_at, id) < (select * from cursor) order by created_at desc, id desc limit <page_size> ),
With the relevant CTEs constructed the final task is to union the rows together and label them:
(select *, 'previous_cursor' as label from previous_page order by created_at, id limit 1) union all (select *, 'current_page' as label from current_page_and_next_cursor order by created_at, id limit <page_size>) union all (select *, 'next_cursor' as label from current_page_and_next_cursor order by created_at, id limit 1 offset <page_size>);
A few things to mention here. Firstly the use of
union all to preserve the ordering of the result set.
Secondly I'm again using offsets and limits to select the single row from the previous_page and
current_page_and_next_cursor CTEs that represents the previous page cursor and next page cursor respectively. Finally
a label is added to identify the relevant rows to the application. Cursors are not returned in cases where either the
previous page doesn't exist (i.e. when the cursor is pointing to the first row in the result set) or the next page
doesn't exist (i.e. when the cursor is pointing at a row within $pageSize$ distance of the last row).
Sorting on multiple columns
Sometimes users might want to sort results on more than one column e.g. by date, then by customer. This is trivial to
support and involves modifying the above query by replacing instances of
order by created_at, id with
order by created_at, customer_id, id. I couldn't figure out a nice way to do this Postgres but if you have to
(and with care taken to sanitise the inputs) this is straightforward to implement with string interpolation when
building the query in the application.
On large tables indices will be required for good performance. For each sortable column, a multicolumn index on
(sortable_column, unique_column) will speed up all queries that sort on it. I've not had time to verify the behaviour
when sorting on a secondary column but a quick look at some query plans suggests that the database is able
to make use of an index on
(secondary_sort_column, id) when running a query that uses 2 columns for sorting.
Let me know if you found this useful (or have any corrections and I'll update and credit accordingly).