16 May, 2023

Fast keyset pagination with filtering and ordering in Postgres

As part of our payables product at Nanumo we have a need for pagination, sorting and filtering data throughout the product. There are already many articles that cover keyset pagination with Postgres but they're all quite light on implementation details. It may be the case that the implementation is trivial for most but it took me a few hours to nail it and hopefully my contribution to this problem helps someone else.

The problem

The basics of pagination are covered at length elsewhere but the fundamental issue is that offset pagination implemented thusly:

select * from transactions
order by created_at desc
limit <page_size> offset 100

incurs a performance cost proportional to the size of the offset which can be considerable for large offsets. This is because Postgres has to construct the rowset up to and including the desired page which can be large.

One way around this is to use keyset pagination; instead of indexing pages based on an offset, if the rowset is ordered on a unique column we can use values from that column to index a page. As long as there's a sortable index on the column in question, given a value from the column we can access the corresponding page in constant time.

We do lose out on the ability to jump to page p of the results but for most applications this isn't an issue. Generally users want the first page and I would guess that in general user desire to access page p decreases quickly as p increases.

Keyset pagination works by generating a total ordering over the table by sorting over the column(s) of interest and a unique column. In Postgres the order by statements isn't stable and this means that we need to guarantee a total ordering the result set to ensure users see a stable ordering. We do this by including a unique column as the lowest-priority column in all order by statements. The primary key makes a good choice here.

To fetch a page of results we do something like:

select * from transactions
    where created_at, id >= '<cursor_created_at>', '<cursor_id>'
order by created_at, id
limit <page_size>

And most guides that I've seen finish here but this leaves open the question of how retrieve cursors for the next and previous pages (in order to power the forward and back buttons on the UI) and how multi-column sorting can be implemented.

It is possible that for most of you these are trivial considerations. At least for me this was not the case and I thought I'd detail the missing pieces in case they're useful for you too. Specifically I wanted to figure out:

1.A fast way to calculate the current page, the cursor position for the next page and the cursor position for the previous page, all in one query 1.Whether I could easily extend the query to support arbitrary sorting and filtering

The basic idea is to fetch the current page and cursor in one CTE (which are adjacent in the rowset), the previous page cursor in another CTE, union the results, label them accordingly (previous_cursor, current_page, next_cursor) and return them, all as part of one query.

The first step is to filter the data in the table. We do not materialize the filtered_rows expression which forces the expression to be merged with dependent queries (see the documentation) for more on this. In doing so we incur the additional cost of re-calculating the filtered_rows query for every dependent query but gain the ability to make use of the indicies on the table. This greatly speeds up the coming operations.

To fetch the cursor we select the id column along with any columns involved in the sort. In this example we're ordering on created_at but it's trivial to sort on multiple columns if required (more on this later). Note that we sort by the id column as lowest priority.

cursor as
    (select created_at, id from rows where id = ? order by created_at, id limit 1)

Fetching the current page and cursor for the next page is straightforward. The key thing to note is the where clause that selects only the rows that succeed the cursor row in the total ordering and the order and limit clauses which combine to select only the rows immediately subsequent to the cursor, forming the page and the next cursor:

current_page_and_next_cursor as
    (select *
        from rows
    where (created_at, id) >= (select * from cursor)
    order by created_at, id
    limit <page_size + 1>
)

Fetching the previous page is similarly straightforward. The main things to notice are the reversal of the equality condition in the where clause (as we want to select the rows that precede the cursor) and the reversal of the order by clause to ensure that we select the n rows that precede the cursor (instead of the first n rows in the result set). We'll select just the first row from this set when we combine the results to yield the cursor for the previous page.

previous_page as
    (select *
        from rows
    where (created_at, id) < (select * from cursor)
    order by created_at desc, id desc
    limit <page_size>
),

With the relevant CTEs constructed the final task is to union the rows together and label them:

(select *, 'previous_cursor' as label from previous_page order by created_at, id limit 1)
union all
(select *, 'current_page' as label from current_page_and_next_cursor order by created_at, id limit <page_size>)
union all
(select *, 'next_cursor' as label from current_page_and_next_cursor order by created_at, id limit 1 offset <page_size>);

A few things to mention here. Firstly the use of union all to preserve the ordering of the result set. Secondly I'm again using offsets and limits to select the single row from the previous_page and current_page_and_next_cursor CTEs that represents the previous page cursor and next page cursor respectively. Finally a label is added to identify the relevant rows to the application. Cursors are not returned in cases where either the previous page doesn't exist (i.e. when the cursor is pointing to the first row in the result set) or the next page doesn't exist (i.e. when the cursor is pointing at a row within pageSize distance of the last row).

Sorting on multiple columns

Sometimes users might want to sort results on more than one column e.g. by date, then by customer. This is trivial to support and involves modifying the above query by replacing instances of order by created_at, id with order by created_at, customer_id, id. I couldn't figure out a nice way to do this Postgres but if you have to (and with care taken to sanitise the inputs) this is straightforward to implement with string interpolation when building the query in the application.

Indicies

On large tables indices will be required for good performance. For each sortable column, a multicolumn index on (sortable_column, unique_column) will speed up all queries that sort on it. I've not had time to verify the behaviour when sorting on a secondary column but a quick look at some query plans suggests that the database is able to make use of an index on (secondary_sort_column, id) when running a query that uses 2 columns for sorting.

In summary

Let me know if you found this useful (or have any corrections and I'll update and credit accordingly).

© 2023 Henry Course