Tuesday, February 26, 2008

Why paging database queries doesn't work

At work, we have several (Java) applications where there is a need to 'page' the data from the database (using hibernate and MySQL). With paging I mean that from all the interesting records, only a few have to be processed or displayed at a time. This requirement often comes from GUI design, but equally often it is required for reasons of performance and memory usage. For example we have this application that generates all kinds of reports at regular intervals based on information it obtains over some remote interface from different remote applications. The amounts of data we have, prevent us from not paging these queries.

The question is how this paging can be implemented. We also require every page to be of the same size (except obviously the last one).

We use hibernate, but the problem is the same when using plain SQL, so I'll stick to SQL in the examples. As a running example, let's assume a one-to-many mapping from a table 'a' to 'b'.
create table a (
   id integer auto_increment, 
   primary key (id));
create table b (
   id integer auto_increment, 
   a_id integer, 
   primary key(id));

A first naive attempt

select * from a 
    inner join b on b.a_id = a.id 
    order by a.id limit x,y;
This query uses the MySQL limit x,y construct, meaning "return y results, starting with result number x".

Unfortunately it doesn't work as expected. Due to the join, the above query returns a result for every b, not for every a. This means the paging will be plain wrong.

A sub query approach

select * from a 
    inner join b on b.a_id = a.id 
    where a.id in
    select id from a order by id limit x,y
In this query, we try to circumvent the problem by paging on a sub query without joins, and then joining in another query for all the results obtained from the paged query. Unfortunately this doesn't work either, because MySQL doesn't allow the limit construct inside a sub query inside an in clause.

Variation on the same idea

The query and sub query can be split into two distinct queries. First we query the primary keys of all a's in a certain page:
select id from a 
    order by id limit x,y;
Then we use that list of primary keys in a second query:
select * from a 
    inner join b on b.a_id = a.id 
    where a.id in 
    (list of ID's from the previous query);
This actually works, but now the size of the second query is proportional to the page size. This has crashed our MySQL clustered database if the pages grow over 1000 records.

A no-join solution

Another obviously working solution is to not join at all:
select * from a 
    order by id limit x,y;
If necessary, the b records can be fetched from the database with subsequent queries. If, in the context of a certain application, all these b records are going to be needed anyway, this has a tremendous performance impact because in stead of only 1 (fairly complex) query, we now have to do n + 1 (simple) queries.

Dropping the fixed page size requirement

If we drop the requirement that every page (except the last one) should be of the same size, we can page the query without relying on the MySQL limit construct. This allows us to page on actual well chosen constraints, rather than on the number of results that happens to be the consequence of a certain set of joins in a query. For example, we could page on the primary key:
select * from a 
    inner join b on b.a_id = a.id 
    where a.id > x and a.id <= x+y;
With an auto increment primary key on a table where deletes are not very frequent, this might yield good results. However, if from time to time many records are being deleted from the database, this might result in smaller or even empty pages.


This leads me to a rather pessimistic conclusion: paged database queries are not generally possible with a single well performing query, at least not with MySQL 5.

No comments: