Google App Engine

Paging through large datasets

Joe Gregorio
December 2008
Updated January 2013 by Danny Hermes

This is part two of a five-part series on effectively scaling your App Engine-based apps. To see the other articles in the series, see Related links.

This article was originally written for SDK version 1.1.7. The method for paging described here has been obsolete as of release 1.3.1. In this release, query cursors (Java | Python) superseded the techniques described below and are now the recommended method for paging through large datasets. A sample that shows paging with cursors is also included with the code for this article.

One of the typical web functions you may need to do is pagination, that is, displaying large datasets one page at a time. If you have too many items to display on a single web page you will display a subset of them, say 20, with a link to the next page of 20 items. For example, the comments on a blog would be paged if there are a lot of comments.

Fetch

Now you may be tempted to use fetch(limit, offset=0) to do your paging, but this has limitations. From the documentation:

Note: fetch() returns a maximum of 1000 results. If more than 1000 entities match the query, and either no limit is specified or a limit larger than 1000 is used, only the first 1000 results are returned by fetch().

So you can use fetch() only if there are going to be less than 1000 entities. Even if you have less than 1000 entities you may still want to avoid it as the time it takes grows as the value of offset increases. From the documentation:

The query has performance characteristics that correspond linearly with the offset amount plus the limit.

Paging on a property

If your application needs more than that then you will need to do something different. We will use a suggestion box as our working example. We will start with a naive implementation and slowly evolve it into a good working example of paging.

PAGE_SIZE = 10

class Suggestion(ndb.Model):
    suggestion = ndb.StringProperty()
    when = ndb.DateTimeProperty(auto_now_add=True)

The first requirement for paging is having an indexed property that you can run an inequality filter against. In this case let's construct a way to page through the suggestions in reverse chronological order. To get the first set of 10 suggestions:

def get(self):
    suggestions = Suggestion.query().order(-Suggestion.when).fetch(PAGE_SIZE)

    # ..render template..

But we need to know if we should provide a next link to the next page if there happens to be enough data for another page, so let's request PAGE_SIZE + 1 entities, and if we actually get back PAGE_SIZE + 1 entities then we know there are enough entities for another page:

def get(self):
    next_result = None
    suggestions = Suggestion.query().order(-Suggestion.when).fetch(PAGE_SIZE + 1)
    if len(suggestions) == PAGE_SIZE + 1:
        next_result = suggestions[-1].when
    suggestions = suggestions[:PAGE_SIZE]

  # ..render template..

Now when we generate our HTML for the page of suggestions we can use the value of next_result to determine whether or not to present a next link. We also need a way to handle a request for the next page, which we do by adding a query parameter bookmark that contains the value of when that we should start the next page at:

def get(self):
    next_result = None
    bookmark = self.request.get('bookmark')
    query = Suggestion.query().order(-Suggestion.when)
    if bookmark:
        query = query.filter(Suggestion.when <= bookmark)
    suggestions = query.fetch(PAGE_SIZE + 1)
    if len(suggestions) == PAGE_SIZE + 1:
        next_result = suggestions[-1].when
        suggestions = suggestions[:PAGE_SIZE]

    # ..render template..

If you are paging and using a property that is unique across all entities then you're done and the above code should work for you. Unfortunately the code presented above has a problem for this particular example because the value of the when field isn't guaranteed to be unique across all the entities, that is, it's possible that more than one suggestion could come in at the exact same time, as in this example:

offset suggestion when
... ... ...
9 Stock Jolt 2008-10-26 04:38:00
10 Allow dogs in the office 2008-10-26 03:35:58
11 Allow cats in the office 2008-10-26 03:35:58
12 Buy some multicolored exercise balls 2008-10-26 01:10:03
13 ... ...

If we present a page with the 10 most recent suggestions you would end up with 2008-10-26 04:38:02 as the place to pick up the next 10 in the list. Unfortunately #10 and #11 have the same value for when and you would get overlap, the item at spot #10 appears as the last element on the first page and also as the first element on the second page. You can see how this problem would get worse the more duplicates there were.

While we want to page in reverse chronological order, the when field may not be unique. What we need is a way to guarantee that it is unique. One way to guarantee uniqueness is to have a single counter that we increment each time we add a Suggestion entity to the datastore. The value of the counter is appended to each when value and guarantees they are unique:

offset suggestion when
... ... ...
9 Stock Jolt 2008-10-26 04:38:00|09
10
Allow dogs in the office 2008-10-26 03:35:58|10
11
Allow cats in the office 2008-10-26 03:35:58|11
12 Buy some multicolored exercise balls 2008-10-26 01:10:03|12
13 ... ...

Let's update our model to reflect this change. We will have a created property which is the date and time that the Suggestion was created and change creation_token to be a StringProperty since it will now contain the date and time concatenated with another string that makes it unique. The creation_token property is the one we will sort by when paginating.

class Suggestion(ndb.Model):
    suggestion = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
    creation_token = ndb.StringProperty()

If the rate at which you add suggestions is low then this is an acceptable solution, but if they are coming in quickly then the counter ends up becoming a bottleneck as it can only be updated so quickly. We need a way to generate uniqueness but also allow quick updates, and if you've read the article on counter sharding then this is a familiar problem with a familiar solution. In this case we are going to shard our counters over the users that are adding suggestions and use the user id as part of the unique value we add to each value of creation_token.

So we need a Contributor:

class Contributor(ndb.Model):
   count = ndb.IntegerProperty(default=0)

    @classmethod
    @ndb.transactional
    def unique_id(cls, email):
        """Increments contributor suggestion count and creates unique ID."""
        contributor = cls.get_by_id(email)
        if contributor == None:
            contributor = cls(id=email)
        contributor.count += 1
        contributor.put()

        return '{}|{:d}'.format(email, contributor.count)

Now this looks good:

offset suggestion creation_token
... ... ...
9 Stock Jolt 2008-10-26 04:38:00|joe@bitworking.org|1
10 Allow dogs in the office 2008-10-26 03:35:58|fred@example.org|1
11 Allow cats in the office 2008-10-26 03:35:58|joe@bitworking.org|2
12 Buy some multicolored exercise balls 2008-10-26 01:10:03|joe@bitworking.org|3
13 ... ...

But we do have one more problem, and that is that we now expose the email address of the person making the suggestion in our creation_token value, which will get exposed in our next link URI. To get around that final problem we can obfuscate the data we append to each creation_token value by putting it through an MD5 hash:

    ...
    @classmethod
    @ndb.transactional
    def unique_id(cls, email):
        """Increments contributor suggestion count and creates unique ID."""
        ...
        md5_hash = hashlib.md5('{}|{:d}'.format(email, contributor.count))
        return md5_hash.hexdigest()
offset suggestion creation_token
... ... ...
9 Stock Jolt 2008-10-26 04:38:00|aee15ab24b7b3718596e3acce04fba85
10 Allow dogs in the office 2008-10-26 03:35:58|404a3235076f6651914358680acf3cb5
11 Allow cats in the office 2008-10-26 03:35:58|7574b989df099d4e2b95619c9cf0c2a0
12 Buy some multicolored exercise balls 2008-10-26 01:10:03|c675e87cc990a718979afecc93a77bc1
13 ... ...

Code

All the source for this article can be found in the in our samples repository, and it contains a full example of creating a unique index and also a full example of paging with cursors, which was enabled as of release 1.3.1.

Summary

In summary, to do paging you need a property that has a unique value for each entity and is indexed in the direction you want to page. Always fetch N + 1 entities to determine in you have enough data for a next page. And finally, if you don't have uniqueness in the property then you can use a single counter, or counters sharded over some attribute such as the user, to make that property unique.

Further Reading

Authentication required

You need to be signed in with Google+ to do that.

Signing you in...

Google Developers needs your permission to do that.