Google App Engine

Just Overheard It: An example application

Joe Gregorio
February 13, 2008

Overview

There's plenty of sites that keep track of the most popular of something on a daily basis, such as delicious, digg, reddit, flickr, etc. They keep track of the days most popular items, where popularity is recorded implicitly, recording a link in delicious, or explicitly, voting up and down on reddit. One of the interesting aspects of these sites is "decay", the drifting down in the rankings of older items so newer items can rise to the top. Overheard is an example application for Google App Engine that demonstrates both voting and decay, using amusing quotes as the subject. It also demonstrates optimizations such as using memcache and the AJAX Libraries API. Obviously not a world rocking application, but the principles outlined can be applied to many different kinds of applications.

Our basic model will be to have Quotes, which contain a string for the quotation, and Votes, which contain the user name and vote for a particular user.

Here are the models for the program:

Models

class Quote(db.Model):
    """Storage for a single quote and its metadata

    Properties
      quote:          The quote as a string
      uri:            An optional URI that is the source of the quotation
      rank:           A calculated ranking based on the number of votes and
                         when the quote was added.
      created:        When the quote was created, recorded in the number of
                         days since the beginning of our local epoch.
      creation_order: Totally unique index on all quotes in order of their
                         creation.
      creator:        The user that added this quote.
    """
    quote = db.StringProperty(required=True, multiline=True)
    uri   = db.StringProperty()
    rank = db.StringProperty()
    created = db.IntegerProperty(default=0)
    creation_order = db.StringProperty(default=" ")
    votesum = db.IntegerProperty(default=0)
    creator = db.UserProperty()


class Vote(db.Model):
    """Storage for a single vote by a single user on a single quote.


    Index
      key_name: The email address of the user that voted.
      parent:   The quote this is a vote for.

    Properties
      vote: The value of 1 for like, -1 for dislike.
    """
    vote = db.IntegerProperty(default=0)


class Voter(db.Model):
    """Storage for metadata about each user

    Properties
      count:          An integer that gets incremented with users addition of a quote.
                        Used to build a unique index for quote creation.
      hasVoted:       Has this user ever voted on a quote.
      hasAddedQuote:  Has this user ever added a quote.
    """
    count = db.IntegerProperty(default=0)
    hasVoted = db.BooleanProperty(default=False)
    hasAddedQuote = db.BooleanProperty(default=False)

Votes

One of the most common optimizations done in App Engine to create a performant application is to precompute values at write time as opposed to read time. We will do that in this case and will calculate the vote total and rank for a Quote each time a vote is cast. That slows down writes just a little, but vastly speeds up reading, and makes our searches possible. That is, if we want to present a page with the Quotes ordered by their ranking, we need a rank property that is up to date to sort on. If we want to keep the vote count consistent we are going to have to add in a vote and modify the sum of votes for that quote in the same transaction. For that to work both the quote and the vote need to be in the same entity group, and we accomplish that by making each vote a child of its associated quote. There's more details in the documentation on entity groups and transactions.

With each vote a child of its associated quote we can put voting into a transaction where the Quote.votesum and Quote.rank are updated at the same time that the users vote is recorded.

models.set_vote()

def set_vote(quote_id, user, newvote):
    """
    Record 'user' casting a 'vote' for a quote with an id of 'quote_id'.
    The 'newvote' is usually an integer in [-1, 0, 1].
    """
    if user is None:
        return
    email = user.email()

    def txn():
        quote = Quote.get_by_id(quote_id)
        vote = Vote.get_by_key_name(key_names = user.email(), parent = quote)
        if vote is None:
            vote = Vote(key_name = user.email(), parent = quote)
        if vote.vote == newvote:
            return
        quote.votesum = quote.votesum - vote.vote + newvote
        vote.vote = newvote
        # See the docstring of main.py for an explanation of
        # the following formula.
        quote.rank = "%020d|%s" % (
            long(quote.created * DAY_SCALE + quote.votesum),
            quote.creation_order
          )
        db.put([vote, quote])
        memcache.set("vote|" + user.email() + "|" + str(quote_id), vote.vote)

    db.run_in_transaction(txn)
    _set_progress_hasVoted(user)

Note that we need to specify the parent of the vote when querying for it, and we also need to specify the parent when creating a Vote.

Decay

The value of the Quote.rank property is what we are focused on now. The rank is what we will sort by to show the most popular items and it will need to be calculated in a way that allows newer items to eventually float to the top of the list even though older items may have received more votes, otherwise our site will become very boring very fast. We don't have background tasks that can go back and adjust the current rankings of quotes over time, so we need a way to rank quotes that puts fresher quotes higher in the ranking in a scalable way that will work for a long period of time.

The insight here is that we can add weight to a vote based on when the Quote was added to the system. The rank for each quote is calculated as:

rank = created * DAY_SCALE + votesum

Where the variables have the following meanings:

created
Number of days after 1/1/2008 that the Quote was created.
votesum
Sum of all +1 and -1 votes for a quote.
DAY_SCALE
This is a constant that determines how quickly votes should decay (defaults to 4). Another way to think about this is how many votes one more day is worth.

For 'created' we just pick a date in the past as our day zero and start counting forward from there. DAY_SCALE with a value of four means that a vote for a Quote one day fresher will be worth four more votes. That is, if Quote 1 and 2 have the same number of votes but Quote 2 was added a day later then its rank will be 4 larger.

Does this work? Presume the following scenario:

Day 1

 Quote 0 and 1 are added on Day 1 and
 get 5 and 3 votes respectively.
 Order by rank is [Quote 0, Quote 1].

 Quote Votes Rank
 --------------------------------
 0 (5) 1 * 4 + 5 = 9
 1 (3) 1 * 4 + 3 = 7

Day 2

 Quote 0 and 1 get 3 and 0 more votes respectively.
 Quote 2 is added and gets 3 votes.
 Order by rank is now [Quote 0, Quote 2, Quote 1]

 Quote Votes Rank
 --------------------------------------
 0 (5) + (3) 1 * 4 + 8 = 12
 1 (3) + (0) 1 * 4 + 3 = 7
 2 (3) 2 * 4 + 3 = 11

Day 3

 Quote 2 gets two more vote. Quote 3 is added and gets 5 votes.
 Order by rank is [Quote 3, Quote 2, Quote 0, Quote 1]

 Quote Votes Rank
 -------------------------------------------
 0 (5) + (3) 1 * 4 + 8 = 12
 1 (3) + (0) 1 * 4 + 3 = 7
 2 (3) + (2) 2 * 5 + 4 = 14
 3 (5) 3 * 4 + 5 = 17

Note that ties are possible, which means that rank for quotes will have to be disambiguated since the application allows paging of ranked quotes. That means for each rank we need to append information that will make it unique. In this case we will keep one counter per user and increment it each time that user adds a new quote to the system. The rank plus the user id and counter value is guaranteed to be unique for each quote.

Here is the code for generating a unique value based on the user:

models._unique_user()

def _unique_user(user):
    """
    Creates a unique string by using an increasing
    counter sharded per user. The resulting string
    is hashed to keep the users email address private.
    """

    def txn():
        voter = _get_or_create_voter(user)
        voter.count += 1
        voter.hasAddedQuote = True
        voter.put()
        return voter.count

    count = db.run_in_transaction(txn)

    return hashlib.md5(user.email() + "|" + str(count)).hexdigest()

For more information on paging see the article "How-To Do Paging on App Engine."

Contention

One of the most important design principles to keep in mind in building a performant App Engine application is to avoid contention, trying to update a single entity or entity group too quickly. Any single entity or entity group can only be updated about five times a second. Since a single quote and its associated votes form a single entity group we only need to consider how often any one quote is updated when looking for contention. In this application we'll assume that on the high side there will be a couple thousand votes per quote over the course of a day, which averages out to much less than five votes per second, so we are in good shape.

Using memcache

There is a memcache API that is very useful for speeding up an application and avoiding hitting the datastore. One of the most common queries in the Overhead application is pulling out each user's votes for a list of quotes so they are reflected in the user interface. We can cache the results in memcache and once there we avoid a datastore hit as long as the data is available in the cache. Here is the code that handles looking up votes and caching the results:

def voted(quote, user):
    """Returns the value of a users vote on the specified quote, a value in [-1, 0, 1]."""
    val = 0
    if user:
        memcachekey = "vote|" + user.email() + "|" + str(quote.key().id())
        val = memcache.get(memcachekey)
        if val is not None:
            return val
        vote = Vote.get_by_key_name(key_names = user.email(), parent = quote)
        if vote is not None:
            val = vote.vote
            memcache.set(memcachekey, val)
    return val

AJAX Libraries API

There are other optimizations that you can also do to save your application bandwidth. One of the easiest is to have Google host your Javascript library for you. The interactive part of the web pages for Overheard is done using jQuery and instead of adding the jQuery source to the application I employed Google's AJAX Libraries API. The AJAX Libraries API is a content distribution network and loading architecture for the most popular, open source JavaScript libraries. While this application uses jQuery you can also benefit if you use Prototype, script.aculo.us, Dojo, or any of the other JavaScript libraries in the growing list that the AJAX Libraries API supports.

HTML5

The last optimization you can make is to create a clean user interface for your application in HTML 5. Doing so will allow your application to run not only on desktops but also on the latest mobile devices. That saves you effort from having to build a custom client for multiple mobile devices, and that should be a huge time savings.

Summary

The Apache 2 licensed source for this application is available in the appengine-overheard-python project on GitHub. A live version of this sample application is running at http://just-overheard-it.appspot.com.

Authentication required

You need to be signed in with Google+ to do that.

Signing you in...

Google Developers needs your permission to do that.