February 13, 2008
There's plenty of sites that keep track of the most popular of something on a daily basis, such as delicious, digg, reddit, flickr, etc. They keep track of the days most popular items, where popularity is recorded implicitly, recording a link in delicious, or explicitly, voting up and down on reddit. One of the interesting aspects of these sites is "decay", the drifting down in the rankings of older items so newer items can rise to the top. Overheard is an example application for Google App Engine that demonstrates both voting and decay, using amusing quotes as the subject. It also demonstrates optimizations such as using memcache and the AJAX Libraries API. Obviously not a world rocking application, but the principles outlined can be applied to many different kinds of applications.
Our basic model will be to have Quotes, which contain a string for the quotation, and Votes, which contain the user name and vote for a particular user.
Here are the models for the program:
class Quote(db.Model): """Storage for a single quote and its metadata Properties quote: The quote as a string uri: An optional URI that is the source of the quotation rank: A calculated ranking based on the number of votes and when the quote was added. created: When the quote was created, recorded in the number of days since the beginning of our local epoch. creation_order: Totally unique index on all quotes in order of their creation. creator: The user that added this quote. """ quote = db.StringProperty(required=True, multiline=True) uri = db.StringProperty() rank = db.StringProperty() created = db.IntegerProperty(default=0) creation_order = db.StringProperty(default=" ") votesum = db.IntegerProperty(default=0) creator = db.UserProperty() class Vote(db.Model): """Storage for a single vote by a single user on a single quote. Index key_name: The email address of the user that voted. parent: The quote this is a vote for. Properties vote: The value of 1 for like, -1 for dislike. """ vote = db.IntegerProperty(default=0) class Voter(db.Model): """Storage for metadata about each user Properties count: An integer that gets incremented with users addition of a quote. Used to build a unique index for quote creation. hasVoted: Has this user ever voted on a quote. hasAddedQuote: Has this user ever added a quote. """ count = db.IntegerProperty(default=0) hasVoted = db.BooleanProperty(default=False) hasAddedQuote = db.BooleanProperty(default=False)
One of the most common optimizations done in App Engine to create a
performant application is to precompute values at write time as opposed to
read time. We will do that in this case and will calculate the vote total and
rank for a Quote each time a vote is cast. That slows down writes just a
little, but vastly speeds up reading, and makes our searches possible. That
is, if we want to present a page with the Quotes ordered by their ranking, we
rank property that is up to date to sort on. If we want to keep the
vote count consistent we are going to have to add in a vote and modify the
sum of votes for that quote in the same transaction. For that to work both
the quote and the vote need to be in the same entity group, and we accomplish
that by making each vote a child of its associated quote. There's more
details in the documentation on entity
groups and transactions.
With each vote a child of its associated quote we can put voting into a
transaction where the
are updated at the same time that the users vote is recorded.
def set_vote(quote_id, user, newvote): """ Record 'user' casting a 'vote' for a quote with an id of 'quote_id'. The 'newvote' is usually an integer in [-1, 0, 1]. """ if user is None: return email = user.email() def txn(): quote = Quote.get_by_id(quote_id) vote = Vote.get_by_key_name(key_names = user.email(), parent = quote) if vote is None: vote = Vote(key_name = user.email(), parent = quote) if vote.vote == newvote: return quote.votesum = quote.votesum - vote.vote + newvote vote.vote = newvote # See the docstring of main.py for an explanation of # the following formula. quote.rank = "%020d|%s" % ( long(quote.created * DAY_SCALE + quote.votesum), quote.creation_order ) db.put([vote, quote]) memcache.set("vote|" + user.email() + "|" + str(quote_id), vote.vote) db.run_in_transaction(txn) _set_progress_hasVoted(user)
Note that we need to specify the parent of the vote when querying for it, and we also need to specify the parent when creating a Vote.
The value of the
Quote.rank property is what we are focused on now. The
is what we will sort by to show the most popular items and it will need
to be calculated in a way that allows newer items to eventually float
to the top of the list even though older items may have received more
votes, otherwise our site will become very boring very fast. We don't
have background tasks that can go back and adjust the current rankings
of quotes over time, so we need a way to rank quotes that puts fresher
quotes higher in the ranking in a scalable way that will work for a
long period of time.
The insight here is that we can add
weight to a vote based on
when the Quote was added to the system. The rank for each quote is
rank = created * DAY_SCALE + votesum
Where the variables have the following meanings:
- Number of days after 1/1/2008 that the Quote was created.
- Sum of all +1 and -1 votes for a quote.
- This is a constant that determines how quickly votes should decay (defaults to 4). Another way to think about this is how many votes one more day is worth.
For 'created' we just pick a date in the past as our day zero and
start counting forward from there.
DAY_SCALE with a value of four means
that a vote for a Quote one day fresher will be worth four more votes.
That is, if Quote 1 and 2 have the same number of votes but Quote 2 was
added a day later then its rank will be 4 larger.
Does this work? Presume the following scenario:
Quote 0 and 1 are added on Day 1 and get 5 and 3 votes respectively. Order by rank is [Quote 0, Quote 1]. Quote Votes Rank -------------------------------- 0 (5) 1 * 4 + 5 = 9 1 (3) 1 * 4 + 3 = 7
Quote 0 and 1 get 3 and 0 more votes respectively. Quote 2 is added and gets 3 votes. Order by rank is now [Quote 0, Quote 2, Quote 1] Quote Votes Rank -------------------------------------- 0 (5) + (3) 1 * 4 + 8 = 12 1 (3) + (0) 1 * 4 + 3 = 7 2 (3) 2 * 4 + 3 = 11
Quote 2 gets two more vote. Quote 3 is added and gets 5 votes. Order by rank is [Quote 3, Quote 2, Quote 0, Quote 1] Quote Votes Rank ------------------------------------------- 0 (5) + (3) 1 * 4 + 8 = 12 1 (3) + (0) 1 * 4 + 3 = 7 2 (3) + (2) 2 * 5 + 4 = 14 3 (5) 3 * 4 + 5 = 17
Note that ties are possible, which means that rank for quotes will have to be disambiguated since the application allows paging of ranked quotes. That means for each rank we need to append information that will make it unique. In this case we will keep one counter per user and increment it each time that user adds a new quote to the system. The rank plus the user id and counter value is guaranteed to be unique for each quote.
Here is the code for generating a unique value based on the user:
def _unique_user(user): """ Creates a unique string by using an increasing counter sharded per user. The resulting string is hashed to keep the users email address private. """ def txn(): voter = _get_or_create_voter(user) voter.count += 1 voter.hasAddedQuote = True voter.put() return voter.count count = db.run_in_transaction(txn) return hashlib.md5(user.email() + "|" + str(count)).hexdigest()
For more information on paging see the article "How-To Do Paging on App Engine."
One of the most important design principles to keep in mind in building a performant App Engine application is to avoid contention, trying to update a single entity or entity group too quickly. Any single entity or entity group can only be updated about five times a second. Since a single quote and its associated votes form a single entity group we only need to consider how often any one quote is updated when looking for contention. In this application we'll assume that on the high side there will be a couple thousand votes per quote over the course of a day, which averages out to much less than five votes per second, so we are in good shape.
There is a memcache API that is very useful for speeding up an application and avoiding hitting the datastore. One of the most common queries in the Overhead application is pulling out each user's votes for a list of quotes so they are reflected in the user interface. We can cache the results in memcache and once there we avoid a datastore hit as long as the data is available in the cache. Here is the code that handles looking up votes and caching the results:
def voted(quote, user): """Returns the value of a users vote on the specified quote, a value in [-1, 0, 1].""" val = 0 if user: memcachekey = "vote|" + user.email() + "|" + str(quote.key().id()) val = memcache.get(memcachekey) if val is not None: return val vote = Vote.get_by_key_name(key_names = user.email(), parent = quote) if vote is not None: val = vote.vote memcache.set(memcachekey, val) return val
AJAX Libraries API
The last optimization you can make is to create a clean user interface for your application in HTML 5. Doing so will allow your application to run not only on desktops but also on the latest mobile devices. That saves you effort from having to build a custom client for multiple mobile devices, and that should be a huge time savings.
The Apache 2 licensed source for this application is available in the appengine-overheard-python project on GitHub. A live version of this sample application is running at http://just-overheard-it.appspot.com.