Google App Engine

Accessing the datastore remotely with remote_api

Nick Johnson
February 27, 2009
Updated April 2011 by Johan Euphrosine

Note: This article uses the Python runtime. Documentation is also available for the Java runtime.

Introduction

Often when developing or supporting an App Engine application, it is useful to be able to manipulate the datastore in ways that are not well suited to the request/response model that works so well for serving web applications. Previously, doing these sorts of operations has entailed workarounds such as app3 or App Rocket. Starting with release 1.1.9 of the App Engine SDK, however, there's a new way to interact with the datastore, in the form of the remote_api module. This module allows remote access to the App Engine datastore, using the same APIs you know and love from writing App Engine apps.

In this article, we'll introduce you to the remote_api module, describe its basic functionality, and show you how to get an interactive console with access to your app's datastore. Then, we'll give an overview of the limitations of the remote_api module. Finally, we'll walk through a more sophisticated example: an implementation of the 'Map' part of a map/reduce operation, allowing you to execute a function on every entity of a kind.

About the remote_api module

The remote_api module consists of two parts: A 'handler', which you install on the server to handle remote datastore requests, and a 'stub', which you set up on the client to translate datastore requests into calls to the remote handler. remote_api works at the lowest level of the datastore, so once you've set up the stub, you don't have to worry about the fact that you're operating on a remote datastore: With a few caveats, it works exactly the same as if you were accessing the datastore directly.

Installing the handler is easy. Simply add the following top-level 'builtins' directive to your app.yaml:

builtins:
- remote_api: on

This installs the remote_api handler under the URL '/_ah/remote_api'.

Once you've updated the app.yaml file, run 'appcfg.py update' on your app's directory to upload the new mapping.

Running the remote_api shell

By running remote_api_shell.py from the command line, you can interact with a Python shell that has access to your production datastore.

Since you will probably want to access modules defined by your app, such as your model definitions, you need to make sure your app is on the Python path. The easiest way to do this is to change directory to your app's root directory (the one containing app.yaml) before running the remote_api shell. Then, just execute:

python $GAE_SDK_ROOT/remote_api_shell.py -s your_app_id.appspot.com

Replace 'your_app_id' with the app ID of your app, and you should get a Python interactive console prompt.

For demonstration purposes, we'll use the Guestbook app from the Getting Started documentation. Assuming we're in the root directory for the guestbook app, issue:

>>> import helloworld
>>> from google.appengine.ext import db

Now that we have access to the contents of the guestbook app, we can issue commands just like we would if we were writing code to run on the server:

>>> # Fetch the most recent 10 guestbook entries
>>> entries = helloworld.Greeting.all().order("-date").fetch(10)
>>>
>>> # Create our own guestbook entry
>>> helloworld.Greeting(content="A greeting").put()

In general, the console will act exactly as if you were accessing the datastore directly, but because the script is running on your own machine, you don't have to worry about how long it takes to run, and you can access all the files and resources on your local machine as you normally would!

Limitations of remote_api

The remote_api module goes to great lengths to make sure that, as far as possible, it behaves exactly like the native App Engine datastore. In some cases, this means doing things that are less efficient than they might otherwise be. When using remote_api, here are a few things to keep in mind:

Every datastore request requires a round-trip

Since you're accessing the datastore over HTTP, there's a bit more overhead and latency than when you access it locally. In order to speed things up and decrease load, try to limit the number of round-trips you do by batching gets and puts, and fetching batches of entities from queries. This is good advice not just for remote_api, but for using the datastore in general, since a batch operation is only considered to be a single Datastore operation. For example, instead of this:

for key in keys:
    rec = MyModel.get(key)
    rec.foo = bar
    rec.put()

you can do this:

records = MyModel.get(keys)
for rec in records:
    rec.foo = bar
db.put(records)

Both examples have the same effect, but the latter requires only two round-trips in total, while the former requires two round-trips for each entity.

Requests to remote_api use quota

Since remote_api operates over HTTP, every datastore call you make incurs quota usage for HTTP requests, bytes in and out, as well as the usual datastore quota you would expect. Bear this in mind if you're using remote_api to do bulk updates.

1 MB API limits apply

As when running natively, the 1MB limit on API requests and responses still applies. If your entities are particularly large, you may need to limit the number you fetch or put at a time to keep below this limit. This conflicts with minimizing round-trips, unfortunately, so the best advice is to use the largest batches you can without going over the request or response size limits. For most entities, though, this is unlikely to be an issue.
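If you need to stay under the limit, one simple approach is to split your keys or entities into fixed-size chunks before each batch call. Here's a minimal sketch in plain Python (the chunk size of 50 in the commented usage is an arbitrary assumption; tune it to your entity sizes):

```python
def chunks(items, size):
    """Yield successive slices of at most 'size' items from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage against the datastore API:
#   for batch in chunks(keys, 50):
#       entities = db.get(batch)
#       # ... modify entities ...
#       db.put(entities)
```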

Avoid iterating over queries

One common pattern with datastore access is the following:

q = MyModel.all()
for entity in q:
    # Do something with entity
    pass

When you do this, the SDK fetches entities from the datastore in batches of 20, fetching a new batch whenever it uses up the existing ones. Because each batch has to be fetched in a separate request by remote_api, it's unable to do this as efficiently. Instead, remote_api executes an entirely new query for each batch, using the offset functionality to get further into the results.
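To get a feel for the cost of offset-based batching, note that fetching batch k with an offset forces the datastore to skip over all of the preceding batches first. A back-of-the-envelope calculation (plain Python, not App Engine code) shows how quickly this adds up:

```python
def offset_scan_cost(total, batch_size):
    """Entities scanned when paging with offsets: batch k skips
    k * batch_size entities before returning its results."""
    batches = (total + batch_size - 1) // batch_size
    return sum(k * batch_size + min(batch_size, total - k * batch_size)
               for k in range(batches))

# Paging through 1000 entities 20 at a time scans
# offset_scan_cost(1000, 20) == 25500 entities in total,
# versus exactly 1000 when every entity is visited only once.
```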

If you know how many entities you need, you can do the whole fetch in one request by asking for the number you need:

entities = MyModel.all().fetch(100)
for entity in entities:
    # Do something with entity
    pass

If you don't know how many entities you will want, you can use cursors to efficiently iterate over large result sets. This also allows you to avoid the 1000 entity limit imposed on normal datastore queries:

query = MyModel.all()
entities = query.fetch(100)
while entities:
    for entity in entities:
        # Do something with entity
        pass
    query.with_cursor(query.cursor())
    entities = query.fetch(100)

Transactions are less efficient

In order to implement transactions via remote_api, it accumulates information on entities fetched inside the transaction, along with copies of entities that were put or deleted inside the transaction. When the transaction is committed, it sends all of this information off to the App Engine server, where it has to fetch all the entities that were used in the transaction again, verify that they have not been modified, then put and delete all the changes the transaction made and commit it. If there's a conflict, the server rolls back the transaction and notifies the client end, which then has to repeat the process all over again.

This approach works, and exactly duplicates the functionality provided by transactions on the local datastore, but is rather inefficient. By all means use transactions where they are necessary, but try to limit the number and complexity of the transactions you execute in the interest of efficiency.
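The verify-and-retry protocol described above is essentially optimistic concurrency control. As a rough illustration of the idea (an in-memory sketch with version numbers, not the actual remote_api implementation):

```python
class ConflictError(Exception):
    pass

# Toy 'datastore': each entity carries a version number.
store = {'greeting': {'value': 'hi', 'version': 1}}

def commit(key, read_version, new_value):
    """Apply a change only if the entity is unchanged since it was read."""
    if store[key]['version'] != read_version:
        raise ConflictError('entity modified since read')
    store[key] = {'value': new_value, 'version': read_version + 1}

def transact(key, update_fn, retries=3):
    """Read, transform, and commit; on conflict, retry from scratch."""
    for _ in range(retries):
        snapshot = dict(store[key])
        try:
            commit(key, snapshot['version'], update_fn(snapshot['value']))
            return
        except ConflictError:
            continue
    raise ConflictError('too many conflicts; giving up')
```

Here transact('greeting', lambda v: v + '!') succeeds on the first attempt; if another client had bumped the version between the read and the commit, the whole transaction would be re-run from the start, which is why complex transactions over remote_api are expensive.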

Putting remote_api to work

Now that we've demonstrated the power of remote_api and outlined its limitations, it's time to put what we've learned to work with a practical tool. It is frequently useful to be able to iterate over every entity of a given kind, whether to extract its data, or to modify it and store the updated entity back to the datastore.

In order to achieve this, we're going to implement a simple 'map' framework. We'll define a class, Mapper, that exposes a map() method for subclasses to extend, and a couple of fields - KIND and FILTERS - for them to define what kind to map over, and any filters to apply.

class Mapper(object):
    # Subclasses should replace this with a model class (e.g., model.Person).
    KIND = None

    # Subclasses can replace this with a list of (property, value) tuples to filter by.
    FILTERS = []

    def map(self, entity):
        """Updates a single entity.

        Implementers should return a tuple containing two iterables (to_update, to_delete).
        """
        return ([], [])

    def get_query(self):
        """Returns a query over the specified kind, with any appropriate filters applied."""
        q = self.KIND.all()
        for prop, value in self.FILTERS:
            q.filter("%s =" % prop, value)
        return q

    def run(self, batch_size=100):
        """Executes the map procedure over all matching entities."""
        q = self.get_query()
        entities = q.fetch(batch_size)
        while entities:
            to_put = []
            to_delete = []
            for entity in entities:
                map_updates, map_deletes = self.map(entity)
                to_put.extend(map_updates)
                to_delete.extend(map_deletes)
            if to_put:
                db.put(to_put)
            if to_delete:
                db.delete(to_delete)
            q.with_cursor(q.cursor())
            entities = q.fetch(batch_size)

As you can see, there's not much to it. First, we define a convenience method, get_query(), that returns a query that matches the kind and filters specified in the class definition. This method could optionally be overridden by a subclass, for example to support varying the filters at runtime, as long as it uses only equality filters. Then, we define an instance method, run(), which iterates over every matching entity in batches, calling the map() function on each one, and updating or deleting the entity as appropriate.
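The shape of run()'s loop is easier to see against an ordinary Python list than a live datastore. Here's a toy, in-memory analogue of the same control flow (illustrative only; none of these names are App Engine APIs):

```python
def run_map(entities, map_fn, batch_size=2):
    """Apply map_fn to entities in batches, collecting updates and deletes."""
    updated, deleted = [], []
    for start in range(0, len(entities), batch_size):
        for entity in entities[start:start + batch_size]:
            to_update, to_delete = map_fn(entity)
            updated.extend(to_update)
            deleted.extend(to_delete)
        # The real Mapper flushes once per batch at this point,
        # with db.put() and db.delete().
    return updated, deleted

# A map_fn mirroring the guestbook example below:
def mark_foo(entity):
    return ([entity + ' Bar!'], []) if 'foo' in entity else ([], [])
```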

One caveat to our Mapper: The map process does not work from a snapshot of the datastore. So if you return new entities from map() that themselves meet the criteria for mapping, you may get them passed in to the map() function later in the process. Whether or not they do depends on where their key sorts compared to the current record's key. As a general rule, if you're going to create new entities of the same type in a map() function, you need some way to distinguish them from the original entities so you don't process them a second time.

In order to use this class, we define a subclass that implements the map() function. In this example, we're going to add the phrase 'Bar!' to any guestbook entry that contains the phrase 'foo':

class GuestbookUpdater(Mapper):
    KIND = Greeting

    def map(self, entity):
        if entity.content.lower().find('foo') != -1:
            entity.content += ' Bar!'
            return ([entity], [])
        return ([], [])

Then, we instantiate our class and call run():

mapper = GuestbookUpdater()
mapper.run()

You can try this out for yourself easily: Just enter the code in the interactive console we set up earlier.

Finally, here's a practical - though trivial - example of where our new framework can be useful: Deleting all the entities of a given kind.

class MyModelDeleter(Mapper):
    KIND = MyModel

    def map(self, entity):
        return ([], [entity])

Simple! Because Mapper takes care to always access KIND and FILTERS as instance variables, we can even generalize this to allow you to select the kind and filters at runtime:

class BulkDeleter(Mapper):
    def __init__(self, kind, filters=None):
        self.KIND = kind
        if filters:
            self.FILTERS = filters

    def map(self, entity):
        return ([], [entity])

Of course, this is only the start of what you can do with remote_api and the Mapper framework. If you have a novel use you've come up with, please post it to the group - we'd love to hear about it.
