November 24, 2008
Every App Engine datastore query is served by an index. The built-in indexes can handle simple queries, including all entities of a given kind, filters and sort orders on a single property, and equality filters on any number of properties. More complex queries, on the other hand, need custom developer-defined indexes.
Internally, we call these composite indexes, to distinguish them from the built-in kind and single property indexes. We use the word composite because their index data is composed of multiple values per index row.
When you add a new composite index to your app, it must be populated with existing data from your datastore before it can serve queries. Similarly, when you delete an index from your app, its index rows must be removed. App Engine runs an index building workflow behind the scenes that populates and deletes indexes.
App Engine stores index data in four Bigtables, shared across all apps. The first three tables store the built-in kind, single property ascending, and single property descending indexes. The last table stores all of the composite indexes.
The index data for a given index row is stored in the Bigtable row name. This includes app id, kind, key, and extra data that varies by index. The built-in single property indexes' extra data includes the property name and value. The built-in kind index has no extra data.
Composite indexes' extra data includes the values for each property in the index definition, along with the ancestor, if specified. Property names are omitted to save space, since the index definitions themselves include the property names. However, composite index rows do include an index id, which distinguishes between index rows from different composite indexes that happen to have the same property values in the same order.
For more on index layout, see the Google I/O talk Under the Covers of the Google App Engine Datastore.
Index Building and Deleting
When you define a new index and ask us to build it, we follow these steps:
- Add the index definition metadata to the app.
- Mark the index metadata Building. This tells the datastore to update it on all writes, but to not use it for queries (yet).
- Map over the existing entities and populate them in the index.
- Mark the index Serving. This tells the datastore that the index is available for querying.
Deleting an index is almost the same process:
- Mark the index metadata Deleting. This tells the datastore to ignore it on both writes and queries.
- Remove the index's rows.
- Delete the index's metadata.
When an entity is created or deleted, all of the index rows for that entity must be updated. However, when an existing entity is updated, the datastore determines the delta, i.e. the properties that were changed, and only updates the index rows that include those properties.
Furthermore, the datastore handles composite indexes automatically. When an entity is put or deleted, the datastore determines the composite index rows that need updating, as above, and updates them.
Since the datastore handles index updates efficiently and automatically, the index building workflow simply reuses the datastore's code. To build or delete an index, the workflow maps over all of the app's entities, one at a time. For each entity, the workflow uses the datastore to read it and write it back, unchanged, in a transaction. The datastore detects any Building or Deleting indexes and updates the relevant index rows accordingly.
Doing Work in Parallel
The index building workflow has workers that build and delete indexes incrementally, in parallel. This allows it to build and delete indexes for different apps at the same time. It also allows for parallelizing the work of building or deleting a single app's indexes. Finally, it allows for checkpointing and resuming work if a worker dies for any reason.
The workflow maintains a central list of all apps with indexes to be built or deleted. Each app's data is split into shards. When a worker is idle, it consults this list, takes a lease on a shard of data, and maps over the entities in that shard, populating index rows as described above. If the lease expires before the worker is done, it gives up, discards the partially completed work, and the shard becomes available for other workers to retry.
We've implemented a few different mechanisms for sharding data: first by Bigtable tablet, then later within tablets. Both of these methods are explained below.
Splitting by Tablet
Surprisingly, splitting an app's entities into shards isn't easy. Ideally, the workflow would just ask Bigtable for every Nth row, starting from the app's first entity and ending at the last. Those would be the "split" points that demarcate each shard.
However, Bigtable intentionally doesn't provide such an offset operation. So, we approximated. Bigtable splits data into contiguous ranges of rows called tablets, which may be split, merged, and moved between tablet servers at runtime. Bigtable does know each tablet's start and end rows, so we started by inserting a shard for each tablet.
Unfortunately, tablets are limited only by total data size, so the number of rows (ie entities) can vary widely across tablets. The time required to populate index data varied much more by the number of properties in an entity than the size of that entity, so different tablets shards took workers drastically different amounts of time. Workers often exceeded their leases on tablets with a large number of rows, which kept indexes building or deleting for long periods of time.
Splitting Within Tablets
To address this, we recently added support for splitting large shards. When a
worker notices that its lease on a shard is about to expire, it gives up working
and instead splits the shard into
n smaller shards. Each smaller
shard includes the original shard's row range and two sharding
n, which is the same for all pieces,
k, which ranges from
When a worker takes on a shard that's been split with sharding parameters, it
maps over all of the entities in the shard's row range, as usual. As it
encounters each entity, it hashes the entity's key, and only populates indexes
for that entity if the hash equals
k modulo n.
An alternative design would be to populate every
This would be simpler, but might miss entities, since the app is serving during
this process and could insert or delete entities.
It's worth emphasizing that a shard's row range doesn't change when it's split. Workers still scan over the entire row range. This scan is relatively cheap compared to the reads and writes required to update the indexes, though. Disk bandwidth is much cheaper than disk seeks, so Bigtable scans are much cheaper than random access reads and writes to individual rows!
Here are some common questions about how we build and delete indexes internally. (If you're curious about how to specify indexes for your app, see our documentation.)
Q: Why do my indexes stay Building or Deleting
for long periods of time?
A: In the past, this often happened because worker shards were too large to be completed within the lease period. To address this, we first increased the lease period. Now, we split individual tablets into shards. Index building can still take some time, but it's considerably faster than it was before.
Q: Why are my indexes marked Error?
A: They're probably exploding indexes (Python | Java). Occasionally, though, we have to move indexes into Error manually. We usually contact you directly when we do this.
Q: Why map over the app's whole datastore? Can't you use the
built-in kind index to map over just the entities of the index's kind, and
skip the other kinds?
A: The transaction isolation level of individual entity reads and writes is
READ COMMITTED, but indexes don't have the
same guarantee. Given that, the kind index might miss some entities. See
Isolation in App Engine for more details.
Q: When an index is deleted, can't you just delete its index rows
directly? You could do a Bigtable prefix scan with the app id and the index
id, right? Why map over the entities themselves?
A: We wish it were that easy! The entities include metadata about the composite indexes that apply to them, so we need to update that metadata as well as deleting the index rows.