Google App Engine

Index Selection and Advanced Search

Alfred Fuller
September 2011

This is one of a series of in-depth articles discussing App Engine's datastore. To see the other articles in the series, see Related links.

The App Engine Datastore supports a fast and efficient means of querying large data sets while providing a relatively flexible set of SQL-like features. We accomplish this by shifting most of the querying cost from query time to write time through prebuilt indexes. Doing so requires knowing the form of all queries that can be possibly run in your application.

However, there are situations where it's impossible to know the exact form of a query ahead of time, like when the filters of the query are constructed dynamically based on user input. In these cases, all possible combinations of filters must be supported by the indexes defined by the application. Traditionally, this has required a combinatorial explosion in the number of indexes defined. Recent enhancements to the App Engine Datastore query planner have eliminated the requirement for such a proliferation in an application's indexes. This article describes how to take full advantage of these enhancements.

For brevity, we are using a shorthand notation for index definitions. The following definition denotes an index of kind model, with property prop1 ascending and prop2 descending:

Index(model, prop1, -prop2)

This is equivalent to the following configuration in index.yaml:

indexes:
- kind: model
  ancestor: no
  properties:
  - name: prop1
  - name: prop2
    direction: desc

or this configuration in datastore-indexes.xml:

<datastore-indexes>
    <datastore-index kind="model" ancestor="false">
        <property name="prop1" direction="asc" />
        <property name="prop2" direction="desc" />
    </datastore-index>
</datastore-indexes>

For more information, see Index Configuration (Python | Java).

Consider the following definition for a "Photo" entity:

Photo
Property Value type Meaning
owner_id String User id
tag List of strings Tokenized keywords
size Integer Enum:
  • icon
  • medium
  • large
license Integer Enum:
  • none
  • reuse
  • commercial reuse
  • reuse with modification
  • commercial reuse with modification
aspect Integer Enum:
  • tall
  • square
  • wide
  • panoramic
coloration Integer Enum:
  • black & white
  • color
date_added Integer Date/time
rating Float Aggregate user rating
comment_count Integer Number of comments
download_count Integer Number of downloads

If you know that you will be running queries of the form

SELECT * FROM Photo WHERE license=<reuse> ORDER BY date_added DESC

The following index must be added to your application at deployment time:

Index(Photo, license, -date)

If this index does not exist, the datastore returns a NeedIndexError in Python, or a DatastoreNeedIndexException in Java, every time this query is run.

For more information, see the Introduction to Indexes (Python | Java).

As an example, let's create an "Advanced Search" feature for photos. A fully interactive version of this form is available at http://advanced-search-demo.appspot.com/. Users can specify the following options:

Owner
Size Aspect
Coloration License
Order by

To allow for this type of search, there must be an index for every combination of filters and orders. As App Engine automatically builds ascending and descending indexes on properties, we do not need to include indexes where there are no filters and a single order; however, the following indexes are needed:

Index(Photo, owner_id, -date)
Index(Photo, owner_id, -rating)
Index(Photo, owner_id, -comments)
Index(Photo, owner_id,- downloads)
Index(Photo, size, -date)
Index(Photo, size, -rating)
Index(Photo, size, -comments)
Index(Photo, size, -downloads)
...
Index(Photo, owner_id, size, -date)
Index(Photo, owner_id, size, -rating)
Index(Photo, owner_id, size, -comments)
Index(Photo, owner_id, size, -downloads)
...
Index(Photo, aspect, coloration, license, owner_id, size, -date)
Index(Photo, aspect, coloration, license, owner_id, size, -rating)
Index(Photo, aspect, coloration, license, owner_id, size, -comments)
Index(Photo, aspect, coloration, license, owner_id, size, -downloads)

The total number of indexes is 2^(number of filters) * (number of different orders) = 2 ^ 5 * 4 = 128 indexes

With this number of indexes, putting a single Photo entity will cost 128 datastore write operations for the composite indexes alone (or $0.128 per 1000 entities as of the time of this writing.).

It is possible to specify this many indexes, but doing so has risks:

  • Potential to exceed the index cap (200)
  • Greatly increased write costs (an additional 128 writes per entity)
  • Greatly increased storage cost per entity (as this cost includes the size of the index entries)

However, specifying many indexes also has some benefits:

  • Query performance is as fast as possible.
  • Query performance is consistent and not dependent on shape of data.

You may have noticed that the last example did not include a tag property. Exposing the filters on tag to the user complicates the system greatly. The most desirable way to expose this in an "Advanced Search" box is to allow the user to specify a search string, which is tokenized into tags and a filter added for each tag.

For example, "outside, family" would be tokenized into "outside" and "family" and the filters "tag = 'outside' AND tag = 'family'" would be added to the query that fetches the photos. Including up to two tag filters in the combinatorial explosion of query possibilities results in: 2 ^ (# of filters) * (number of different orders) = 2 ^ 7 * 4 = 512 indexes.

Creating this many indexes:

  • Exceeds the index cap.
  • Supports only two tags when it is impossible to know how many tags the user will enter.
  • Causes the number of index values to explode exponentially (Python | Java), because a multivalued property such as tag is repeated in an index. As a result, the number of index values per entity rapidly approaches the maximum (5000) and the cost of writing and storing each entity increases accordingly.

If we look at how the index values exploded, we find that we can list tag a maximum of five times in a single index for there to be any chance of returning a result. The following table shows the number of values generated in each index when storing an entity with a tag property containing N values. It also shows the maximum number of tag values for that entity:

Index # of index values Maximum # of tags
Built-in (asc, desc) 2N2500 (capped at 1000 for a single list property)
+Index(Photo, tag, ...) N 1666 (capped at 1000 for a single list property)
+Index(Photo, tag, tag, ...) N ^ 269
+Index(Photo, tag, tag, tag, ...) N ^ 316
+Index(Photo, tag, tag, tag, tag, ...) N ^ 4 8
+Index(Photo, tag, tag, tag, tag, tag, ...) N ^ 5 5

Improved Query Planner

Luckily, the App Engine datastore now provides a new and improved query planner that can shift some of the query cost from write time to query time. It does this efficiently using the zigzag merge join algorithm, which tends to scale with the size of the result set much like querying against the indexes described above under Advanced Search. By design, the datastore supports only operations that scale in this fashion. The performance of operations that scale with the size of the entire database will deteriorate as the datastore grows.

The query planner can merge the results of any scans of indexes that are sorted by the same properties and find the results that are common to all such indexes (using the zigzag merge join algorithm). For example, Index(Photo, owner, -date_added) and Index(Photo, size, -date_added) are both ordered by -date_added. This means that these indexes are sufficient if you want to find entities that have both owner="user@example.com" and size=1 ordered by date_added descending. Previously, an additional index Index(Photo, owner, size, -date_added) would have been required.

The new query planner can also merge results from multiple sections of the same index. To find entities where the list of tags contained both "family" and "outside" ordered by date_added descending (SELECT * FROM Photo WHERE tag = 'family' AND tag = 'outside' ORDER BY date_added DESC), you need only the single index Index(Photo, tag, -date_added). Before the improvements, you would have needed the exploding index Index(Photo, tag, tag, -date_added).

To support the entire advanced photo search (including tag filters), the following is the minimal list of indexes required to run any combination of user-selected filters and sort orders:

Index(Photo, owner_id, -date)
Index(Photo, owner_id, -rating)
Index(Photo, owner_id, -comments)
Index(Photo, owner_id,- downloads)
Index(Photo, size, -date)
Index(Photo, size, -rating)
Index(Photo, size, -comments)
Index(Photo, size, -downloads)
...
Index(Photo, tag, -date)
Index(Photo, tag, -rating)
Index(Photo, tag, -comments)
Index(Photo, tag, -downloads)

The number of index entries needed is (number of filters + 1) * (number of orders) = 7 * 4 = 28. This is a much more manageable number. Additionally, none of these indexes are exploding, so the additional write cost and the storage cost of entities is similarly small (in this case an order of magnitude smaller, $0.024 instead of $0.128 per 1000 entities, not including filters on tag). With these indexes, a photo can have up to 1666 tags (capped at 1000) per photo, and there is no limit on the number of tags you can filter on. The drawback is that the performance of a query that uses zigzag merge join depends on the shape of the data. We'll discuss this further in the following section.

Performance

Because the zigzag merge join algorithm that enables this type of query combines filter components at read time instead of write time, the performance can vary depending on what filters are imposed. The best-case performance of zigzag merge join is

O(R), where R is the size of the result set

while the worst case performance is

O(S), where S is the size of smallest set of entities matching a single index scan

An index scan usually maps to a single filter in the query: for example, using the indexes listed above, this query

WHERE owner_id='12345' AND size=1 AND tag='family' ORDER BY date DESC

maps to an index scan per filter on the following indexes:

Index(Photo, owner_id, -date)
Index(Photo, size, -date)
Index(Photo, tag, -date)

However, if Index(Photo, owner_id, size, -date) were defined, it would map to two index scans on the following indexes:

Index(Photo, owner_id, size, -date)  #  Satisfies both 'owner_id=12345' and 'size=1'
Index(Photo, tag, -date)

The actual performance depends on the shape of the data. Specifically, the average number of entities considered for each result returned is O(S/R). This indicates that poor performance is likely when many entities match each scan, but few entities match the query as a whole (R is small and S is large).

Four things mitigate this risk:

  • The query planner looks up the actual entity only when it knows that entity matches the entire query.
  • The zigzag algorithm does not need to find all results to return the next result. This means if you request the first 10 results, you only pay the latency for finding those 10 results. Additionally, App Engine uses streaming and asynchronous prefetching to hide this latency as much as possible.
  • The zigzag algorithm will skip large sections of false positive results (results present in some, but not all, scans), so the worst-case performance happens only if false positive results are perfectly interwoven (in sort order) between scans.
  • The latency is based on the number of entities found in each "index scan", not the number of entities that match each "filter". This means that you can use better indexes when needed.

For example, consider a scenario in which you have a large number of panoramic images and a lot of black-and-white images, but very few panoramic images that are black-and-white. In this scenario, queries will be slow if they contain filters on these properties that use the following indexes:

Index(Photo, aspect, ...)
Index(Photo, coloration, ...)

However, the following index will produce very few results for the same filters, which in turn lowers the value of S and improves performance:

Index(Photo, aspect, coloration, ...)

This approach costs only an additional index and index write per entity, but it substantially improves performance. Figuring out what indexes are optimal for your app is a difficult problem. The answer can change as the shape of your data changes. However, there is no harm in experimentation if you keep the base indexes needed. For example, sampling query performance can give you a good idea of what queries are common and what queries are slow. Adding indexes to improve the performance of queries that are both common and slow is likely a step in the right direction.

Testing

As of the 1.5.3 SDKs, you can test the validity of your index definitions in the development web server. To do so, configure the development server so it does not auto-generate indexes. If you have a Python application, you can disable auto-generated indexes in the development web server using dev_appserver.py:

dev_appserver.py --require_indexes

If you have a Java application, you can disable auto-generated indexes in the development web server by specifying autoGenerate="false" in datastore-indexes.xml:

<datastore-indexes autoGenerate="false">

After you've disabled auto-generation, if a query cannot be satisfied by the available indexes, the development web server throws an exception (NeedIndexError in Python, or a DatastoreNeedIndexException in Java). Additionally, the exception will list both the minimal index and the index it would have auto-generated. The minimal index is the smallest single index that satisfies the given query. It is meant to give you an idea of what is needed, but is not necessarily to be used verbatim.

Authentication required

You need to be signed in with Google+ to do that.

Signing you in...

Google Developers needs your permission to do that.