Google App Engine

How Entities and Indexes are Stored

Jason Cooper
December 2009

This is one of a series of in-depth articles discussing App Engine's datastore. To see the other articles in the series, see Related links.

This article aims to give you greater insight into how your data is stored behind the scenes by describing the tables and data types used to store application data. It is designed to serve as a reference for estimating the amount of datastore space needed ahead of a large-scale data import. This information is provided for estimation purposes only: estimated and actual resource usage may differ due to factors such as data encoding and replication, but this article should still help you better understand how your data is stored. Also note that the data structures described below may change in subtle and not-so-subtle ways over time, and this article will be updated periodically to account for any changes. As before, the number reported in the Dashboard remains the canonical source for total data usage for a particular app.

This article is broken down into the following sections: Background, Notation, Tables, and Conclusion.

Background

App Engine's datastore is built on top of Bigtable, a large, distributed key-value store that is built to scale. Several other large Google properties use Bigtable, including web search and Gmail. While Bigtable does use tables to store data, it differs from relational database management systems like MySQL and PostgreSQL because it is schema-less: a given row in a Bigtable table may have different columns than any other row before it. In this sense, a Bigtable functions as a large, sorted, multi-dimensional array in which each row has a collection of columns that may or may not match the columns of other rows in the same table.
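As a rough mental model only, the schema-less, sorted layout described above can be imitated with a Python dictionary of dictionaries. The row keys, column names, and the `scan` helper below are invented for illustration and say nothing about how Bigtable is actually implemented.

```python
# Toy model of a schema-less, sorted table: row key -> {column name: value}.
# The keys and columns here are made up for illustration.
rows = {
    "app1/Person/1": {"name": "Ada", "height": 63},
    "app1/Person/2": {"name": "Bob"},                # no "height" column
    "app1/Employee/7": {"name": "Cy", "dept": "R&D"},
}


def scan(table, prefix):
    """Yield (key, columns) for keys starting with prefix, in sorted key order."""
    for key in sorted(table):
        if key.startswith(prefix):
            yield key, table[key]


person_rows = list(scan(rows, "app1/Person/"))
for key, columns in person_rows:
    print(key, sorted(columns))
```

Because rows are kept in key order, a prefix scan like this returns related rows together, and each row is free to carry its own set of columns.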

A full discussion and tear-down of Bigtable is outside the scope of this article. If you want more low-level details about how this storage system works, please read the white paper.

Notation

Throughout this article, I'll reference a handful of data types in my discussion of property fields. Each of these types has a physical size, so rather than repeating this information throughout the article, I'll define it here.

Type Bytes Description
str * str fields are variable-length strings, meaning fields of this type don't have a fixed size: the physical space needed depends on both the length and the character encoding of the string. Characters in the ASCII set need only one byte while non-ASCII characters consume two or more bytes each. The physical size of a given string field is the total number of bytes needed to represent the string with the specified encoding. Internally, all str values are UTF-8 encoded and sorted by UTF-8 code points.
int32 4 int32 fields encode integers as 32-bit bit strings, so fields of this type take up 4 bytes of physical space.
int64 8 int64 fields encode integers as 64-bit bit strings, so fields of this type take up 8 bytes of physical space.
double 8 double fields represent floating-point numbers as 64-bit bit strings consuming 8 bytes of physical space.
bool 4 bool fields encode two values, true and false, using 4 bytes of physical storage. Although boolean properties could be more efficiently stored using a single byte, a four-byte word encoding is necessary due to the protocol buffer toolchain used internally.
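The sizes in the table above can be checked with a few lines of Python 3. This is a minimal sketch: `str_size` and the constant names are hypothetical helpers introduced here, not part of any App Engine API.

```python
import struct


def str_size(value):
    """Bytes needed for a str field: the length of its UTF-8 encoding."""
    return len(value.encode("utf-8"))


# Fixed sizes from the Notation table.
INT32_SIZE = struct.calcsize("<i")   # 32-bit integer
INT64_SIZE = struct.calcsize("<q")   # 64-bit integer
DOUBLE_SIZE = struct.calcsize("<d")  # 64-bit floating-point number
BOOL_SIZE = 4                        # stored as a four-byte word

ascii_bytes = str_size("Smith")           # five ASCII characters
accented_bytes = str_size("M\u00fcller")  # six characters, one of them non-ASCII

print(ascii_bytes, accented_bytes, INT32_SIZE, INT64_SIZE, DOUBLE_SIZE, BOOL_SIZE)
```

Note that the second string has six characters but needs seven bytes, because the non-ASCII character takes two bytes in UTF-8.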

Several rows in the information tables below are marked with (...) alongside their name. This indicates that the row may correspond either to multiple columns in the corresponding Bigtable (e.g. the Entities table shows one row for "Index ID" even though multiple indexes can apply to a given entity) or to multiple rows, one per value, in the corresponding Bigtable. See the description next to each (...) field for more context.

Tables

App Engine uses a total of six Bigtables to store all datastore entities and indexes for every deployed App Engine application. Each of these tables is detailed below with a summary of the columns in each and the physical storage required.

Entities table

This one Bigtable holds all entities for all App Engine applications. The key of each entity row contains the application ID associated with the current entity, so each application can retrieve only its own entities. The following sections break down the data stored in the Entities table and how much space each item consumes.

Key

An entity's key contains the ID of the application that created the entity along with the path, a list of kind names paired with either a numerical ID or string-based key name. Every path is guaranteed to be unique. For more on keys, see the documentation.

Name Type Description
Application ID str This is the identifier you selected when you created the application in the Administration Console.

Due to the way that application IDs are encoded and stored in Bigtable, choosing a shorter application ID does not generally save much storage compared with a longer ID, although there may be a small savings depending on the number of entities in your datastore. Application IDs used in reference properties, however, are not encoded, so if your datastore entities contain many reference properties, they will consume less space in apps with a shorter app ID.

Path multiple

A path is a concatenation of entity keys. Every path begins with the key of the root entity (which may be the current entity itself) in the current entity group. If the current entity is not the root, then the key of each ancestor is appended to the path, from top to bottom, until the current entity's key is appended.

The first component of a key is the entity kind, the model or class name given to a model object (str). The next component is an ID (int64) or key name (str). Note that entities can have a numerical ID or a key name but not both. If you manually set the key name, your entities may be physically larger than they would be with an App Engine-generated ID.

For more on keys, see the documentation (Python | Java | Go).

Entity keys are actually stored twice, once in the Entities table's row name using a special prefix delta encoding scheme for more efficient space utilization and again in the entity protocol buffer without the special encoding. Both the prefix delta encoding scheme and protocol buffers are described in more detail later in this article.
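To get a feel for how the path contributes to entity size, the following Python 3 sketch adds up the unencoded sizes of each path component using the types from Notation. The `path_size` and `key_size` helpers and the example application ID and kinds are assumptions made for illustration; the sketch ignores both the prefix delta encoding used in the row name and any protocol buffer overhead.

```python
def path_size(path):
    """Rough unencoded size in bytes of a key path.

    path is a list of (kind, id_or_name) pairs, from the root entity down
    to the current entity. Numerical IDs count as int64 (8 bytes); kinds
    and key names count as UTF-8 strings.
    """
    total = 0
    for kind, id_or_name in path:
        total += len(kind.encode("utf-8"))
        if isinstance(id_or_name, int):
            total += 8
        else:
            total += len(id_or_name.encode("utf-8"))
    return total


def key_size(app_id, path):
    """Application ID plus path, both unencoded."""
    return len(app_id.encode("utf-8")) + path_size(path)


with_id = key_size("myapp", [("Company", 1), ("Employee", 42)])
with_name = key_size("myapp", [("Company", 1), ("Employee", "jsmith@example.com")])
print(with_id, with_name)
```

In this example the key name costs ten more bytes than a generated ID, which is the kind of difference the note above about manually set key names refers to.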

Properties

Every entity has one or more properties in addition to its key. Every property has a name and value, the value corresponding to one or more types defined in Notation above. A subset of the available properties and their corresponding value type or types is provided below. Based on the property type, you can determine the total storage space that will be needed to store a value of that type, although in many cases, this depends on the actual value.

StringProperty str
IntegerProperty int64
FloatProperty double
BooleanProperty bool
ReferenceProperty Key
GeoPtProperty Composite type of latitude (double) and longitude (double) – 16 bytes total
UserProperty Composite type of email address (str), authentication domain (str), and internal ID (int64)

The total space consumed by an entity depends not only on the property types and values but on the names of the properties as well. For each property value, the following is stored:

Name Type Description
Property name str

The property name is defined as the name of the field used to refer to a given property in the corresponding model class.

Because App Engine's datastore is essentially schemaless, every entity of a particular kind must store the name of each property in addition to the value, even if every other entity of that kind uses the same set of properties. While this redundancy does result in a slight storage overhead, it also offers more flexibility for model definitions (see Effective PolyModel for an example).

Property value(s) *

The value of a property can be one or more of the property types listed above.

Internally, App Engine stores entities in protocol buffers, efficient mechanisms for serializing structured data; see the open source project page for more details. Instead of storing each entity property as an individual column in the corresponding Bigtable row, a single column is used which contains a binary-encoded protocol buffer containing the names and values for every property of a given entity.

Protocol buffers associate a tag number with every field, and each tag number requires at least one byte of space. They also require additional bytes for non-native property types like GeoPtProperty and LinkProperty as well as multi-valued property types. So in addition to the bytes needed for the actual property values, you should also account for the extra space needed by the protocol buffers themselves.
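The points above can be combined into a back-of-the-envelope estimate of the space taken by an entity's properties. The Python 3 sketch below is only an approximation: `value_size`, `properties_size`, and the `overhead_per_value` parameter are hypothetical names, and the default overhead of one byte is a placeholder lower bound, not a measured protocol buffer cost.

```python
def value_size(value):
    """Size in bytes of one property value, using the types from Notation."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return 4
    if isinstance(value, int):    # int64
        return 8
    if isinstance(value, float):  # double
        return 8
    if isinstance(value, str):    # UTF-8 encoded str
        return len(value.encode("utf-8"))
    raise TypeError("unsupported value type: %r" % type(value))


def properties_size(properties, overhead_per_value=1):
    """Estimate the bytes needed for an entity's properties.

    properties maps a property name to a value, or to a list of values for
    a multi-valued property. The name is counted once per stored value.
    """
    total = 0
    for name, value in properties.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            total += len(name.encode("utf-8")) + value_size(item) + overhead_per_value
    return total


estimate = properties_size({"last_name": "Smith", "height": 63, "active": True})
print(estimate)
```

Shorter property names reduce this estimate directly, since the name is stored alongside every value.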

Entity metadata

Entity rows also contain the following columns:

Name Type Description
Entity group Key This column contains the key of the root entity in the hierarchy of the current entity, which may be the current entity's key itself.
Kind str This column contains the name of the model or class that the entity referenced in the last column belongs to, e.g. Employee.

Data for custom indexes

Custom indexes are necessary for executing complex queries, and are covered in greater depth later in this article – see EntitiesByCompositeProperty. In addition to the EntitiesByCompositeProperty Bigtable, custom index row data is stored directly in the Entities table as well. Given this, the more custom indexes your application requires, the more space is needed for every entity.

For each custom index defined for the entity's kind, the following is stored:

Name Type Description
Index ID int64 This contains the 64-bit identifier of the corresponding index.
Ancestor (...) Key

If the custom index includes ancestors, this data item will be repeated for each entity in its path. One item contains the key of the current entity while the repeated items contain the key of each distinct ancestor (assuming the current entity is not a root entity).

Property values (...) * Finally, the value of each property used in the custom index is stored. If a given property is multi-valued, this data item is repeated for each combination of values. Note that this can result in explosion – see below for more information. As in Properties above, the total space needed depends on the type and value of the individual properties.

As with Properties above, each custom index data item is encoded as a protocol buffer and stored in a single repeated column in Bigtable.

Index tables

Four Bigtables are used to store all indexes for every App Engine application. Each table stores data for a particular type of index. The first three of these tables – EntitiesByKind, EntitiesByProperty ASC, and EntitiesByProperty DESC – are managed by App Engine automatically as new entities are written. The fourth, EntitiesByCompositeProperty, must be explicitly defined. For more on indexes, see the documentation (Python | Java | Go) and this Google I/O presentation.

Typically, the total space needed for str fields depends on the length and encoding of the value. For the str columns in the index tables described below, however, a special prefix delta encoding scheme is used which can shave bytes off the total that would otherwise be needed. For estimation purposes, you can treat these as regular str values, but the actual space used is typically smaller.
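This article does not spell out the exact scheme App Engine uses, so the following Python 3 sketch shows only the general idea behind prefix delta encoding (also known as front coding): in a sorted list, each key is stored as the number of leading bytes it shares with the previous key plus the remaining suffix. The function names and sample keys are hypothetical.

```python
def prefix_delta_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    previous = b""
    for key in sorted_keys:
        data = key.encode("utf-8")
        shared = 0
        limit = min(len(previous), len(data))
        while shared < limit and previous[shared] == data[shared]:
            shared += 1
        encoded.append((shared, data[shared:]))
        previous = data
    return encoded


def prefix_delta_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    previous = b""
    for shared, suffix in encoded:
        data = previous[:shared] + suffix
        keys.append(data.decode("utf-8"))
        previous = data
    return keys


keys = [
    "myapp/Person/last_name/Smith",
    "myapp/Person/last_name/Smithers",
    "myapp/Person/last_name/Smythe",
]
encoded = prefix_delta_encode(keys)
raw_bytes = sum(len(k.encode("utf-8")) for k in keys)
suffix_bytes = sum(len(suffix) for _, suffix in encoded)
print(encoded, raw_bytes, suffix_bytes)
```

Because index rows for the same application, kind, and property share long prefixes, most of each key collapses into a small count, which is why the actual space used is typically smaller than the plain str estimate.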

EntitiesByKind

This index table enables you to retrieve all entities of a particular kind. There are three pieces of data stored in this table: app ID, kind, and the key of a corresponding entity. Every time a new entity of any kind is added, a new row is automatically added to this table with the new entity's key so it can be queried later.

Name Type Description
Application ID str This column contains the application to which the entity referenced in the "Entity key" column belongs.
Kind str This column contains the name of the model or class that the entity referenced in the "Entity key" column belongs to, e.g. Employee.
Entity key Key The entity key is stored so the entity itself can be efficiently retrieved if it's returned as a result for the executed query.

EntitiesByProperty ASC

Unless you specify otherwise using indexed=False, indexes are automatically added for each individual property except those of type Text and Blob, which can't be indexed. These indexes do not appear in the index configuration file or the Admin Console, but they exist to facilitate simple queries like the following:

SELECT * FROM Person WHERE last_name = "Smith"

Like the previous index table, this table is fairly simple, containing app ID, kind, property name and value, and the key of a corresponding entity.

Name Type Description
Application ID str This column contains the application to which the entity referenced in the "Entity key" column belongs.
Kind str This column contains the name of the model or class that the entity referenced in the "Entity key" column belongs to, e.g. Employee.
Property name str Since this Bigtable is used to facilitate queries on single properties, the name and value of every indexed property of every entity must be stored. This column and the next are used for this purpose.
Property value * This column contains the value of the property named in the previous column. The total space needed depends on the type and value of the property (see Property values).
Entity key Key The entity key is stored so the entity itself can be efficiently retrieved if it's returned as a result for the executed query.

For multi-valued properties, such as ListProperty and StringListProperty, each value has its own index row, so using multi-valued properties does result in more indexing overhead.
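To see how quickly these rows add up, the Python 3 sketch below builds the rows that one entity would contribute to a single-property index table. It is an illustration under stated assumptions: `property_index_rows`, its `unindexed` parameter, and the sample entity are hypothetical and do not reflect the real storage format.

```python
def property_index_rows(app_id, kind, entity_key, properties, unindexed=()):
    """Return (app_id, kind, property_name, property_value, entity_key) rows.

    properties maps a property name to a value, or to a list of values for
    a multi-valued property. Names listed in unindexed produce no rows.
    """
    rows = []
    for name, value in properties.items():
        if name in unindexed:
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            rows.append((app_id, kind, name, item, entity_key))
    return rows


person = {
    "last_name": "Smith",
    "tags": ["admin", "staff", "remote"],  # multi-valued: one row per value
    "bio": "A long description that is never queried.",
}
rows = property_index_rows("myapp", "Person", "Person/1", person, unindexed={"bio"})
print(len(rows))
```

The same rows are written to both the ascending and the descending table, so the total number of single-property index rows for this entity is twice the count printed above.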

EntitiesByProperty DESC

This table contains the same columns and data as the table immediately above but sorted in the opposite direction in order to efficiently handle queries with descending sort orders like the following:

SELECT * FROM Person WHERE height < 63 ORDER BY height DESC

EntitiesByCompositeProperty

The other tables described in this section support automatic (a.k.a. built-in or non-custom) indexes. Custom indexes are necessary for executing more complex queries, including queries with multiple sort orders or queries with an equality filter on one property and an inequality filter on another. Custom indexes require an entry in index.yaml (Python and Go) or datastore-indexes.xml (Java).

The index definitions, which include the properties that are used by the index and their respective sort orders, are stored in a separate Bigtable which is covered in the next section. These definitions are keyed by index ID, the same field used to reference the entities in the next table:

Name Type Description
Index ID int64 This column contains the 64-bit identifier for the custom index.
Application ID str This column contains the application for which this index is defined.
Kind str This column contains the name of the model or class of objects being targeted by the current index, e.g. Employee.
Ancestor (...) Key

All non-root entities have exactly one parent entity; a parent entity, in turn, can have a parent and so on. If the current custom index uses ancestor filters, then one index row is inserted into this Bigtable per ancestor including a row for the current entity itself – in other words, the index row is exploded. If the custom index does not use ancestor filters, this column is left empty.

Property values (...) * Each index row contains one column per referenced property; the total space needed depends on the type and value of the property (see Properties).
Entity key Key The entity key is stored so the entity itself can be efficiently retrieved if it's returned as a result for the executed query.

Custom indexes table

As described above, custom indexes are needed to execute more complex queries, usually using multiple properties. The table below is used to define the custom indexes that have been created for a given application. If a query is attempted that doesn't have a corresponding index in this table, the system will throw an exception indicating that a new index is needed in either index.yaml (Python and Go) or datastore-indexes.xml (Java).

The columns of this table and the data contained therein are detailed below.

Index definition

The index definition is a set of columns that describe the data stored in the index, including the kind of entities returned and the properties used to filter and sort the results.

Name Type Description
Kind str This column contains the name of the model or class of objects being targeted by the current index, e.g. Employee.
Ancestor bool

This column contains a boolean value indicating whether the current index uses ancestor filters, i.e. whether the index can only return entities below a given entity in the hierarchy.

If this value is true, then the actual index rows in the EntitiesByCompositeProperty Bigtable are exploded, meaning an index row is added for each ancestor of a given entity. Unlike exploding indexes caused by multi-valued properties, this explosion should not cause any issues unless the entity is very deep in its entity group hierarchy. That said, you should understand that ancestor queries will take up more space since multiple index rows are needed to support them.

Property names (...) str

Custom indexes almost always involve multiple properties. The names of each property referenced in the corresponding query are stored as individual columns in the index row.

Ascending (...) bool For every property column, an additional column is needed to indicate whether the property is sorted from lowest value to highest (true) or from highest value to lowest (false).

Be very cautious when defining a custom index involving several multi-valued properties (e.g. StringListProperty) or when defining an ancestor-enabled index that also involves one or more multi-valued properties. As noted above, a single multi-valued property results in several index rows (one row per value), but when multiple multi-valued properties are involved, a row must be included for every combination of the values of every property. This is relevant for ancestor-enabled indexes as well since ancestors are also "exploded." See the documentation (Python | Java | Go) for more information on exploding indexes.
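The growth described above is multiplicative, which a short Python 3 sketch makes concrete. The `composite_index_rows` helper is hypothetical, and it assumes that the ancestor explosion multiplies with the per-property value combinations, which is a reading of the description above rather than a documented formula.

```python
def composite_index_rows(path_length, property_values, ancestor=False):
    """Estimate the index rows written for one entity under one custom index.

    path_length: number of entities in the key path, including the entity itself.
    property_values: one list of values per property referenced by the index.
    ancestor: whether the index uses ancestor filters.
    """
    combinations = 1
    for values in property_values:
        combinations *= len(values)
    # Assumption: one set of rows per entity in the path when ancestors are used.
    return combinations * (path_length if ancestor else 1)


tags = ["a", "b", "c"]
categories = ["w", "x", "y", "z"]

flat = composite_index_rows(3, [tags, categories])
exploded = composite_index_rows(3, [tags, categories], ancestor=True)
print(flat, exploded)
```

With only three tags and four categories, an entity two levels below its root already needs dozens of rows in an ancestor-enabled index, so list sizes and hierarchy depth are worth checking before defining such an index.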

Additional metadata

Custom index rows also contain the following columns in addition to the ones above.

Name Type Description
Index ID int64 This column contains the 64-bit identifier for this custom index.
Application ID str This column contains the application for which this index is defined.
State int32 Custom indexes are always in one of four states: Building, Serving, Deleting, or Error. These states are encoded as 32-bit integer values and stored in this column.

ID sequences table

ID sequences are used to generate numerical datastore IDs for both entities and custom indexes. Every application currently has at least two rows in this Bigtable, one for all root entities and another for all custom index definitions. In addition, there is one row per entity group that has or has ever had a non-root entity. So if your app only uses root entities, it will only need two rows in this Bigtable, but there can be many more depending on the number of entity groups with non-root entities.

Name Type Description
Application ID str This is the identifier you selected when you created the application in the Administration Console.
Entity key Key If this ID sequence is for a specific entity group, this column contains the key of that entity group's root entity. The root entities and custom index ID sequences use special placeholder values here.
Next ID int64 This column contains the next ID to be assigned for a given resource (a new root entity, new custom index definition, or new child under an existing root entity, depending on the row type).
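The row count for this table follows directly from the rules above: two fixed rows per application plus one row per entity group that contains a non-root entity. The Python 3 sketch below counts them from a list of key paths; `id_sequence_rows` and the sample paths are hypothetical, and because it only sees current entities, it cannot account for groups that used to have non-root entities.

```python
def id_sequence_rows(entity_paths):
    """Estimate the rows an application needs in the ID sequences table.

    entity_paths: iterable of key paths, each a tuple of (kind, id_or_name)
    pairs starting at the root entity.
    """
    # Entity groups (identified by their root key) with at least one non-root entity.
    groups_with_children = {path[0] for path in entity_paths if len(path) > 1}
    # Two fixed rows: one for all root entities, one for custom index definitions.
    return 2 + len(groups_with_children)


paths = [
    (("Company", 1),),
    (("Company", 1), ("Employee", 1)),
    (("Company", 1), ("Employee", 2)),
    (("Company", 2),),                  # root entity with no children
]
row_count = id_sequence_rows(paths)
print(row_count)
```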

Conclusion

The sections above describe the Bigtables used by App Engine's datastore layer to store all entities and indexes across all App Engine applications and offer hints to decrease the storage needed for your application, e.g. declaring properties that won't be used in any queries as unindexed so single-property indexes are not unnecessarily built. With any luck, this article helped you better understand how the datastore works and how your storage quota is being used.
