Google App Engine

Documents and Indexes


The Search API provides a model for indexing documents that contain structured data. You can search an index, and organize and present search results. The API supports partial text matching on string fields. Documents and indexes are saved in a separate persistent store optimized for search operations. The Search API can index any number of documents. However, an index search can find no more than 10,000 matching documents. The App Engine Datastore may be more appropriate for applications that need to retrieve very large result sets.

Note: The Search API is available only to applications using the High Replication Datastore (HRD). If your application uses the now-deprecated Master/Slave Datastore, migrate to HRD.

  1. Overview
  2. Documents and fields
  3. Creating a document
  4. Working with an index
  5. Index schemas
  6. Viewing indexes in the Admin Console
  7. Search API quotas and pricing

Overview

The Search API is based on four main concepts: documents, indexes, queries, and results.

Documents

A document is an object with a unique ID and a list of fields containing user data. Each field has a name and a type. There are several types of fields, identified by the kinds of values they contain:

  • Atom Field - an indivisible character string
  • Text Field - a plain text string that can be searched word by word
  • HTML Field - a string that contains HTML markup tags; only the text outside the markup tags can be searched
  • Number Field - a floating point number
  • Date Field - a date object with year/month/day and optional time
  • Geopoint Field - a data object with latitude and longitude coordinates

The maximum size of a document is 1 MB.

Indexes

An index stores documents for retrieval. You can retrieve a single document by its ID, a range of documents with consecutive IDs, or all the documents in an index. You can also search an index to retrieve documents that satisfy given criteria on fields and their values, specified as a query string. You can manage groups of documents by putting them into separate indexes.

There is no limit to the number of documents in an index, or the number of indexes you can use. However, the total size of all the documents in a single index cannot be more than 10 GB.

Queries

To search an index, you construct a query, which has a query string and, optionally, some additional options. A query string specifies conditions for the values of one or more document fields. When you search an index, you get back only those documents in the index with fields that satisfy the query.

The simplest query, sometimes called a "global search", is a string that contains only field values. This example searches for documents that contain the words "rose" and "water":

index.search("rose water")

This one searches for documents with date fields that contain the date July 4, 1776, or text fields that include the string "1776-07-04":

index.search("1776-07-04")

A query string can also be more specific. It can contain one or more terms, each naming a field and a constraint on the field's value. The exact form of a term depends on the type of the field. For instance, assuming there is a text field called "product", and a number field called "price", here's a query string with two terms:

// search for documents with pianos that cost less than $5000
index.search("product = piano AND price < 5000")

Query options, as the name implies, are not required. They enable a variety of features:

  • Control how many documents are returned in the search results.
  • Specify what document fields to include in the results. The default is to include all the fields from the original document. You can specify that the results only include a subset of fields (the original document is not affected).
  • Sort the results.
  • Create "computed fields" for documents using FieldExpressions and abridged text fields using snippets.
  • Support paging through the search results by returning only a portion of the matched documents on each query (using offsets and cursors).

Search results

A call to search() can only return a limited number of matching documents. Your search may find more documents than can be returned in a single call. Each search call returns an instance of the SearchResults class, which contains information about how many documents were found and how many were returned, along with the list of returned documents. You can repeat the same search, using cursors or offsets to retrieve the complete set of matching documents.
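The repeat-the-search pattern described above can be sketched roughly as follows, using offsets. Note that search_page is a hypothetical callable, not part of the Search API: it stands in for whatever wrapper your application writes around a search call that accepts an offset and a limit, and it is assumed to return a list of documents for that page.

```python
def fetch_all_matches(search_page, page_size=100):
    """Collect every match by repeating a search with growing offsets.

    search_page is a hypothetical callable (offset, limit) -> list of
    documents; it is an assumption of this sketch, not a Search API function.
    """
    if page_size < 1:
        raise ValueError('page_size must be at least 1')
    matches = []
    offset = 0
    while True:
        page = search_page(offset, page_size)
        matches.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += len(page)
    return matches
```

Passing the search call in as a parameter keeps the loop independent of how the query and its options are built.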

Additional training material

In addition to this documentation, you can read the two-part training class on the Search API at the Google Developer's Academy. The class includes a sample Python application.

Documents and fields

The Document class represents documents. Each document has a document identifier and a list of fields.

Document identifier

Every document in an index must have a unique document identifier, or doc_id. The identifier can be used to retrieve a document from an index without performing a search. By default, the Search API automatically generates a doc_id when a document is created. You can also specify the doc_id yourself when you create a document. A doc_id must contain only visible, printable ASCII characters (ASCII codes 33 through 126 inclusive) and be no longer than 500 characters. A document identifier cannot begin with an exclamation point ('!'), and it can't begin and end with double underscores ("__").
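If you generate your own identifiers, it may help to check them against these rules before creating a document. The helper below is a minimal sketch, not part of the Search API; the name is_valid_doc_id is made up for illustration, and it reads "begin and end with double underscores" as meaning both at once. It also assumes an empty string is not an acceptable identifier.

```python
def is_valid_doc_id(doc_id):
    """Return True if doc_id satisfies the rules listed above."""
    # Assumption: an empty identifier is rejected.
    if not doc_id or len(doc_id) > 500:
        return False
    # Only visible, printable ASCII: codes 33 through 126 inclusive.
    if any(not 33 <= ord(ch) <= 126 for ch in doc_id):
        return False
    if doc_id.startswith('!'):
        return False
    # Reading "begin and end with double underscores" as "both at once".
    if doc_id.startswith('__') and doc_id.endswith('__'):
        return False
    return True
```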

While it is convenient to create readable, meaningful unique document identifiers, you cannot include the doc_id in a search. Consider this scenario: You have an index with documents that represent parts, using the part's serial number as the doc_id. It will be very efficient to retrieve the document for any single part, but it will be impossible to search for a range of serial numbers along with other field values, such as purchase date. Storing the serial number in an atom field solves the problem.

Document fields

A document contains fields that have a name, a type, and a single value of that type. Two or more fields can have the same name, but different types. For instance, you can define two fields with the name "age": one with a text type (the value "twenty-two"), the other with a number type (value 22).

Field names

There is a limit of 1000 unique field names over all the documents in an index. Note the limit is imposed on field names, not fields.

Field names are case sensitive and can only contain ASCII characters. They must start with a letter and can contain letters, digits, or underscores. A field name cannot be longer than 500 characters.
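The naming rules translate naturally into a regular expression. The following is a sketch under that reading; is_valid_field_name is a hypothetical helper, not a Search API function.

```python
import re

# A letter first, then any mix of ASCII letters, digits, and underscores.
_FIELD_NAME = re.compile(r'[A-Za-z][A-Za-z0-9_]*\Z')


def is_valid_field_name(name):
    """Return True if name follows the field naming rules described above."""
    return len(name) <= 500 and _FIELD_NAME.match(name) is not None
```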

Multi-valued fields

A field can contain only one value, which must match the field's type. Field names do not have to be unique. A document can have multiple fields with the same name and same type, which is a way to represent a field with multiple values. (However, date and number fields with the same name can't be repeated.) A document can also contain multiple fields with the same name and different field types.
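To make the repetition rule concrete, the sketch below scans a list of (name, type) pairs and reports the combinations that would not be allowed. It works on plain tuples rather than real field objects, and the type names 'DATE' and 'NUMBER' are just labels chosen for this illustration.

```python
from collections import Counter

# Field types that, per the text above, may not repeat under one name.
NON_REPEATABLE_TYPES = {'DATE', 'NUMBER'}


def illegal_repeats(fields):
    """Return the sorted (name, type_name) pairs that repeat illegally.

    fields is an iterable of (name, type_name) tuples.
    """
    counts = Counter(fields)
    return sorted(pair for pair, n in counts.items()
                  if n > 1 and pair[1] in NON_REPEATABLE_TYPES)
```

For example, two atom fields named "tag" are fine, and so are a text field and a number field that share the name "age", but two number fields named "price" are reported.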

Field types

There are three kinds of fields that store character strings; collectively we refer to them as string fields:

  • Text Field: A string with maximum length 1024**2 characters.
  • HTML Field: An HTML-formatted string with maximum length 1024**2 characters.
  • Atom Field: A string with maximum length 500 characters.

There are also three field types that store non-textual data:

  • Number Field: A double precision floating point value between -2,147,483,647 and 2,147,483,647.
  • Date Field: A datetime.date or datetime.datetime.
  • Geopoint Field: A point on earth described by latitude and longitude coordinates.

The field types are specified by the classes TextField, HtmlField, AtomField, NumberField, DateField, and GeoField.

Special treatment of string and date fields

When a document with date, text, or HTML fields is added to an index, some special handling occurs. It's helpful to understand what's going on "under the hood" in order to use the Search API effectively.

Tokenizing string fields

When an HTML or text field is indexed, its contents are tokenized. The string is split into tokens wherever whitespace or special characters (punctuation marks, hash sign, etc.) appear. The index will include an entry for each token. This enables you to search for keywords and phrases comprising only part of a field's value. For instance, a search for "dark" will match a document with a text field containing the string "it was a dark and stormy night", and a search for "time" will match a document with a text field containing the string "this is a real-time system".

In HTML fields, text within markup tags is not tokenized, so a document with an HTML field containing "it was a <strong>dark</strong> night" will match a search for "night", but not for "strong". If you want to be able to search markup text, store it in a text field.

Atom fields are not tokenized. A document with an atom field that has the value "bad weather" will only match a search for the entire string "bad weather". It will not match a search for "bad" or "weather" alone.

Note that the underscore (_) and ampersand (&) characters do not break words, and non-western languages, like Japanese and Chinese, use other tokenization rules.
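The following sketch approximates these tokenization rules in plain Python so you can experiment with how a value might be split. It is only an illustration: the real tokenizer runs inside the search service and may differ in details, and the language-specific rules for Japanese, Chinese, and similar languages are not modeled. The function names are made up for this example.

```python
import re


def tokenize_text(value):
    """Split on whitespace and punctuation, keeping '_' and '&' inside words."""
    return [token for token in re.split(r'[^\w&]+', value) if token]


def tokenize_html(value):
    """Drop markup tags first, then tokenize the remaining text."""
    return tokenize_text(re.sub(r'<[^>]*>', ' ', value))


def tokenize_atom(value):
    """Atom fields are indexed whole, as a single entry."""
    return [value]
```

With these helpers, "this is a real-time system" yields the separate tokens "real" and "time", while the atom value "bad weather" stays a single entry.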

Date field accuracy

When you create a date field in a document, you set its value to a datetime.date or datetime.datetime. For the purpose of indexing and searching the date field, any time component is ignored and the date is converted to the number of days since 1/1/1970 UTC. This means that even though a date field can contain a precise time value, a date query can only specify a date field value in the form yyyy-mm-dd. This also means the sorted order of date fields with the same date is not well-defined.
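The conversion can be pictured with a few lines of Python. This is a sketch of the arithmetic only, assuming the value is already expressed in UTC; indexed_day_number is a hypothetical name, not an API call.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def indexed_day_number(value):
    """Drop any time component and count whole days since 1970-01-01."""
    # datetime is a subclass of date, so check for it first.
    day = value.date() if isinstance(value, datetime) else value
    return (day - EPOCH).days
```

Two datetime values that fall on the same day map to the same number, which is why their relative sort order is not well-defined.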

Other document properties

The rank of a document is a positive integer which determines the default ordering of documents returned from a search. By default, the rank is set at the time the document is created to the number of seconds since January 1, 2011. You can set the rank explicitly when you create a document. If you specify sort options, you can use the rank as a sort key. Note that when rank is used in a sort expression or field expression it is referenced as _rank.
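As a rough illustration of the default, the sketch below computes the number of seconds between January 1, 2011 and a given creation time. It assumes the reference point is midnight UTC and that created_at is a naive datetime in UTC; neither detail is stated above, and default_rank is a name invented for this example.

```python
from datetime import datetime

# Assumed reference point: midnight UTC on January 1, 2011.
RANK_EPOCH = datetime(2011, 1, 1)


def default_rank(created_at):
    """Seconds elapsed between January 1, 2011 and created_at."""
    return int((created_at - RANK_EPOCH).total_seconds())
```

Because later documents get larger ranks, sorting by decreasing rank lists the most recently created documents first.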

The language property specifies the language in which the fields are encoded.

See the Document class reference page for more details about these attributes.

Linking from a document to other resources

You can use a document's doc_id and other fields as links to other resources in your application. For example, if you use Blobstore you can associate the document with a specific blob by setting the doc_id or the value of an Atom field to the BlobKey of the data.

Creating a document

The following code sample shows how to create a document object. The Document constructor is called with the fields argument set to a list of field objects. Each object in the list is created and initialized by using the constructor function of the field's class. Note the use of the GeoPoint constructor and the Python datetime class to create the appropriate types of field values.

from datetime import datetime
from google.appengine.api import search

my_document = search.Document(
    # Setting the doc_id is optional. If omitted, the search service
    # will create an identifier.
    doc_id='PA6-5000',
    fields=[
        search.TextField(name='customer', value='Joe Jackson'),
        search.HtmlField(name='comment', value='this is <em>marked up</em> text'),
        search.NumberField(name='number_of_visits', value=7),
        search.DateField(name='last_visit', value=datetime.now()),
        search.DateField(name='birthday', value=datetime(year=1960, month=6, day=19)),
        search.GeoField(name='home_location', value=search.GeoPoint(37.619, -122.37))
    ])

Working with an index

Putting documents in an index

When you put a document into an index, the document is copied to persistent storage and each of its fields is indexed according to its name, type, and the doc_id.

The following code example shows how to access an Index and put a document into it.

import logging

from google.appengine.api import search

# create a document
...

try:
    index = search.Index(name="myIndex")
    index.put(document)
except search.Error:
    logging.exception('Put failed')
...

You can pass up to 200 documents at a time to the put() method. Batching puts is more efficient than adding documents one at a time.
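When you have more than 200 documents to add, you can split them into batches yourself. The helper below is a minimal sketch; put_in_batches is not part of the Search API, and it only assumes that the index object has a put() method that accepts a list of documents and returns a list of results.

```python
def put_in_batches(index, documents, batch_size=200):
    """Call index.put() on slices of at most batch_size documents.

    index is any object with a put(list_of_documents) method that returns
    a list of results, such as a search.Index instance.
    """
    if not 1 <= batch_size <= 200:
        raise ValueError('batch_size must be between 1 and 200')
    documents = list(documents)
    results = []
    for start in range(0, len(documents), batch_size):
        results.extend(index.put(documents[start:start + batch_size]))
    return results
```

The sketch does not catch errors; in an application you would wrap the call in the same try/except block used for a single put.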

When you put a document into an index and the index already contains a document with the same doc_id, the new document replaces the old one. No warning is given. You can call Index.get(id) before creating or adding a document to an index to check whether a specific doc_id already exists.

The put method returns a list of PutResults, one for each document passed as an argument. If you did not specify the doc_id yourself, you can examine the id attribute of the result to discover the doc_id that was generated:

results = index.put(document)
doc_id = results[0].id

Note that creating an instance of the Index class does not guarantee that a persistent index actually exists. A persistent index is created the first time you add a document to it with the put method. If you want to check whether or not an index actually exists before you start to use it, use the search.get_indexes() function.

Updating documents

A document cannot be changed once you've added it to an index. You can't add or remove fields, or change a field's value. However, you can replace the document with a new document that has the same doc_id.

Retrieving documents by doc_id

There are two ways to retrieve documents from an index using document identifiers:
  • Use Index.get() to fetch a single document by its doc_id.
  • Use Index.get_range() to retrieve a group of consecutive documents ordered by doc_id.

Each call is demonstrated in the example below.

index = search.Index(name="myIndex")

# Fetch a single document by its doc_id
doc = index.get("AZ125")

# Fetch a range of documents by their doc_ids
response = index.get_range(start_id="AZ125", limit=100)

Searching for documents by their contents

To retrieve documents from an index, you construct a query string and call Index.search(). The query string can be passed directly as the argument, or you can include the string in a Query object which is passed as the argument. By default, search() returns matching documents sorted in order of decreasing rank. To control how many documents are returned, how they are sorted, or add computed fields to the results, you need to use a Query object, which contains a query string and can also specify other search and sorting options.

import logging

from google.appengine.api import search
...
index = search.Index(name="myIndex")
query_string = "product: piano AND price < 5000"
try:
    results = index.search(query_string)

    # Iterate over the documents in the results
    for scored_document in results:
        # Handle each result; here we just log its doc_id
        logging.info('Matched document %s', scored_document.doc_id)

except search.Error:
    logging.exception('Search failed')

Deleting documents from an index

You can delete documents from an index by passing the doc_id of one or more documents to the Index.delete() method. To get a range of document IDs in an index, set the ids_only argument of the Index.get_range() method to True. The API then returns document objects populated only with the doc_id. You can delete the documents by passing those document identifiers to the delete() method:

from google.appengine.api import search
...
def delete_all_in_index(index_name):
    """Delete all the docs in the given index."""
    doc_index = search.Index(name=index_name)

    # looping because get_range by default returns up to 100 documents at a time
    while True:
        # Get a list of documents populating only the doc_id field and extract the ids.
        document_ids = [document.doc_id
                        for document in doc_index.get_range(ids_only=True)]
        if not document_ids:
            break
        # Delete the documents for the given ids from the Index.
        doc_index.delete(document_ids)

You can pass up to 200 documents at a time to the delete() method. Batching deletes is more efficient than handling them one at a time.

Eventual consistency

When you put, update, or delete a document in an index, the change propagates across multiple data centers. This usually happens quickly, but the time it takes is variable. The Search API guarantees eventual consistency. This means that in some cases, if you perform a search or retrieve one or more documents by id, the results may not reflect the most recent change.

Determining the size of an index

The total size of all documents in an index cannot be more than 10 GB. (The index property storage_limit is the maximum allowable size of an index.)

The index property storage_usage is an estimate of the amount of storage space used by an index. This number is an estimate because the index monitoring system does not run continuously; the actual usage is computed periodically. The storage_usage is adjusted between sampling points by accounting for document additions, but not deletions.

Index schemas

Every index has a schema that shows all the field names and field types that appear in the documents it contains. You cannot define a schema yourself. Schemas are maintained dynamically; they are updated as documents are added to an index. A simple schema might look like this, in JSON-like form:

{'comment': ['TEXT'], 'date': ['DATE'], 'author': ['TEXT'], 'count': ['NUMBER']}

Each key in the dictionary is the name of a document field. The key's value is a list of the field types used with that field name. If you have used the same field name with different field types, the schema will list more than one field type for a field name, like this:

{'ambiguous_integer': ['TEXT', 'NUMBER', 'ATOM']}

Once a field appears in a schema it can never be removed. There is no way to delete a field, even if the index no longer contains any documents with that particular field name.
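The way a schema grows can be mimicked with an ordinary dictionary. The sketch below is only a model of the behavior described above, working on (name, type) tuples rather than real documents; update_schema is a hypothetical helper, and the service maintains the real schema for you.

```python
def update_schema(schema, fields):
    """Record each (name, type_name) pair in schema; entries are only added."""
    for name, type_name in fields:
        types = schema.setdefault(name, [])
        if type_name not in types:
            types.append(type_name)
    return schema
```

There is deliberately no matching removal step: once a name and type pair is recorded, it stays in the dictionary even if no document uses it any more.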

You can view the schemas for your indexes like this:

from google.appengine.api import search
...
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.name)
    logging.info("schema: %s", index.schema)

When you use the get_indexes function, the keyword arguments limit and offset cannot be larger than 1000. This means that even if you have more than 1000 indexes, you can never retrieve more than 1000 indexes at a time, and you can never retrieve more than the first 2000 indexes (with offset=1000 and limit=1000).

A schema does not define a "class" in the object-oriented programming sense. As far as the Search API is concerned, every document is unique and indexes can contain different kinds of documents. If you want to treat collections of objects with the same list of fields as instances of a class, that's an abstraction you must enforce in your code. For instance, you could ensure that all documents with the same set of fields are kept in their own index. The index schema could be seen as the class definition, and each document in the index would be an instance of the class.

Viewing indexes in the Admin Console

You can view information about your application's indexes, and the documents they contain, by clicking the application's name in the App Engine Administration Console. In the sidebar section labeled Data, click the Text Search link to see a list of the application's indexes. Clicking an index name displays the documents that index contains. You'll see all the defined schema fields for the index; for each document with a field of that name, you'll see the field's value. You can also issue queries on the index data directly from the Administration Console.

Search API quotas and pricing

The Search API has a free quota of 20,000 API calls per day. The quota only applies to these calls:

When indexing multiple documents in a single call, the call count is increased by the number of documents indexed.

If you enable billing for your app, you will be charged for additional usage. API usage is counted and billed in different ways depending on the type of call:

  • Index.search(): There are separate quotas for simple and complex queries. A query is complex if its query string includes the name of a geopoint field or at least one OR or NOT boolean operator. A query is also complex if it uses query options to specify non-default sorting or scoring, field expressions, or snippets. Otherwise the query is simple.
  • Index.put(): When you add documents to indexes, the size of each document counts towards the indexing quota.
  • All other Search API calls are counted by the number of operations they involve. These calls are subject to a limit of 1,000 operations per day. The number of operations charged depends on the call:
    • search.get_indexes(): 1 op billed for each index actually returned, or 1 op if nothing is returned.
    • Index.get() and Index.get_range(): 1 op billed for each document actually returned, or 1 op if nothing is returned.
    • Index.delete(): 1 op billed for each document in the request, or 1 op if the request is empty.
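The three rules above share one shape: one operation per item, with a minimum of one operation per call. A small sketch of that arithmetic follows; billed_operations is a name invented for this example.

```python
def billed_operations(item_count):
    """One op per item returned or requested, with a minimum of one per call."""
    if item_count < 0:
        raise ValueError('item_count cannot be negative')
    return max(1, item_count)
```

For instance, a get_range() call that returns 37 documents counts as 37 operations, while a call that returns nothing still counts as 1.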

Free quotas and pricing are detailed in the table below. For more information on quotas, see Quotas, and the Quota Details section of the Admin Console.

Resource or API Call                     Free Quota                 Pricing
Total Storage (Documents and Indexes)    0.25 GB                    $0.18 per GB per month
Simple Search Queries                    1,000 queries per day      $0.13 per 10K queries
Complex Search Queries                   100 queries per day        $0.60 per 10K queries
Put Requests                             0.01 GB per day            $2.00 per GB
Other API Calls (billed as operations)   1,000 operations per day   $0.10 per 10K operations
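To get a feel for the numbers, the sketch below estimates one day's charge from the daily rows of the table. It assumes that usage above each free quota is billed proportionally, which is a simplification made for this example rather than a statement of how billing is actually computed. Storage is left out because it is priced per month.

```python
# Free daily quotas copied from the table above.
FREE_SIMPLE_QUERIES = 1000
FREE_COMPLEX_QUERIES = 100
FREE_PUT_GB = 0.01
FREE_OPERATIONS = 1000


def estimate_daily_charge(simple_queries, complex_queries, put_gb, operations):
    """Estimate one day's charge in dollars for usage above the free quotas."""
    def over(used, free):
        # Usage within the free quota costs nothing.
        return max(0, used - free)

    # Assumption: charges scale linearly with usage beyond each quota.
    return (over(simple_queries, FREE_SIMPLE_QUERIES) / 10000.0 * 0.13
            + over(complex_queries, FREE_COMPLEX_QUERIES) / 10000.0 * 0.60
            + over(put_gb, FREE_PUT_GB) * 2.00
            + over(operations, FREE_OPERATIONS) / 10000.0 * 0.10)
```

Under these assumptions, an app that runs 11,000 simple queries in a day and stays within every other quota would pay for 10,000 queries.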

The Search API imposes these throughput limits to ensure the reliability of the service:

  • 100 minutes of search execution time per minute
  • 15K Documents added/deleted per minute

Note that although these limits are enforced by the minute, the Admin Console displays the daily totals for each. Customers with Silver, Gold, or Platinum support can request higher throughput limits by contacting their support representative.
