The Google Cloud Search Indexing Queues

The Connector SDK and Cloud Search REST API allow the creation of Cloud Search Indexing Queues used to perform the following tasks:

  • Maintain the per-document state (status, hash values, and so on) which can be used to keep your index in sync with your repository.

  • Maintain a list of items to be indexed as discovered during the traversal process.

  • Prioritize items in queues based on item status.

  • Maintain additional state information for efficient integration, such as checkpoints, change tokens, and so on.

A queue is a label assigned to an indexed item, such as "default" for the default queue or "B" for queue B.

Status & priority

A document’s priority in a queue is based on its ItemStatus code. The following are the possible ItemStatus codes in order of priority (handled first to handled last):

  • ERROR - Item encountered an asynchronous error during the indexing process and needs to be re-indexed.

  • MODIFIED - Item that was previously indexed and has been modified in the repository since the last indexing.

  • NEW_ITEM - Item that is not indexed.

  • ACCEPTED - Document that was previously indexed and has not changed in the repository since the last indexing.

When two items in a queue have the same status, higher priority is given to the items that have been in the queue for the longest period of time.
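The ordering rule above can be illustrated with a small in-memory sketch. The QueueEntry class, the enqueued_at timestamp, and the poll_order helper are hypothetical names used only for illustration; they are not part of the Cloud Search API, which applies this ordering on the server side.

```python
from dataclasses import dataclass

# Lower number = handled earlier, following the order of the status list above.
STATUS_PRIORITY = {"ERROR": 0, "MODIFIED": 1, "NEW_ITEM": 2, "ACCEPTED": 3}


@dataclass
class QueueEntry:
    item_id: str
    status: str
    enqueued_at: float  # hypothetical timestamp; smaller means longer in the queue


def poll_order(entries):
    """Sort by status priority first, then by longest time in the queue."""
    return sorted(entries, key=lambda e: (STATUS_PRIORITY[e.status], e.enqueued_at))


entries = [
    QueueEntry("doc1", "ACCEPTED", 10.0),
    QueueEntry("doc2", "NEW_ITEM", 30.0),
    QueueEntry("doc3", "ERROR", 40.0),
    QueueEntry("doc4", "NEW_ITEM", 20.0),
    QueueEntry("doc5", "MODIFIED", 50.0),
]
print([e.item_id for e in poll_order(entries)])
# prints ['doc3', 'doc5', 'doc4', 'doc2', 'doc1']
```

Here doc4 comes before doc2 because both are NEW_ITEM and doc4 has been in the queue longer.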

Overview of using indexing queues to index a new or changed item

Figure 1 shows the steps in indexing a new or changed item using an indexing queue. These steps show REST API calls. For equivalent SDK calls, refer to Queue operations (Connector SDK).

Figure 1. Indexing steps to add or update an item
  1. The content connector uses items.push to push items (metadata and hash) into an indexing queue to establish the item's status (MODIFIED, NEW_ITEM, DELETED). Specifically:

    • When pushing, the connector explicitly includes a push type or contentHash.
    • If the connector doesn't include the type, then Cloud Search automatically uses the contentHash to determine the item's status.
    • If the item is unknown, the item status is set to NEW_ITEM.
    • If the item exists and hash values match, the status is kept as ACCEPTED.
    • If the item exists and the hashes differ, the status becomes MODIFIED.

    For more information on how item status is established, refer to the Traversing the GitHub repositories sample code in the Cloud Search getting started tutorial.

    Usually, the push is associated with content traversal and/or change detection processes in the connector.

  2. The content connector uses items.poll to poll the queue to determine items to index. Cloud Search tells the connector which items are most in need of indexing, sorted first by status code and then by time-in-queue.

  3. The connector retrieves these items from the repository and builds index API requests.

  4. The connector uses items.index to index the items. The item only enters the ACCEPTED state after Cloud Search successfully finishes processing the item.

A connector can also delete an item if it no longer exists in the repository, or push an item again if it's not modified or if there is a source repository error. For information on item deletions, see the next section.
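As a rough sketch of steps 1, 2, and 4, the request URLs and bodies might be assembled as follows. Nothing is sent over the network. The base URL, the field names (contentHash, queue, limit, itemType, version, content, inlineContent, contentFormat), and the helper function names are assumptions based on one reading of the REST reference and should be verified against it before use; SOURCE_ID is a placeholder.

```python
import base64
import json

# Assumed base URL for the indexing endpoints; verify against the REST reference.
BASE = "https://cloudsearch.googleapis.com/v1/indexing"


def push_request(source_id, item_id, content_hash, queue="default"):
    """Step 1: push metadata and hash; no type is set, so the hash decides the status."""
    name = f"datasources/{source_id}/items/{item_id}"
    body = {"item": {"name": name, "contentHash": content_hash, "queue": queue}}
    return f"{BASE}/{name}:push", body


def poll_request(source_id, queue="default", limit=10):
    """Step 2: ask for the entries most in need of indexing."""
    return f"{BASE}/datasources/{source_id}/items:poll", {"queue": queue, "limit": limit}


def index_request(source_id, item_id, text, version, queue="default"):
    """Steps 3-4: wrap content fetched from the repository in an index request."""
    name = f"datasources/{source_id}/items/{item_id}"
    item = {
        "name": name,
        "itemType": "CONTENT_ITEM",
        "queue": queue,
        # Byte fields are sent base64-encoded in the JSON body.
        "version": base64.b64encode(version.encode()).decode(),
        "content": {
            "contentFormat": "TEXT",
            "inlineContent": base64.b64encode(text.encode()).decode(),
        },
    }
    return f"{BASE}/{name}:index", {"item": item}


url, body = push_request("SOURCE_ID", "doc1", "abc123")
print(url)
print(json.dumps(body, indent=2))
```

A real connector would send each body as an authenticated POST request to the corresponding URL.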

Overview of using indexing queues to delete an item

The full-traversal strategy uses a two-queue process to index items and detect deletions. Figure 2 shows the steps in deleting an item using two indexing queues. Specifically, Figure 2 shows the second traversal performed using a full-traversal strategy. These steps use REST API calls. For equivalent SDK calls, refer to Queue operations (Connector SDK).

Figure 2. Deleting items
  1. On the initial traversal, the content connector uses items.push to push items (metadata and hash) into an indexing queue, "queue A." Each item is pushed as NEW_ITEM because it doesn't yet exist in the queue, and is assigned the label "A" for "queue A." The content is indexed into Cloud Search.

  2. The content connector uses items.poll to poll queue A to determine items to index. Cloud Search tells the connector which items are most in need of indexing, sorted first by status code and then by time-in-queue.

  3. The connector retrieves these items from the repository and builds index API requests.

  4. The connector uses items.index to index the items. The item only enters the ACCEPTED state after Cloud Search successfully finishes processing the item.

  5. The deleteQueueItems method is called on "queue B." However, because no items have been pushed to queue B yet, nothing is deleted.

  6. On the second full traversal, the content connector uses items.push to push items (metadata and hash) into queue B:

    • When pushing, the connector explicitly includes a push type or contentHash.
    • If the connector doesn't include the type, then Cloud Search automatically uses the contentHash to determine the item's status.
    • If the item is unknown, the item status is set to NEW_ITEM and the queue label is changed to "B."
    • If the item exists and hash values match, the status is kept as ACCEPTED and the queue label is changed to "B."
    • If the item exists and the hashes differ, the status becomes MODIFIED and the queue label is changed to "B."
  7. The content connector uses items.poll to poll the queue to determine items to index. Cloud Search tells the connector which items are most in need of indexing, sorted first by status code and then by time-in-queue.

  8. The connector retrieves these items from the repository and builds index API requests.

  9. The connector uses items.index to index the items. The item only enters the ACCEPTED state after Cloud Search successfully finishes processing the item.

  10. Finally, deleteQueueItems is called on queue A to delete all previously indexed Cloud Search items that still have a queue "A" label.

  11. With subsequent full traversals, the queue used for indexing and the queue used for deleting are swapped.
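The queue-swapping logic in the steps above can be modeled with a toy in-memory index. This is a minimal sketch, not the Cloud Search implementation: the IndexModel class and the full_traversal helper are hypothetical, and pushing, polling, and indexing are collapsed into a single relabeling step.

```python
class IndexModel:
    """Toy in-memory model of the index: item ID -> queue label."""

    def __init__(self):
        self.labels = {}

    def push_and_index(self, item_id, queue):
        # Pushing (and then indexing) an item moves it to the given queue label.
        self.labels[item_id] = queue

    def delete_queue_items(self, queue):
        # Remove every item that still carries the given queue label.
        stale = sorted(i for i, q in self.labels.items() if q == queue)
        for item_id in stale:
            del self.labels[item_id]
        return stale


def full_traversal(index, repository_ids, index_queue, delete_queue):
    """Relabel everything found in the repository, then delete what was left behind."""
    for item_id in repository_ids:
        index.push_and_index(item_id, index_queue)
    return index.delete_queue_items(delete_queue)


index = IndexModel()
print(full_traversal(index, ["a", "b", "c"], "A", "B"))  # prints []
# Item "b" has since been removed from the repository and "d" was added.
print(full_traversal(index, ["a", "c", "d"], "B", "A"))  # prints ['b']
print(sorted(index.labels))  # prints ['a', 'c', 'd']
```

On the second traversal, item "b" is never pushed to queue B, so it keeps the "A" label and is removed when the delete is run against queue A. A third traversal would pass "A" as the indexing queue and "B" as the deletion queue.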

Queue operations (Connector SDK)

The Content Connector SDK provides operations for pushing items to, and pulling items from, a queue.

To package and push an item to a queue, use the pushItems builder class.

You do not need to do anything specific to pull items from a queue for processing. Instead, the SDK automatically pulls items from the queue, in priority order, using the Repository class's getDoc method.

Queue operations (REST API)

The REST API provides the following two methods for pushing items to and pulling items from a queue:

  • To push an item into a queue, use Items.push.
  • To poll items in a queue, use Items.poll.

You can also use Items.index to push items to the queue during indexing. Items pushed to the queue during indexing don’t require a type and are automatically assigned a status of ACCEPTED.

Items.push

The Items.push method adds IDs to the queue. This method can be called with a specific type value, which determines the result of the push operation. For a list of type values, refer to the item.type field in the Items.push method.

Pushing a new ID adds a new entry with a NEW_ITEM ItemStatus code.

The optional payload is always stored, treated as an opaque value, and returned from Items.poll.

When an item is polled, it is reserved, meaning it cannot be returned by another call to Items.poll. Using Items.push with a type of NOT_MODIFIED, REPOSITORY_ERROR, or REQUEUE unreserves polled entries. For further information about reserved and unreserved entries, refer to the Items.poll section.

Items.push with hashes

The Cloud Search REST API supports specifying metadata and content hash values on Items.index requests. Instead of specifying a type, the metadata and/or content hash values can be specified with a push request. The Cloud Search Indexing Queue compares the provided hash values with the values stored with the item in the data source. If the values don't match, the entry is marked as MODIFIED. If a corresponding item doesn't exist in the index, the status is NEW_ITEM.
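The comparison described above can be sketched as a small decision function. This is an illustrative model only: the stored_hashes dictionary and the status_after_push helper are hypothetical, and SHA-256 is an arbitrary choice here, since any stable digest of the content would serve as a hash value.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Compute a stable digest of the content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def status_after_push(stored_hashes, item_id, pushed_hash):
    """Model how a push without a type resolves to a status from the hash."""
    if item_id not in stored_hashes:
        return "NEW_ITEM"   # no corresponding item exists
    if stored_hashes[item_id] == pushed_hash:
        return "ACCEPTED"   # hashes match, so the status is kept
    return "MODIFIED"       # hashes differ


stored = {"doc1": content_hash(b"version 1")}
print(status_after_push(stored, "doc1", content_hash(b"version 1")))  # prints ACCEPTED
print(status_after_push(stored, "doc1", content_hash(b"version 2")))  # prints MODIFIED
print(status_after_push(stored, "doc2", content_hash(b"anything")))   # prints NEW_ITEM
```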

Items.poll

The Items.poll method retrieves the highest priority entries from the queue. The requested and returned status values indicate the status(es) of the priority queue(s) requested or the status of the returned IDs.

By default, entries from any section of the queue may be returned, based on priority. Each returned entry is reserved, and is not returned by other calls to Items.poll until one of the following occurs:

  • The reservation times out.
  • The entry is enqueued again by Items.index.
  • Items.push is called with a type value of NOT_MODIFIED, REPOSITORY_ERROR, or REQUEUE.
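The reservation behavior can be modeled with a toy queue. This sketch rests on several assumptions: the ReservingQueue class is hypothetical, the 60-second reservation timeout is an arbitrary illustrative value, time is passed in explicitly as a number, and status-based priority is ignored to keep the example short.

```python
UNRESERVING_TYPES = ("NOT_MODIFIED", "REPOSITORY_ERROR", "REQUEUE")


class ReservingQueue:
    """Toy model of poll reservations; not the Cloud Search implementation."""

    def __init__(self, item_ids, reservation_seconds=60.0):
        # item ID -> time until which the entry is reserved (0.0 = unreserved)
        self.reserved_until = {item_id: 0.0 for item_id in item_ids}
        self.reservation_seconds = reservation_seconds

    def poll(self, now, limit=10):
        # Return unreserved entries (or entries whose reservation timed out)
        # and reserve each of them.
        available = [i for i, until in self.reserved_until.items() if until <= now]
        polled = available[:limit]
        for item_id in polled:
            self.reserved_until[item_id] = now + self.reservation_seconds
        return polled

    def push(self, item_id, push_type):
        # Pushing with one of these types releases the reservation.
        if push_type in UNRESERVING_TYPES:
            self.reserved_until[item_id] = 0.0


queue = ReservingQueue(["a", "b"])
print(queue.poll(now=0))   # prints ['a', 'b']
print(queue.poll(now=1))   # prints [] because both entries are reserved
queue.push("a", "REQUEUE")
print(queue.poll(now=2))   # prints ['a']
print(queue.poll(now=61))  # prints ['b'] because its reservation timed out
```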