Index your data

Google Cloud Search provides cloud-based search capability over G Suite data. The Cloud Search Indexing API extends Cloud Search to include third-party data stored in third-party repositories.

A typical integration between the Cloud Search Indexing API and third-party content repositories involves development of a custom application—often referred to as a connector—that performs the following:

  1. Pulls data from third-party content repositories.
  2. Pushes third-party content into Cloud Search.
  3. Optionally, listens to change notifications from third-party content repositories and turns these notifications into indexing requests to keep the Cloud Search index in sync with the third-party repository.
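
The three steps above can be sketched as a small loop. This is only an illustration: `fetch_all`, `fetch_changes`, and `push_to_index` are hypothetical callables standing in for repository access and an Indexing API client; they are not names from the API.

```python
from typing import Callable, Dict, Iterable

# Hypothetical document shape used only for this sketch.
Doc = Dict[str, str]

def run_connector(
    fetch_all: Callable[[], Iterable[Doc]],
    fetch_changes: Callable[[], Iterable[Doc]],
    push_to_index: Callable[[Doc], None],
) -> int:
    """Pull every document, push it, then replay change notifications."""
    pushed = 0
    for doc in fetch_all():        # step 1: pull from the repository
        push_to_index(doc)         # step 2: push into the index
        pushed += 1
    for doc in fetch_changes():    # step 3 (optional): apply changes
        push_to_index(doc)
        pushed += 1
    return pushed

# In-memory stand-ins for the repository and the index.
index: Dict[str, str] = {}
repo = [{"id": "a", "content": "v1"}, {"id": "b", "content": "v1"}]
changes = [{"id": "a", "content": "v2"}]

def store(doc: Doc) -> None:
    index[doc["id"]] = doc["content"]

count = run_connector(lambda: repo, lambda: changes, store)
```

In a real connector the change step would typically run continuously or on a schedule rather than once.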

Alternatively, such integrations can be built into third-party repositories to push content directly to the Cloud Search Indexing API.

Integrations of third-party content repositories with the Cloud Search Indexing API can be developed and hosted by content providers, independent system integrators, or Cloud Search customers themselves.

Items

The Cloud Search Indexing API refers to each indexable resource as an item. An item in the Cloud Search index consists of:

  • Indexable content
  • Associated metadata
  • ACLs / permissions

As you index each item, it must be identified using a unique identifier (typically based on the id field from a third-party content repository). Each indexing request must include that unique identifier as part of the request.

The Cloud Search Indexing API does not support partial updates. Fields with no provided values are cleared out in the update. All metadata and ACLs should always be specified when updating item content, even if they are not changing.
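
The effect of full-replacement updates can be modeled with a plain dictionary. The sketch below is not the API itself: `apply_index_request` and the request keys `content`, `metadata`, and `readers` are simplified, hypothetical stand-ins chosen to show why every field has to be resent.

```python
from typing import Any, Dict

def apply_index_request(index: Dict[str, Dict[str, Any]],
                        request: Dict[str, Any]) -> None:
    """Model of full replacement: the stored item becomes exactly what the
    request carries, so any field that is omitted ends up cleared."""
    item_id = request["id"]  # the unique identifier is mandatory
    index[item_id] = {
        "content": request.get("content"),
        "metadata": request.get("metadata"),
        "readers": request.get("readers"),
    }

index: Dict[str, Dict[str, Any]] = {}
full = {"id": "doc-1", "content": "v1",
        "metadata": {"title": "Plan"}, "readers": ["user1"]}
apply_index_request(index, full)

# Sending only the new content clears the stored metadata and ACL.
apply_index_request(index, {"id": "doc-1", "content": "v2"})
lost = (index["doc-1"]["metadata"] is None
        and index["doc-1"]["readers"] is None)

# Resending everything alongside the new content keeps the item intact.
apply_index_request(index, {**full, "content": "v2"})
```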

Synchronous v. asynchronous mode

The Cloud Search Indexing API enforces quota checks for indexing requests for each customer. For indexing requests, the API has two modes, each with its own quota and indexing-to-serving latency.

  • Synchronous mode: Shorter indexing-to-serving latency but limited throughput quota (5 MB per hour). Recommended for indexing updates and changes to the repository.
  • Asynchronous mode: Higher indexing-to-serving latency but significantly larger quota / throughput for indexing requests. Recommended for initial indexing (backfill) of the entire repository.

More details about the synchronous mode quota are explained in Customer Quota Specifications.
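
One way a connector might pick between the two modes is sketched below. The function name, the mode labels, and the fallback to asynchronous mode when the hourly budget would be exceeded are assumptions made for illustration; the byte value used for "5 MB" is also an assumption, since the exact definition lives in the quota specification.

```python
# Assumed interpretation of the 5 MB per hour synchronous quota.
SYNC_QUOTA_BYTES_PER_HOUR = 5 * 1024 * 1024

def choose_mode(is_backfill: bool,
                bytes_sent_this_hour: int,
                request_bytes: int) -> str:
    """Return a hypothetical mode label for one indexing request."""
    if is_backfill:
        # Initial indexing of the whole repository needs throughput.
        return "asynchronous"
    if bytes_sent_this_hour + request_bytes > SYNC_QUOTA_BYTES_PER_HOUR:
        # Assumption: fall back instead of failing when over budget.
        return "asynchronous"
    # Incremental updates benefit from the shorter latency.
    return "synchronous"

update_mode = choose_mode(False, 0, 200_000)
backfill_mode = choose_mode(True, 0, 200_000)
over_budget_mode = choose_mode(False, SYNC_QUOTA_BYTES_PER_HOUR, 1)
```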

Refer to the Cloud Search API REST Resources section for additional details regarding Items.index.

ACLs

The Cloud Search Indexing API allows documents to be indexed with their respective ACLs (Access Control Lists) from third-party systems. The API provides a rich set of ACL constructs that is powerful enough to model the ACLs of most repositories.

The Cloud Search Indexing API supports single-domain ACLs. It does not support cross-domain ACLs.

ACL entries

Access may be granted or denied to users, groups, or the entire G Suite domain. To grant access to the entire G Suite domain, use the gsuiteDomain principal as a user ID or group ID.

User and group ID mapping

Google Cloud Search enforces ACLs based on the Google user performing the query and their group memberships.

When crawling repository documents, connectors may not readily have the mapping of the repository's user IDs to Google email addresses. For this reason, the Cloud Search Indexing API provides an intermediate ID called userResourceName, which can be a string representing the user identity in the external system. Principals can be set to external user IDs in Items.index calls. A mapping of each external user ID to a Google email address can then be provided in a separate process via a Directory API request.

Note that the user won't get access to repository documents until the external ID is mapped to a Google email address.
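
The consequence of a missing mapping can be illustrated with a small in-memory model. The helper names, the external IDs, and the email addresses below are hypothetical; the real mapping is supplied through a Directory API request, which this sketch does not call.

```python
from typing import Dict, List

def can_search(email: str, readers: List[str],
               id_to_email: Dict[str, str]) -> bool:
    """True if one of the item's external reader IDs maps to this email."""
    mapped = {id_to_email[r] for r in readers if r in id_to_email}
    return email in mapped

readers = ["ext-101", "ext-102"]           # external IDs set at index time
mapping = {"ext-101": "ana@example.com"}   # ext-102 is not mapped yet

ana_ok = can_search("ana@example.com", readers, mapping)
bo_before = can_search("bo@example.com", readers, mapping)

# Once the mapping is provided, the second user gains access.
mapping["ext-102"] = "bo@example.com"
bo_after = can_search("bo@example.com", readers, mapping)
```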

[Note: Group mappings are TBD.]

ACLs and the Items.index method

The Cloud Search Indexing API Items.index method contains the following ACL-related properties/fields in the Item object:

  • inheritAclFrom — The document from which to inherit its ACL.

  • inheritanceType — The inheritance type determines how the ACL set on a document is evaluated with respect to the inheritAclFrom document above.

  • readers[] — List of allowed principals for this document.

  • deniedReaders[] — List of denied principals for this document. This field overrides any principals in the readers[] field.

  • containerName — The hierarchical parent of the current document. Deleting the containerName document will automatically delete that document and all of its descendants, including this document. Note that this document does not inherit any ACL information from the containerName document.

In Items.index, specifying inheritAclFrom will allow the item to inherit the ACL from another resource.
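
As a rough illustration of how these fields fit together, the helper below assembles them into a dictionary. It is a sketch, not the request format: `acl_fields` is a hypothetical function, principals are shown as plain strings for brevity, and the item IDs are made up.

```python
from typing import Any, Dict, Iterable, Optional

def acl_fields(readers: Iterable[str],
               denied_readers: Iterable[str] = (),
               inherit_acl_from: Optional[str] = None,
               inheritance_type: Optional[str] = None,
               container_name: Optional[str] = None) -> Dict[str, Any]:
    """Collect the ACL-related fields of an item, omitting unset ones."""
    fields: Dict[str, Any] = {
        "readers": list(readers),
        "deniedReaders": list(denied_readers),
    }
    if inherit_acl_from is not None:
        fields["inheritAclFrom"] = inherit_acl_from
        # The inheritance type is only kept alongside inheritAclFrom.
        if inheritance_type is not None:
            fields["inheritanceType"] = inheritance_type
    if container_name is not None:
        fields["containerName"] = container_name
    return fields

# An item that inherits its ACL from one item but is contained by another.
item = acl_fields(["user3"], inherit_acl_from="doc-a",
                  container_name="doc-b")
```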

ACL inheritance

Many third-party systems that may be used as source repositories have a concept of ACL inheritance, where the effective authorization decision for a specific document and a specific user is a result of a combination of the ACLs of the document itself and the ACLs of its inheritance chain. This authorization decision is determined with different rules depending on the third-party system and properties of the document.

inheritAclFrom field

Each document may have direct allowed principals and direct denied principals, specified in the readers[] and deniedReaders[] fields. A document may also inherit indirect allowed principals and indirect denied principals through the inheritAclFrom field.

In the example in Figure 1:

  • User 1 is an allowed principal of Doc A.
  • User 2 is an allowed principal of Doc B.
  • Doc B inherits the ACL of Doc A.
  • User 1 does not need to be specified explicitly as a principal of Doc B in order to gain access to Doc B; the access will be inherited because User 1 is listed as a principal of Doc A and Doc B inherits from Doc A.
  • User 2 does not have access rights to Doc A.
Drawing of connections between documents
Figure 1. The `inheritAclFrom` property in use

Inheritance type

inheritanceType should only be specified if the inheritAclFrom field is specified. The inheritanceType refers to the rule that determines how a child's ACL combines with its parent's ACL, where the parent is the document named in inheritAclFrom. Cloud Search currently supports three inheritance types:

Child override

The child_override inheritance type specifies that the child's direct allowed principals take precedence over the parent's denied principals and the child's direct denied principals take precedence over the parent's allowed principals. This is the default inheritance type when an inheritance type is not specified.

Parent override

The parent_override inheritance type specifies that the parent's allowed principals take precedence over the child's direct denied principals and the parent's denied principals take precedence over the child's direct allowed principals.

Both permit

The BOTH_PERMIT inheritance type specifies that in order to be authorized, the user must be an allowed principal of both the child and the parent.

Authorization tables

The tables below show the overall authorization decision for the three inheritance types given the authorization decision of the parent and child. In each table, the parent's effective authorization decision (P) is given in the left column, while the child's direct authorization decision (C) is given in the top row. The abbreviations for "allow", "deny", and "indeterminate" are +, -, and ? respectively. "Indeterminate" indicates that the user does not have an explicit ACL decision; that is, the user was not passed in the readers[] or deniedReaders[] properties for that request.

Child override:

  P\C   +   -   ?
  +     +   -   +
  -     +   -   -
  ?     +   -   ?

Parent override:

  P\C   +   -   ?
  +     +   +   +
  -     -   -   -
  ?     +   -   ?

Both permit:

  P\C   +   -   ?
  +     +   -   -
  -     -   -   -
  ?     -   -   -

When evaluating an ACL inheritance chain, the order of evaluation can change the outcome of the authorization decision. Cloud Search provides leaf-to-root order of evaluation for ACL inheritance chains. The ACL decision for a chain begins by evaluating a leaf with its parent, all the way to the root.
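
The three authorization tables can be expressed as a single combining function. The sketch below is a local model for reasoning about ACLs, not code that Cloud Search exposes: the function names are hypothetical, only a single parent/child hop is modeled, and the type strings are written as they appear in the text above.

```python
from typing import Iterable

ALLOW, DENY, UNKNOWN = "+", "-", "?"

def direct_decision(user: str, readers: Iterable[str],
                    denied_readers: Iterable[str] = ()) -> str:
    """Decision from one document's own ACL; denied entries win."""
    if user in denied_readers:
        return DENY
    if user in readers:
        return ALLOW
    return UNKNOWN

def combine(parent: str, child: str,
            inheritance_type: str = "child_override") -> str:
    """Combine a parent decision with a child decision per the tables."""
    if inheritance_type == "child_override":
        return child if child != UNKNOWN else parent
    if inheritance_type == "parent_override":
        return parent if parent != UNKNOWN else child
    if inheritance_type == "BOTH_PERMIT":
        return ALLOW if parent == ALLOW and child == ALLOW else DENY
    raise ValueError(f"unknown inheritance type: {inheritance_type}")

# Figure 1: Doc B inherits the ACL of Doc A with the default type.
doc_a_readers, doc_b_readers = ["user1"], ["user2"]
user1_on_b = combine(direct_decision("user1", doc_a_readers),
                     direct_decision("user1", doc_b_readers))
user2_on_b = combine(direct_decision("user2", doc_a_readers),
                     direct_decision("user2", doc_b_readers))
user2_on_a = direct_decision("user2", doc_a_readers)
```

Here both users end up allowed on Doc B, while User 2 stays indeterminate on Doc A, which matches the description of Figure 1.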

Containment and document deletion

Just as the inheritAclFrom property defines an access hierarchy, the Indexing API's containerName property allows you to specify a containment hierarchy. The containment relationship is used to propagate deletions in Google's data stores. When a document is indexed, its containerName property specifies the desired parent item. If the parent item, or any of its ancestors, is deleted, the current document is deleted too.

ACL inheritance v. containment relationship

Uploaded documents have a physical hierarchy structure, which is specified by the containerName property of the document, and an ACL inheritance structure, which is specified by the inheritAclFrom property of the document. In many cases the containerName and inheritAclFrom references will be the same, but this is not required. In some cases, the ACL inheritance may be based on a parent of the container item.

The containment relationship is a totally separate concept from ACL inheritance and has no ACL implication. The inheritAclFrom field can specify a different item than the one named in the containerName field.

That is, an item can be contained by a folder for the purpose of deletion, but inherit the ACL from a different folder. Deleting a folder does not delete descendants that inherit its ACL, unless that folder is also in the containment hierarchy for the descendant.

This separation of ACL inheritance from the containment hierarchy provides a flexible structure that can model many different existing structures.

It is not required to provide either the inheritAclFrom or the containerName property.

When a resource is successfully deleted:

  • Any resource that has specified the deleted resource's id in its containerName field will become unsearchable and will be scheduled for deletion from Google's data stores.
  • Any resource that has specified the deleted resource's id in its inheritAclFrom field will become unsearchable.

If a resource has a deleted resource id specified as the inheritAclFrom property, but it does not have a containerName property specified, or its containment hierarchy contains no deleted resources, that resource and its data will remain in Google's data stores. The customer is responsible for deleting this data.

Document hierarchy

Figure 2 shows an example of a document containment hierarchy. In Figure 2:

  • User 1 is an allowed principal of Doc A.
  • User 2 is an allowed principal of Doc B.
  • User 3 is an allowed principal of Doc C.
  • Doc C inherits the ACL of Doc A.
  • Doc B names Doc A as its container.
  • Doc C names Doc B as its container.
  • Indirect access comes from the inheritAclFrom reference.
    • User 1 can access Doc C because Doc C inherits the ACL of Doc A.
  • Indirect access does not come from the containerName reference.
    • User 2 cannot access Doc C.
Drawing of connections between documents
Figure 2. The `containerName` property in use

Figure 3 shows an example of how deletion works for a document hierarchy.

In Figure 3:

  • User 1 is an allowed principal of Doc A.
  • User 2 is an allowed principal of Doc D.
  • Doc D and Doc E both inherit the ACL of Doc A.
  • Doc D names Doc A as its container.
  • Documents A and E are root-level documents as they do not have a container document.
  • Deletes cascade through the container references.
Drawing of connections between documents
Figure 3. Deleting a document does not automatically delete documents that inherit its ACLs.

When Doc A is deleted:

  • All descendants of the inheritAclFrom reference will lose access for all users.
  • No users can access Doc A.
  • Doc D is implicitly deleted. No users can access Doc D.
  • Doc E is not deleted, as deletion only cascades through container references.
  • Doc E becomes unreachable and no users will be able to search for Doc E.
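
The behavior in Figure 3 can be reproduced with a small local model. This is a sketch under simplifying assumptions: `delete_item` and the tuple layout are hypothetical, and the model only tracks which items are deleted and which remain stored but unsearchable.

```python
from typing import Dict, Optional, Set, Tuple

# item id -> (containerName, inheritAclFrom); None means "not set"
Items = Dict[str, Tuple[Optional[str], Optional[str]]]

def delete_item(items: Items, target: str) -> Tuple[Set[str], Set[str]]:
    """Return (deleted, stored_but_unsearchable) after deleting target."""
    deleted = {target}
    changed = True
    while changed:  # deletion cascades through container references
        changed = False
        for item_id, (container, _) in items.items():
            if item_id not in deleted and container in deleted:
                deleted.add(item_id)
                changed = True
    unsearchable: Set[str] = set()
    changed = True
    while changed:  # loss of access follows inheritAclFrom references
        changed = False
        for item_id, (_, acl_parent) in items.items():
            if item_id in deleted or item_id in unsearchable:
                continue
            if acl_parent in deleted or acl_parent in unsearchable:
                unsearchable.add(item_id)
                changed = True
    return deleted, unsearchable

# Figure 3: D is contained by A; D and E both inherit the ACL of A.
figure3: Items = {"A": (None, None), "D": ("A", "A"), "E": (None, "A")}
deleted, unsearchable = delete_item(figure3, "A")
```

Doc E lands in the second set: it is still stored, so the customer remains responsible for deleting it.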

Cloud Search Indexing Queue

The Cloud Search Indexing API provides an additional set of methods to maintain state for each item in the index as well as overall state for repository traversal. These methods collectively form the Cloud Search Indexing Queue, which facilitates:

  1. Maintaining the per-item state (status, hash values, additional information required for integration) which can be used to keep the index in sync with your third-party content repositories.
  2. Maintaining a list of items to be indexed as they are discovered during the traversal process.
  3. Priority queues based on item status to determine the next set of items to be indexed.
  4. Maintaining additional state information for efficient integration, such as checkpoints and change tokens.

Your particular integration might not use any of the functionality provided by the Cloud Search Indexing Queue. Instead your integration may rely on local state and/or repository APIs or change notifications to keep the index in sync.

To make integrations easier to write, the Cloud Search Indexing Queue supports storing and retrieving item state. This allows integrations to keep track of sets of items categorized by priority and to schedule indexing without storing any local state.

Priority queues

The core of the Cloud Search Indexing Queue is a priority queue containing an entry for each item known to exist, and operations to enqueue entries and poll entries from the queue.

Status

Priority is based on the status of the entries in a queue. Here are the possible status values:

  • serverError — Enqueued items encountered asynchronous errors during the indexing process and need to be re-indexed.
  • modified — Enqueued items that were previously indexed and are known to be modified.
  • unindexed — Enqueued items that are not indexed.
  • indexed — Other indexed items.

The secondary ordering is age, from oldest to newest (FIFO) based on the time the item was enqueued in its current state by any API operation.
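
The two-level ordering can be sketched as a sort key. The sketch assumes that the status values are listed above from highest to lowest priority; the entry layout and the `enqueued_at` field are hypothetical.

```python
from typing import Dict, List

# Assumption: the list order above is the priority order.
STATUS_PRIORITY = {"serverError": 0, "modified": 1,
                   "unindexed": 2, "indexed": 3}

def queue_order(entries: List[Dict]) -> List[Dict]:
    """Sort by status priority first, then oldest enqueue time first."""
    return sorted(
        entries,
        key=lambda e: (STATUS_PRIORITY[e["status"]], e["enqueued_at"]),
    )

entries = [
    {"id": "a", "status": "indexed", "enqueued_at": 1},
    {"id": "b", "status": "modified", "enqueued_at": 5},
    {"id": "c", "status": "modified", "enqueued_at": 2},
    {"id": "d", "status": "serverError", "enqueued_at": 9},
]
ordered_ids = [e["id"] for e in queue_order(entries)]
```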

Operations on the priority queue

There are two operations on the queue:

  • Items.push — Adds IDs to the queue
  • Items.poll — Gets the highest priority entries from the queue

Items.push

The Items.push method adds IDs to the queue, if not present. This method can be called with a specific kind value, which determines the result of the push operation.

  • modified — Specify kind as modified to indicate the document is modified in the content repository and should be reindexed. This results in the entry being marked as "modified" in the Cloud Search Indexing Queue.
  • notModified — Specify kind as notModified to indicate the latest version of the item is already available in the index. This results in the entry being marked as indexed in the Cloud Search Indexing Queue.
  • repositoryError — Specify kind as repositoryError to indicate an error retrieving the item from its content repository. This results in incrementing the error count value which is used for exponential backoff. Optionally use a repositoryError message to save additional information about the error from the repository.
  • requeue — Specify kind as requeue to keep the item in its current status. This moves the entry to the end of the queue for its current status.

If the entry is not present in the queue, then kind is ignored. This results in adding a new entry with an unindexed status.

The optional payload is always stored, treated as an opaque value, and returned from Items.poll.

An Items.poll request not only returns the highest priority entries from the queue but also marks such entries as reserved. Using Items.push with kind as notModified, repositoryError, or requeue will unreserve polled entries when applicable.
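
The state changes described for each kind value can be traced with an in-memory model. This is not the service's implementation: the `push` function, the entry fields, and the choice to refresh the enqueue time on every push are assumptions made to keep the sketch small.

```python
from typing import Any, Dict, Optional

def push(queue: Dict[str, Dict[str, Any]], item_id: str,
         kind: Optional[str] = None, payload: Any = None,
         now: int = 0) -> Dict[str, Any]:
    """Apply one push to the modeled queue and return the entry."""
    entry = queue.get(item_id)
    if entry is None:
        # Unknown ID: kind is ignored and the entry starts as unindexed.
        entry = {"status": "unindexed", "errors": 0, "payload": payload,
                 "enqueued_at": now, "reserved": False}
        queue[item_id] = entry
        return entry
    if payload is not None:
        entry["payload"] = payload        # opaque value, stored as given
    if kind == "modified":
        entry["status"] = "modified"
    elif kind == "notModified":
        entry["status"] = "indexed"
        entry["reserved"] = False
    elif kind == "repositoryError":
        entry["errors"] += 1              # input to exponential backoff
        entry["reserved"] = False
    elif kind == "requeue":
        entry["reserved"] = False         # status unchanged
    entry["enqueued_at"] = now            # moves to the end of its section
    return entry

queue: Dict[str, Dict[str, Any]] = {}
first = push(queue, "doc-1", kind="modified", now=1)["status"]
second = push(queue, "doc-1", kind="modified", now=2)["status"]
errors = push(queue, "doc-1", kind="repositoryError", now=3)["errors"]
final = push(queue, "doc-1", kind="notModified", now=4)["status"]
```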

Items.push with hashes

The Cloud Search Indexing API supports specifying metadata and content hash values on Items.index requests.

Instead of specifying kind, the metadata and/or content hash values can be specified with a push request. The Cloud Search Indexing Queue compares the provided hash values with the stored values available with the item in the index. If mismatched, that entry is marked as modified. If a corresponding item doesn't exist in the index, then the hash values are ignored.
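
The comparison can be sketched as follows. The source does not say how hash values are computed, so SHA-256 is used here purely as an assumption, and keeping the current status when the hashes match is also an assumption; both function names are hypothetical.

```python
import hashlib
from typing import Optional

def content_hash(data: bytes) -> str:
    """Assumed hashing scheme for this sketch: hex-encoded SHA-256."""
    return hashlib.sha256(data).hexdigest()

def status_after_hash_push(stored_hash: Optional[str],
                           pushed_hash: str,
                           current_status: str) -> str:
    """Status of a queue entry after a push that carries a hash."""
    if stored_hash is None:
        # No corresponding item in the index: the hash is ignored.
        return current_status
    if pushed_hash != stored_hash:
        return "modified"
    return current_status  # assumption: a matching hash changes nothing

stored = content_hash(b"version 1")
same = status_after_hash_push(stored, content_hash(b"version 1"), "indexed")
changed = status_after_hash_push(stored, content_hash(b"version 2"), "indexed")
missing = status_after_hash_push(None, content_hash(b"version 2"), "unindexed")
```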

Items.poll

The Items.poll method gets the highest priority entries from the queue. A request can name the statuses of the priority queues to poll, and each returned ID carries its status. In both cases, the status is one of the four status values:

  • serverError
  • modified
  • unindexed
  • indexed

By default, entries from any section of the queue may be returned, based on priority. Each returned entry is reserved and will not be returned by other calls to Items.poll until the reservation times out because the connector did not respond to the work, or until the entry is enqueued again by Items.index or by Items.push with one of the following kind values:

  • notModified
  • repositoryError, after the error delay
  • requeue
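
Reservation on poll can be modeled in a few lines. The sketch is hypothetical throughout: the `poll` signature, the `reserved_at` field, the 60-unit timeout, and the assumed priority order of the statuses are all illustrative choices rather than documented behavior.

```python
from typing import Any, Dict, Iterable, List, Optional

# Assumption: statuses in priority order, highest first.
PRIORITY = {"serverError": 0, "modified": 1, "unindexed": 2, "indexed": 3}

def poll(queue: Dict[str, Dict[str, Any]], now: int, limit: int = 10,
         statuses: Optional[Iterable[str]] = None,
         reservation_timeout: int = 60) -> List[str]:
    """Return up to `limit` unreserved IDs by priority and reserve them."""
    wanted = set(statuses) if statuses else set(PRIORITY)
    candidates = [
        (item_id, entry) for item_id, entry in queue.items()
        if entry["status"] in wanted
        and (entry.get("reserved_at") is None
             or now - entry["reserved_at"] >= reservation_timeout)
    ]
    candidates.sort(
        key=lambda pair: (PRIORITY[pair[1]["status"]],
                          pair[1]["enqueued_at"]))
    picked = candidates[:limit]
    for _, entry in picked:
        entry["reserved_at"] = now   # hidden from other polls for a while
    return [item_id for item_id, _ in picked]

queue = {
    "a": {"status": "indexed", "enqueued_at": 1},
    "b": {"status": "modified", "enqueued_at": 2},
    "c": {"status": "serverError", "enqueued_at": 3},
}
first_batch = poll(queue, now=100, limit=2)
second_batch = poll(queue, now=110, limit=2)   # reserved entries skipped
after_timeout = poll(queue, now=165, limit=2)  # early reservations expired
```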
