Google-supplied connector parameters

Every connector has an associated configuration file containing parameters used by the connector, such as the ID for your repository. Parameters are defined as key-value pairs, such as api.sourceId=1234567890abcdef.

The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. Of the Google-supplied configuration parameters, only the Data source access parameters are required to be defined in your configuration file. You do not need to redefine the Google-supplied parameters in your configuration file unless you want to override their default values.

This reference describes the Google-supplied configuration parameters.

Configuration file example

The following example shows an identity connector configuration file with parameter key-value pairs.

#
# Configuration file sample
#
api.sourceId=1234567890abcdef
api.identitySourceId=0987654321lmnopq
api.serviceAccountPrivateKeyFile=./PrivateKey.json

#
# Traversal schedules
#
schedule.traversalIntervalSecs=7200
schedule.incrementalTraversalIntervalSecs=600
#
# Default ACLs
#
defaultAcl.mode=fallback
defaultAcl.public=true
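
As a rough illustration of how such a file can be read, the sketch below parses key-value text with java.util.Properties and falls back to a documented default when a key is absent. It assumes the file follows the standard Java properties format. ConfigFileSketch is a hypothetical helper, not an SDK class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Cloud Search SDK.
class ConfigFileSketch {

  // Parses configuration text, assuming the standard Java properties format.
  static Properties parse(String text) {
    Properties config = new Properties();
    try {
      config.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return config;
  }

  public static void main(String[] args) {
    String text = String.join("\n",
        "# Configuration file sample",
        "api.sourceId=1234567890abcdef",
        "schedule.traversalIntervalSecs=7200");
    Properties config = parse(text);

    // Set in the file, so the configured value is returned.
    System.out.println(config.getProperty("schedule.traversalIntervalSecs", "86400"));
    // Not set in the file, so the documented default is returned.
    System.out.println(config.getProperty("schedule.incrementalTraversalIntervalSecs", "300"));
  }
}
```

Lines that begin with # are treated as comments, and whitespace immediately after the = separator is ignored by the properties parser.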
  

Commonly set parameters

This section lists required and optional commonly set configuration parameters. If you do not change values for the optional parameters, the connector uses the default values provided by the SDK.

Data source access

The following table lists all of the parameters that are required to appear in a configuration file. The parameters you use depend on the type of connector you are building (content connector or identity connector).

Setting Parameter
Data source ID api.sourceId=1234567890abcdef

This parameter is required by a connector to identify the location of your repository. You obtained this value when you added a data source to search. This parameter must be in connector configuration files.

Identity source ID api.identitySourceId=0987654321lmnopq

This parameter is required by identity connectors to identify the location of an external identity source. You obtained this value when you mapped user identities in Cloud Search. This parameter must be in all identity connector configuration files.

Service account private key file api.serviceAccountPrivateKeyFile=./PrivateKey.json

This parameter contains the private key needed to access the repository. You obtained this value when you configured access to the Google Cloud Search REST API. This parameter must be in all configuration files.

Service account ID api.serviceAccountId=123abcdef4567890

This parameter specifies the service account ID. The default value is an empty string, which is allowed only when the configuration file specifies a private key file parameter. This parameter is required if your private key file is not a JSON key.

G Suite Account ID api.customerId=123abcdef4567890

This parameter specifies the account ID for the enterprise's G Suite account. You obtained this value when you mapped user identities in Cloud Search. This parameter is required when syncing users using an identity connector.

Root URL api.rootUrl=https://cloudsearch.googleapis.com

This parameter specifies the indexing service base URL path.

The default value for this parameter is an empty string which is converted to https://cloudsearch.googleapis.com.
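
A pre-flight check along the following lines can catch a missing required parameter before the connector starts. This is only a sketch: RequiredParametersSketch is a hypothetical class, and it assumes that api.sourceId and api.serviceAccountPrivateKeyFile are needed by every connector while api.identitySourceId is needed only by identity connectors, which is one reading of the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check; not part of the Cloud Search SDK.
class RequiredParametersSketch {

  // Returns the required keys that are missing or blank in the configuration.
  static List<String> missingKeys(Properties config, boolean identityConnector) {
    List<String> required = new ArrayList<>();
    required.add("api.sourceId"); // assumption: required for every connector
    required.add("api.serviceAccountPrivateKeyFile");
    if (identityConnector) {
      required.add("api.identitySourceId");
    }
    List<String> missing = new ArrayList<>();
    for (String key : required) {
      if (config.getProperty(key, "").trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }

  // An empty api.rootUrl is converted to the documented default URL.
  static String rootUrl(Properties config) {
    String url = config.getProperty("api.rootUrl", "").trim();
    return url.isEmpty() ? "https://cloudsearch.googleapis.com" : url;
  }

  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("api.sourceId", "1234567890abcdef");
    System.out.println("Missing: " + missingKeys(config, true));
    System.out.println("Root URL: " + rootUrl(config));
  }
}
```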

Traversal schedules

The scheduling parameters determine how long the connector waits between traversals.

Setting Parameter
Full traversal at connector startup schedule.performTraversalOnStart=false

When set to true, the connector performs a full traversal at connector startup, rather than waiting for the first interval to expire. The default value is true.

Full traversal after an interval schedule.traversalIntervalSecs=7200

The connector performs a full traversal after a specified interval. Specify the interval between traversals in seconds. The default value is 86400 (number of seconds in one day).

Exit after a single traversal connector.runOnce=true

The connector runs a full traversal once, then exits. The default value is false (do not exit after a single traversal).

Incremental traversal after an interval schedule.incrementalTraversalIntervalSecs=600

The connector performs an incremental traversal after a specified interval. Specify the interval between traversals in seconds. The default value is 300 (number of seconds in 5 minutes).

Scheduled poll queue intervals schedule.pollQueueIntervalSecs=120

The interval, in seconds, between scheduled poll queue operations. This parameter is used only by a listing traversal connector. The default value is 10.
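
The interaction between schedule.performTraversalOnStart and schedule.traversalIntervalSecs can be sketched as a delay calculation. TraversalScheduleSketch is a hypothetical class, and the sketch assumes that when the startup traversal is disabled, the first full traversal waits one full interval.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Cloud Search SDK.
class TraversalScheduleSketch {

  // Seconds to wait before the first full traversal.
  static long initialFullTraversalDelaySecs(Properties config) {
    boolean onStart = Boolean.parseBoolean(
        config.getProperty("schedule.performTraversalOnStart", "true"));
    long intervalSecs = Long.parseLong(
        config.getProperty("schedule.traversalIntervalSecs", "86400"));
    // Assumption: with the startup traversal disabled, wait one full interval.
    return onStart ? 0L : intervalSecs;
  }

  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("schedule.performTraversalOnStart", "false");
    config.setProperty("schedule.traversalIntervalSecs", "7200");
    System.out.println(initialFullTraversalDelaySecs(config));
  }
}
```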

Access control lists

The connector controls access to items by using ACLs. Several parameters let you restrict user access to indexed records with ACLs.

If your repository has individual ACL information associated with each item, upload all ACL information to control item access within Cloud Search. If your repository provides partial or no ACL information, you can supply default ACL information in the following parameters, which the SDK provides to the connector.

Setting Parameter
ACL mode defaultAcl.mode=override

Determines when to apply the default ACL. Valid values:

  • none: do not use default ACL
  • fallback: use default ACL only if no ACL already present
  • append: add default ACL to existing ACL
  • override: replace existing ACL with default ACL

    The default mode is none.

Default public ACL defaultAcl.public=true

The default ACL used for the entire repository is set to public domain access. The default value is false.

Common ACL group readers defaultAcl.readers.groups=google:group1@mydomain.com, group2
Common ACL readers defaultAcl.readers.users=user1, user2, google:user3@mydomain.com
Common ACL denied group readers defaultAcl.denied.groups=group3
Common ACL denied readers defaultAcl.denied.users=user4, user5
Entire domain access To specify that every indexed record be publicly accessible by every user in the domain, set both of the following parameters with values:
  • defaultAcl.mode=override
  • defaultAcl.public=true
Common defined ACL To specify one ACL for each record of the data repository, set all of the following parameter values:
  • defaultAcl.mode=fallback
  • defaultAcl.public=false
  • defaultAcl.readers.groups=google:group1@mydomain.com, group2
  • defaultAcl.readers.users=user1@mydomain.com, user2, google:user3@mydomain.com
  • defaultAcl.denied.groups=group3
  • defaultAcl.denied.users=user4, user5

    Every specified user and group is assumed to be a local domain defined user/group unless prefixed with "google:" (literal constant).

    The default user or group is an empty string. Supply user and group parameters only if defaultAcl.public is set to false. To list multiple groups and users, use comma-delimited lists.

    If defaultAcl.mode is set to none, records are unsearchable without defined individual ACLs.
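
The four defaultAcl.mode values can be pictured as different ways of combining an item's own ACL with the default ACL. The sketch below reduces an ACL to a plain set of reader names and treats an empty set as "no ACL present". Both are simplifying assumptions, and DefaultAclModeSketch is a hypothetical class rather than an SDK API.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the default ACL modes; not part of the Cloud Search SDK.
class DefaultAclModeSketch {

  // Combines an item's readers with the default readers according to the mode.
  static Set<String> effectiveReaders(
      String mode, Set<String> itemReaders, Set<String> defaultReaders) {
    switch (mode) {
      case "none":
        // Do not use the default ACL.
        return new LinkedHashSet<>(itemReaders);
      case "fallback":
        // Use the default ACL only if the item has no ACL of its own.
        return new LinkedHashSet<>(itemReaders.isEmpty() ? defaultReaders : itemReaders);
      case "append": {
        // Add the default ACL to the existing ACL.
        Set<String> merged = new LinkedHashSet<>(itemReaders);
        merged.addAll(defaultReaders);
        return merged;
      }
      case "override":
        // Replace the existing ACL with the default ACL.
        return new LinkedHashSet<>(defaultReaders);
      default:
        throw new IllegalArgumentException("Unknown defaultAcl.mode: " + mode);
    }
  }

  public static void main(String[] args) {
    Set<String> item = new LinkedHashSet<>(Arrays.asList("user1"));
    Set<String> defaults = new LinkedHashSet<>(Arrays.asList("user2", "google:user3@mydomain.com"));
    for (String mode : new String[] {"none", "fallback", "append", "override"}) {
      System.out.println(mode + ": " + effectiveReaders(mode, item, defaults));
    }
  }
}
```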

Metadata configuration parameters

Some item metadata fields are configurable. If the connector sets one of these fields itself, the configuration for that field is ignored. If the connector-supplied value is missing or empty, the configuration below is applied. Each field can be set from a named metadata attribute that the connector provides (using the .field suffix), or from a fixed default value (using the .defaultValue suffix), which is used if the named attribute is missing or its value is empty. The following table shows these parameters.

Setting Parameter
Title itemMetadata.title.field=movieTitle
itemMetadata.title.defaultValue=Gone with the Wind
The metadata attribute that contains the value corresponding to the document title. The default value is an empty string.
URL itemMetadata.sourceRepositoryUrl.field=url
itemMetadata.sourceRepositoryUrl.defaultValue=https://www.imdb.com/title/tt0031381/
The metadata attribute that contains the value for the document URL for search results.
Created timestamp itemMetadata.createTime.field=releaseDate
itemMetadata.createTime.defaultValue=1940-01-17
The metadata attribute that contains the value for the document creation timestamp.
Last modified time itemMetadata.updateTime.field=releaseDate
itemMetadata.updateTime.defaultValue=1940-01-17
The metadata attribute that contains the value for the last modification timestamp for the document.
Document language itemMetadata.contentLanguage.field=languageCode
itemMetadata.contentLanguage.defaultValue=en-US
The content language for documents being indexed.
Schema object type itemMetadata.objectType=movie
The object type used by the site, as defined in the data source schema object definitions. The connector won't index any structured data if this property is not specified.

Note: This configuration property points to a value rather than a metadata attribute, and the .field and .defaultValue suffixes are not supported.
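
The precedence described in this section (connector-supplied value first, then the attribute named by .field, then .defaultValue) can be sketched as follows. ItemMetadataSketch and its method are hypothetical names, and the attribute value "Casablanca" is made-up sample data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical resolver for illustration; not part of the Cloud Search SDK.
class ItemMetadataSketch {

  // Resolves one metadata field, for example name = "title".
  static String resolve(String name, String connectorValue,
      Map<String, String> attributes, Properties config) {
    // 1. A non-empty value set by the connector wins; the configuration is ignored.
    if (connectorValue != null && !connectorValue.isEmpty()) {
      return connectorValue;
    }
    // 2. Otherwise, read the metadata attribute named by the .field parameter.
    String field = config.getProperty("itemMetadata." + name + ".field", "");
    String fromAttribute = field.isEmpty() ? null : attributes.get(field);
    if (fromAttribute != null && !fromAttribute.isEmpty()) {
      return fromAttribute;
    }
    // 3. Finally, fall back to the fixed .defaultValue (empty string if unset).
    return config.getProperty("itemMetadata." + name + ".defaultValue", "");
  }

  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("itemMetadata.title.field", "movieTitle");
    config.setProperty("itemMetadata.title.defaultValue", "Gone with the Wind");

    Map<String, String> attributes = new HashMap<>();
    attributes.put("movieTitle", "Casablanca"); // made-up sample value
    System.out.println(resolve("title", null, attributes, config));
    System.out.println(resolve("title", null, new HashMap<>(), config));
  }
}
```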

Datetime formats

Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.

Setting Parameter
Additional datetime formats structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX
A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported.
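
To check a pattern before adding it to the configuration, you can try it against a sample value with java.time.format.DateTimeFormatter directly. The sketch below splits a semicolon-separated value and tries each pattern in turn. DateTimePatternsSketch is a hypothetical class, and it assumes every pattern includes a UTC offset so that the result can be read as an OffsetDateTime.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pattern checker; not part of the Cloud Search SDK.
class DateTimePatternsSketch {

  // Builds one formatter per non-empty pattern in a semicolon-separated list.
  static List<DateTimeFormatter> formatters(String configValue) {
    List<DateTimeFormatter> result = new ArrayList<>();
    for (String pattern : configValue.split(";")) {
      if (!pattern.trim().isEmpty()) {
        result.add(DateTimeFormatter.ofPattern(pattern.trim()));
      }
    }
    return result;
  }

  // Tries each formatter in order and returns the first successful parse.
  static OffsetDateTime parse(String text, List<DateTimeFormatter> candidates) {
    for (DateTimeFormatter formatter : candidates) {
      try {
        return OffsetDateTime.parse(text, formatter);
      } catch (DateTimeParseException e) {
        // Not this pattern; try the next one.
      }
    }
    throw new IllegalArgumentException("No configured pattern matches: " + text);
  }

  public static void main(String[] args) {
    List<DateTimeFormatter> candidates = formatters("MM/dd/uuuu HH:mm:ssXXX");
    System.out.println(parse("01/17/1940 08:30:00-05:00", candidates));
  }
}
```

A pattern that throws IllegalArgumentException from DateTimeFormatter.ofPattern is malformed and should be corrected before it is placed in the configuration file.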

Structured data

The Cloud Search Indexing API provides a schema service that you can use to customize how Cloud Search indexes and serves your data. If you are using a local repository schema, you must specify the structured data local schema name.

Setting Parameter
Local schema name structuredData.localSchema=mySchemaName

The schema name is read from the data source and used for repository structured data.

The default is an empty string.

Content and search quality

For repositories that contain record-based or field-based content (such as a CRM, CSV, or database), the SDK allows automatic HTML formatting for data fields. Your connector defines the data fields at the beginning of connector execution, and then uses a content template to format each data record before uploading it to Cloud Search.

The content template defines the importance of each field value for searching. The HTML <title> field is required and defined as the highest priority. You can designate search quality importance levels for all the other content fields: high, medium or low. Any content field not defined in a specific category defaults to low priority.

Setting Parameter
Content HTML title contentTemplate.templateName.title=myTitleField

The content HTML title and highest search quality field. This parameter is required only if you are using an HTML content template. The default value is an empty string.

High search quality for content fields contentTemplate.templateName.quality.high=hField1,hField2

Content fields given a high search priority. The default is an empty string.

Medium search quality for content fields contentTemplate.templateName.quality.medium=mField1,mField2

Content fields given a medium search priority. The default is an empty string.

Low search quality for content fields contentTemplate.templateName.quality.low=lField1,lField2

Content fields given a low search priority. The default is an empty string.

Unspecified content fields contentTemplate.templateName.unmappedColumnsMode=IGNORE

How the connector handles unspecified content fields. Valid values are:

  • APPEND—append unspecified content fields to the template
  • IGNORE—ignore unspecified content fields

    The default value is APPEND.
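
The effect of unmappedColumnsMode can be sketched as a decision about which record fields end up in the rendered template. ContentTemplateSketch is a hypothetical class; the ordering shown (title, then high, medium, and low fields, then any unspecified fields) is an assumption made for illustration and does not describe the SDK's actual HTML output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of template field selection; not part of the Cloud Search SDK.
class ContentTemplateSketch {

  // Returns the fields that would be rendered, in an assumed priority order.
  static List<String> fieldOrder(String title, List<String> high, List<String> medium,
      List<String> low, List<String> recordFields, String unmappedColumnsMode) {
    List<String> ordered = new ArrayList<>();
    ordered.add(title);
    ordered.addAll(high);
    ordered.addAll(medium);
    ordered.addAll(low);
    // APPEND adds fields that no template parameter mentions; IGNORE drops them.
    if ("APPEND".equals(unmappedColumnsMode)) {
      for (String field : recordFields) {
        if (!ordered.contains(field)) {
          ordered.add(field);
        }
      }
    }
    return ordered;
  }

  public static void main(String[] args) {
    List<String> record =
        Arrays.asList("myTitleField", "hField1", "mField1", "lField1", "extraField");
    for (String mode : new String[] {"APPEND", "IGNORE"}) {
      System.out.println(mode + ": " + fieldOrder("myTitleField",
          Arrays.asList("hField1"), Arrays.asList("mField1"), Arrays.asList("lField1"),
          record, mode));
    }
  }
}
```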

Uncommonly set parameters

You rarely need to set the parameters listed in this section. The parameters' defaults are set for optimal performance. Google does not recommend setting these parameters to values different from their defaults without specific requirements within your repository.

Traversers

The SDK enables you to specify multiple individual traversers to allow for parallel traversals of a data repository. The SDK template connectors use this feature.

Setting Parameter
Thread pool size traverse.threadPoolSize=10

Number of threads the connector creates to allow for parallel processing. A single iterator fetches operations serially (typically RepositoryDoc objects), but the API calls are processed in parallel using this number of threads.

The default value is 5.

Traverser poll requests

The core of the Cloud Search indexing queue is a priority queue containing an entry for each item known to exist. A listing connector can request to poll items from the indexing API. A poll request gets the highest priority entries from the indexing queue.

The following parameters are used by the SDK listing connector template to define polling parameters.

Setting Parameter
Repository traverser repository.traversers=traverseType1,traverseType2

A comma-separated list of traverser names. Each name XXX serves as the prefix for an individual set of traverser.XXX.[hostload | pollRequest.queue | pollRequest.statuses | timeout | timeunit] parameters.

The ListingConnector template uses this parameter. If you do not specify this parameter in the configuration file, the traverser.[hostload | pollRequest.queue | pollRequest.statuses | timeout | timeunit] parameters are used instead. The default is an empty string.

Queue to be polled traverser.pollRequest.queue=mySpecialQueue

Queue names that this traverser polls. The default is an empty string (implies "default").

traverser.%s.pollRequest.queue=mySpecialQueue

A value for the specific traverser defined by the string %s.

Polling behavior traverser.pollRequest.limit=10

Maximum number of items to return from a polling request. The default value is 0 (implies the API maximum).

traverser.%s.pollRequest.limit=10

A value for the specific traverser defined by the string %s.

Item status traverser.pollRequest.statuses=MODIFIED,NEW_ITEM

The item statuses that this traverser polls. The default is an empty string (implies all status values).

traverser.%s.pollRequest.statuses=MODIFIED,NEW_ITEM

A value for the specific traverser defined by the string %s.

Host load traverser.hostload=10

Maximum number of active parallel threads available for polling. The default value is 5.

traverser.%s.hostload=15

A value for the specific traverser defined by the string %s.

Timeout traverser.timeout=30

Timeout value for interrupting this traverser poll attempt.

The default value is 60.

traverser.%s.timeout=120

A value for the specific traverser defined by the string %s.

traverser.timeunit=SECONDS

The timeout units.

traverser.%s.timeunit=SECONDS

A value for the specific traverser defined by the string %s.

In most cases, a connector using the SDK listing connector template only requires a single set of parameters for polling. In some cases, you may need to define more than one set of polling criteria, for example, if your traversal algorithm requires separating item processing using different queues.

In this case, you have the option of defining multiple sets of polling parameters. Begin by specifying the names of the parameter sets using repository.traversers. For each defined traverser name, supply the configuration file with the parameters in the table above replacing the %s with the traverser name. This creates a set of polling parameters for each defined traverser.
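
The naming scheme for per-traverser parameters can be sketched as a small key builder. TraverserKeysSketch is a hypothetical class. It only derives the key names described above and makes no claim about how the SDK reads them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical key builder for illustration; not part of the Cloud Search SDK.
class TraverserKeysSketch {

  // Returns the configuration keys that hold the given polling parameter,
  // for example parameter = "pollRequest.queue".
  static List<String> keysFor(Properties config, String parameter) {
    List<String> keys = new ArrayList<>();
    String names = config.getProperty("repository.traversers", "").trim();
    if (names.isEmpty()) {
      // No named traversers: the single traverser.<parameter> key applies.
      keys.add("traverser." + parameter);
      return keys;
    }
    // One key per named traverser, with %s replaced by the traverser name.
    for (String name : names.split(",")) {
      if (!name.trim().isEmpty()) {
        keys.add("traverser." + name.trim() + "." + parameter);
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("repository.traversers", "traverseType1,traverseType2");
    System.out.println(keysFor(config, "pollRequest.queue"));
  }
}
```

With repository.traversers=traverseType1,traverseType2, the configuration file would therefore carry entries such as traverser.traverseType1.pollRequest.queue and traverser.traverseType2.pollRequest.queue.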

Checkpoints

A checkpoint is useful for tracking the state of an incremental traversal.

Setting Parameter
Checkpoint directory connector.checkpointDirectory=/path/to/checkpoint

Specifies the path to the local directory to use for the incremental and full traversal checkpoints.

Content uploads

Item content is uploaded to Cloud Search with the item when the content's size does not exceed the specified threshold. If the content's size exceeds the threshold, the content is uploaded separately from the item's metadata and structured data.

Setting Parameter
Content threshold api.contentUploadThresholdBytes=50000

The threshold for content that determines whether it is uploaded "in-line" with the item versus using a separate upload.

The default value is 100000 (~100KB).
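
The threshold rule amounts to a single comparison, sketched below. ContentUploadSketch is a hypothetical class; the comparison follows the description above, in which content is uploaded in-line when its size does not exceed the threshold.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Cloud Search SDK.
class ContentUploadSketch {

  // Reads the threshold, falling back to the documented default of 100000 bytes.
  static long thresholdBytes(Properties config) {
    return Long.parseLong(config.getProperty("api.contentUploadThresholdBytes", "100000"));
  }

  // True when the content is uploaded in-line with the item.
  static boolean uploadInline(long contentBytes, long thresholdBytes) {
    return contentBytes <= thresholdBytes;
  }

  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("api.contentUploadThresholdBytes", "50000");
    long threshold = thresholdBytes(config);
    System.out.println(uploadInline(40_000, threshold));
    System.out.println(uploadInline(60_000, threshold));
  }
}
```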

Containers

The full connector template uses an algorithm involving the concept of a temporary data source "container" for detecting deleted records in the database. This means that upon each full traversal, the fetched records, which are in a new container, replace all the existing Cloud Search records indexed from the previous traversal, which are in an old container.

Setting Parameter
Container name tag traverse.containerTag=my_database_name

To run multiple instances of the connector in parallel without interfering with each other (whether they index different data repositories or separate parts of a common data repository), assign a unique container name tag to each run of the connector. A unique name tag prevents a connector instance from deleting another's records.

The name tag is appended to the Full Traversal Connector toggle container ID, which is built from the fixed string "FullTraversal||" and the toggle indicators "A:" and "B:". The default value is an empty string.

Disable delete detection traverse.useContainers=false

By default, the connector uses the container algorithm to detect deleted records. Disabling delete detection might be useful if you have configured multiple connector instances to index a common data repository and the records are unchanging or if the data repository is using a separate delete detection implementation.

The default value is true, which specifies that containers should be used.

Batch policy

The SDK supports a batch policy that enables you to perform the following actions:

  • Batch requests
  • Specify the number of requests in a batch queue
  • Manage concurrently executing batches
  • Flush batched requests

The SDK batches the connector's requests to improve throughput during uploads. A batch upload is triggered by either the number of requests or the timeout, whichever comes first. That is, the batch is uploaded if the batch delay time expires before the batch size is reached, or if the batch size is reached before the delay time expires.

Setting Parameter
Batch requests batch.batchSize=10

Number of requests to batch together. The default value is 10.

Number of requests in a batch queue batch.maxQueueLength=500

Maximum number of requests in a batch queue for execution. The default value is 1000.

Concurrently executing batches batch.maxActiveBatches=5

Number of allowable concurrently executing batches. The default value is 20.

Flush batched requests automatically batch.maxBatchDelaySeconds=5

Number of seconds to wait before batched requests are flushed automatically. The default value is 5.

Flush batched requests on shutdown batch.flushOnShutdown=false

Flush batched requests during service shutdown. The default value is true.
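
The "whichever comes first" trigger described in this section can be sketched as a simple predicate over batch.batchSize and batch.maxBatchDelaySeconds. BatchTriggerSketch is a hypothetical class, and the rule that an empty queue is never flushed is an assumption of this sketch.

```java
// Hypothetical model of the batch trigger; not part of the Cloud Search SDK.
class BatchTriggerSketch {

  // True when the queued requests should be uploaded as a batch.
  static boolean shouldFlush(
      int queuedRequests, int batchSize, long elapsedSecs, long maxBatchDelaySecs) {
    if (queuedRequests == 0) {
      return false; // assumption: nothing to upload
    }
    boolean sizeReached = queuedRequests >= batchSize;
    boolean delayExpired = elapsedSecs >= maxBatchDelaySecs;
    return sizeReached || delayExpired;
  }

  public static void main(String[] args) {
    // Documented defaults: batch.batchSize=10, batch.maxBatchDelaySeconds=5.
    System.out.println(shouldFlush(10, 10, 1, 5)); // size reached first
    System.out.println(shouldFlush(3, 10, 5, 5));  // delay expired first
    System.out.println(shouldFlush(3, 10, 2, 5));  // neither trigger yet
  }
}
```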

Exception handlers

The exception handler parameters determine how the traverser proceeds after it encounters an exception.

Setting Parameter
Traverser instruction in case of error traverse.exceptionHandler=100

How the traverser should proceed after an exception is thrown. Valid values are:

  • 0--always abort the traversal after encountering an exception
  • ignore--ignore the error
  • Number of exceptions (for example, 10)--abort after the traverser encounters the specified number of exceptions

    The default value is 0 (always abort on error).

Wait time between exceptions abortExceptionHander.backoffMilliSeconds=100

Backoff time in milliseconds to wait between detected handler exceptions (typically used when traversing a repository). The default value is 10.
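
The three kinds of traverse.exceptionHandler values can be sketched as a single decision function. ExceptionHandlerSketch is a hypothetical class, and it assumes that a numeric value N means the traversal aborts once the running exception count reaches N, which also covers the 0 case.

```java
// Hypothetical model of the abort decision; not part of the Cloud Search SDK.
class ExceptionHandlerSketch {

  // Decides whether to abort, given the setting and the number of
  // exceptions seen so far (at least 1 when this is called).
  static boolean shouldAbort(String setting, int exceptionCount) {
    String value = setting.trim();
    if (value.equalsIgnoreCase("ignore")) {
      return false; // ignore the error and keep traversing
    }
    int limit = Integer.parseInt(value);
    // Assumption: abort once the count reaches the limit; 0 always aborts.
    return exceptionCount >= limit;
  }

  public static void main(String[] args) {
    System.out.println(shouldAbort("0", 1));
    System.out.println(shouldAbort("ignore", 1000));
    System.out.println(shouldAbort("100", 99));
    System.out.println(shouldAbort("100", 100));
  }
}
```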
