Content Connector SDK Guide

This guide is intended for developers who are creating a connector using the Google Cloud Search SDK. It contains detailed descriptions of many of the Cloud Search SDK objects, arranged by key functionality. It assumes you are familiar with Google Cloud Search, connector concepts, and object-oriented programming languages and concepts.

Accessing the connector

To access the Cloud Search SDK from your program, you'll use the IndexingApplication class. Every connector begins execution by creating an IndexingApplication instance and calling its start() method. This hands off processing to the SDK, which makes connector calls as determined by the settings that you have supplied in the connector's configuration file.

You instantiate your application with a builder to do one of the following:

  • Create a new connector instance. Use this approach if you are building a connector from scratch and writing your own traversal implementation.
  • Create a new repository using a template class. Use this approach to use a predetermined traversal strategy.

Create a new connector from Connector main()

Implement your application with a builder that passes in an instance of the current connector as a required parameter. The following code snippet shows a typical approach to initializing the SDK for a connector called MyConnector.

public class MyConnector implements IndexingConnector {
  // ...
  public static void main(String[] args)
      throws IOException, InterruptedException {

    IndexingApplication application =
        new IndexingApplication.Builder(new MyConnector(), args)
            .build();
    application.start();
  }

  // Followed by all the overridden Connector methods.
  // ...
}

After the connector calls application.start(), the SDK gains execution control and proceeds to:

  • read the configuration file to create an IndexingService object
  • schedule traversals
  • call the Connector.init() method

Call a template class from Repository main()

If you are using the SDK template classes in your connector (recommended), pick a connector traversal type (ListingConnector or FullTraversalConnector) to pass to the IndexingApplication instance. Then from your Repository code, build the IndexingApplication instance and call IndexingApplication.start().

The following example uses the ListingConnector template object. For a full traversal strategy, just replace ListingConnector with FullTraversalConnector.

public class MyRepository implements Repository {
  // ...
  public static void main(String[] args)
      throws IOException, InterruptedException {

    IndexingApplication application = new IndexingApplication.Builder(
            new ListingConnector(new MyRepository()), args)
        .build();
    application.start();
  }

  // Followed by all the overridden Repository methods.
  // ...
}

Of course, you could also place the main() alone in its own class to keep it separate from your Connector or Repository code.

Running the connector

The Connector interface enables you to implement basic connector operations. The SDK calls its methods during:

  • initialization (init())
  • shutdown (saveCheckpoint(), destroy())
  • scheduled traversals (traverse())

The following code snippet shows the main methods of this interface.

public interface IndexingConnector {
  public void init(IndexingConnectorContext context) throws Exception;

  public void traverse()
      throws IOException, InterruptedException;

  public void saveCheckpoint()
      throws IOException, InterruptedException;

  public void destroy();
}

Initialize the connector

The init() method enables the connector to initialize itself before accepting any requests from the SDK. The SDK passes an IndexingConnectorContext object that the connector uses to register TraverserConfiguration objects and access the IndexingService object. The TraverserConfiguration objects schedule polling requests (used only when the connector employs a list traversal strategy). The Connector interface uses the IndexingService object to make document indexing requests to Cloud Search.

Implement a traversal strategy

The traverse() method enables the connector to implement a particular traversal strategy. While executing the traverse() method, the connector interacts with the data repository to fetch all documents that need to be indexed.

If you use the template objects (ListingConnector or FullTraversalConnector), the traversal strategy employed by this connector determines which Repository interface methods to call.

The traverse() method typically implements one of the following traversal strategies:

  • List traversal. A list traversal fetches documents by IDs to push to the Cloud Search Indexing API. You use a registered TraverserConfiguration object to poll these IDs to ensure changes to items in your data repository trigger re-indexing in Cloud Search.

  • Graph traversal. Typically, this traversal strategy pushes a root document ID to the Cloud Search Indexing API and feeds additional child IDs while processing the parent document ID.

  • Full traversal. This traversal strategy typically accesses and indexes every document in the repository. This strategy is commonly used when you want to index all items in your data repository, such as during the initial index.

Create traversal checkpoints

The saveCheckpoint() method enables the connector to start from where it left off between connector traversals. By committing incremental change tokens to the checkpoint, the connector can also recover missed changes between connector restarts. The SDK always calls the saveCheckpoint() method prior to a destroy() call.
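
As a minimal sketch of how a change token could be written to and restored from a checkpoint, the following example encodes a last-seen timestamp as a byte array, which is the checkpoint type that the Repository methods later in this guide accept. The CheckpointCodec class and the choice of a single timestamp as the token are assumptions made for illustration; they are not part of the SDK.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not an SDK class): encodes a change token as a checkpoint payload.
final class CheckpointCodec {
  private CheckpointCodec() {}

  // Serializes the timestamp of the last processed change into 8 bytes.
  static byte[] encode(long lastChangeMillis) {
    return ByteBuffer.allocate(Long.BYTES).putLong(lastChangeMillis).array();
  }

  // Restores the timestamp. A null or malformed checkpoint means "start from the beginning".
  static long decode(byte[] checkpoint) {
    if (checkpoint == null || checkpoint.length != Long.BYTES) {
      return 0L;
    }
    return ByteBuffer.wrap(checkpoint).getLong();
  }
}
```

A connector could call encode() when it saves its checkpoint and decode() at the start of the next traversal to resume from the recorded position.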

Release traversal resources

The destroy() method enables the connector to release any resources before shutdown.
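
The call order described in this section can be sketched with a toy driver. The LifecycleConnector interface below is a simplified stand-in for the connector interface (no context argument, no checked exceptions), and LifecycleDriver is hypothetical; the SDK's real scheduling is more involved than this loop.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-in for the connector interface; not the SDK type.
interface LifecycleConnector {
  void init();
  void traverse();
  void saveCheckpoint();
  void destroy();
}

// Records each call so that the order can be inspected.
final class RecordingConnector implements LifecycleConnector {
  final List<String> calls = new ArrayList<>();

  @Override public void init() { calls.add("init"); }
  @Override public void traverse() { calls.add("traverse"); }
  @Override public void saveCheckpoint() { calls.add("saveCheckpoint"); }
  @Override public void destroy() { calls.add("destroy"); }
}

// Toy driver: initialize once, run the traversals, then shut down.
final class LifecycleDriver {
  private LifecycleDriver() {}

  static void run(LifecycleConnector connector, int traversals) {
    connector.init();
    try {
      for (int i = 0; i < traversals; i++) {
        connector.traverse();
      }
    } finally {
      try {
        // The checkpoint is saved before resources are released.
        connector.saveCheckpoint();
      } finally {
        connector.destroy();
      }
    }
  }
}
```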

Retrieving document IDs for indexing

The ItemRetriever interface is only implemented by a Connector object using a list traversal strategy.

During a Connector.traverse() method call, a list traversal retrieves only the data repository document IDs. These document IDs are pushed to the Cloud Search indexing queue, but are not actually indexed yet.

On a schedule registered during the Connector.init() method call, the SDK polls the Cloud Search indexing queue and passes each queued document ID to the ItemRetriever.process() method. At this point, the connector retrieves the document from the data repository and indexes it in Cloud Search.

This interface defines a single method (process()).

public interface ItemRetriever {
  void process(Item item)
     throws IOException, InterruptedException;
}

Retrieve and index a single document

The process() method enables the connector to retrieve a single document from the data repository and index it in Cloud Search.

Similar to the Connector.traverse() method of a connector using a list traversal strategy, the process() method can also handle child documents in a hierarchical repository by retrieving and pushing the child IDs into the Cloud Search indexing queue. These child IDs are processed in the future when the SDK polls for new queued document IDs. For each of these discovered child IDs, the SDK makes a future process() method call.
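
The interplay between pushing IDs and processing them can be illustrated with a toy simulation. In the sketch below an in-memory queue stands in for the Cloud Search indexing queue, and the QueueTraversalSketch class and its children map are hypothetical; no SDK types are used and no real indexing takes place.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy simulation (no SDK types): an in-memory queue stands in for the indexing queue.
final class QueueTraversalSketch {
  private QueueTraversalSketch() {}

  // children maps a document ID to the IDs of its child documents.
  static List<String> traverse(List<String> rootIds, Map<String, List<String>> children) {
    Deque<String> queue = new ArrayDeque<>(rootIds); // traverse(): push root IDs only
    Set<String> seen = new HashSet<>(rootIds);
    List<String> indexed = new ArrayList<>();
    while (!queue.isEmpty()) {
      String id = queue.poll(); // a queued ID is polled
      indexed.add(id);          // process(): fetch and index the document
      for (String child : children.getOrDefault(id, Collections.emptyList())) {
        if (seen.add(child)) {
          queue.add(child);     // process(): push newly discovered child IDs
        }
      }
    }
    return indexed;
  }
}
```

Each child ID is handled by a later iteration of the loop, which mirrors the way the SDK makes a future process() call for every ID that is discovered.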

Enabling incremental traversals

A Connector object that supports change detection within its data repository should implement the IncrementalChangeHandler interface to enable the SDK to make incremental traversal calls. The configuration file defines the schedule for incremental traversal calls.

This interface defines a single method (handleIncrementalChanges()), as shown in the following code snippet.

public interface IncrementalChangeHandler {
  void handleIncrementalChanges()
     throws IOException, InterruptedException;
}

Retrieve only changed documents

The handleIncrementalChanges() method enables the connector to perform an incremental traversal where it only retrieves repository documents that have been added, deleted, or modified since the last traversal. Depending on the traversal strategy that the connector is using, the SDK either pushes changed documents to the Cloud Search indexing queue (list or graph traversal) or directly indexes them into Cloud Search (full traversal).
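
As a rough sketch of the selection step in an incremental traversal, the following example picks the documents whose modification time is later than the previous traversal. The IncrementalChangeSketch class and the map of modification times are assumptions for illustration; a real repository would also need its own way to report deleted documents, which this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: selects documents modified after the last traversal.
final class IncrementalChangeSketch {
  private IncrementalChangeSketch() {}

  // lastModified maps a document ID to its modification time in epoch milliseconds.
  static List<String> changedSince(Map<String, Long> lastModified, long lastTraversalMillis) {
    List<String> changed = new ArrayList<>();
    // A TreeMap copy gives a deterministic iteration order (sorted by ID).
    for (Map.Entry<String, Long> entry : new TreeMap<>(lastModified).entrySet()) {
      if (entry.getValue() > lastTraversalMillis) {
        changed.add(entry.getKey());
      }
    }
    return changed;
  }
}
```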

Giving the connector access to the SDK

The Cloud Search SDK creates an IndexingConnectorContext instance to give the IndexingConnector object access to the SDK. At initialization, the IndexingApplication object calls the IndexingConnector.init() method, passing an IndexingConnectorContext object to the Connector object.

The two main functions the IndexingConnectorContext interface provides the Connector object are:

  • Access to Cloud Search indexing
  • List traverser configuration

Access the IndexingService object

The Connector interface uses the ConnectorContext object to retrieve the SDK-created IndexingService object. This object is used later during document traversals to interact with Cloud Search.

Define TraverserConfiguration instances

The Connector interface uses the ConnectorContext object to define how to poll document IDs from the Cloud Search indexing queue when using a list traversal strategy. The Connector object creates TraverserConfiguration objects during the Connector.init() method call from the SDK. The connector calls ConnectorContext.registerTraverser() to configure future poll scheduling.

The passed ConnectorContext object is typically only used during the Connector.init() method call. The Connector object should save a reference to the IndexingService object and make all the necessary ConnectorContext interface method calls for registering TraverserConfiguration objects.

The following code snippet shows some of the main methods of this interface.

public interface IndexingConnectorContext {
  public void registerTraverser(
      TraverserConfiguration configuration);

  public IndexingService getIndexingService();

  public List<TraverserConfiguration>
      getTraverserConfiguration();

  public ExceptionHandler
      getIncrementalTraversalExceptionHandler();

  public ExceptionHandler getTraversalExceptionHandler();
}

Schedule list traversal polling calls

The registerTraverser() method enables the Connector object to schedule SDK list traversal polling calls. Use this method only for list and graph traversal strategies.

The Connector creates one or more TraverserConfiguration object(s) and passes them to the SDK. A typical usage of this method is when the Connector object implements the ItemRetriever interface.

The following code snippet shows how to register a single traverser using the Cloud Search SDK default parameters.

public class MyConnector implements IndexingConnector, ItemRetriever {

  // ...
  @Override // from the Connector interface
  public void init(IndexingConnectorContext context)
      throws Exception {
    // ...
    context.registerTraverser(
       new TraverserConfiguration.Builder()
           .itemRetriever(this)
           .build());
    // ...
  }

  // ...
  @Override // from the ItemRetriever interface
  public void process(Item item)
     throws IOException, InterruptedException {
    // ...
  }
}

Give the connector access to the Cloud Search API

The getIndexingService() method enables the IndexingConnector to access the Cloud Search indexing API.

The IndexingConnector object typically stores the passed IndexingService object to make indexing requests to the Cloud Search API during future document traversals. All traversal strategies use this method.

Retrieve registered objects

The getTraverserConfiguration() method enables the IndexingConnector object to retrieve all previously registered TraverserConfiguration objects.

The SDK uses this method to create scheduled queue polling threads after the Connector has completed registering. Typically, the IndexingConnector object does not use this method.

Retrieve the exception handler for incremental traversals

The getIncrementalTraversalExceptionHandler() method enables the Connector object to retrieve the exception handler for incremental traversals.

The SDK uses this method to configure incremental traversals. Typically, the IndexingConnector object does not use this method.

Retrieve the exception handler for traversals

The getTraversalExceptionHandler() method enables the IndexingConnector object to retrieve the exception handler for traversals. The SDK uses this method to configure traversals. Typically, the Connector object does not use this method.

Parsing configuration file parameters

The Configuration class parses the configuration file parameters and makes the values available to the connector code.

Define connector behavior

The configuration file defines connector behavior. The file contains all the default configuration parameters that the connector administrator designates as key/value pairs. The format of the key/value pairs is:

key=value

For example, myConnector.user=username

These parameters include general parameters that all connectors require, along with any connector-specific parameters that the connector developer might define. The following example shows a snippet from a configuration file.

#
# General SDK parameters
api.sourceId=s1234567890
api.identitySourceId=0987654321lmnopq
api.serviceAccountPrivateKeyFile=./PrivateKey.json

# …

#
# Connector specific parameters
myConnector.user=username
myConnector.access=read_only
# …

To pass the configuration file to the connector during execution, use the following command line argument:

-Dconfig=MyConfigFile.properties

If this argument is missing, the SDK attempts to access a default configuration file named connector-config.properties.
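
The key=value format resembles the standard Java properties format. Assuming the file can be read that way, the following sketch shows how such text parses with java.util.Properties. The ConfigFileSketch class is hypothetical and is not how the SDK itself necessarily loads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not an SDK class): parses key=value configuration text.
final class ConfigFileSketch {
  private ConfigFileSketch() {}

  // Lines starting with '#' are treated as comments by java.util.Properties.
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      // Reading from an in-memory string should not fail; rethrow unchecked if it does.
      throw new UncheckedIOException(e);
    }
    return props;
  }
}
```

For example, parsing the two lines api.sourceId=s1234567890 and myConnector.user=username yields a Properties object from which getProperty("myConnector.user") returns username.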

Maintain configuration file key/value pairs

The Configuration class contains static methods that maintain a list of configuration file key/value pairs, each stored in a ConfigValue object. The following example shows the most commonly used methods.

public class Configuration {
  public static void initConfig(String[] args) throws IOException {}

  public static boolean isInitialized() {}

  // getters
  public static ConfigValue<Boolean> getBoolean(
      String configKey, Boolean defaultValue) {}

  public static ConfigValue<String> getString(
      String configKey, String defaultValue) {}

  public static ConfigValue<Integer> getInteger(
      String configKey, Integer defaultValue) {}

  public static <T> ConfigValue<T> getValue(
      String configKey,
      T defaultValue,
      Parser<T> parser) {}

  public static <T> ConfigValue<List<T>> getMultiValue(
      String configKey,
      List<T> defaultValues,
      Parser<T> parser) {}

  // interface used by all of the getters
  public interface Parser<T> {
    T parse(String value) throws InvalidConfigurationException;
  }
}

Parse and store configuration file parameters

The initConfig() method reads the configuration file given as an argument of the invocation command. This method parses and stores the configuration key/value pairs.

Once this method executes, the configuration values are accessible from anywhere in the connector code. The SDK calls the initConfig() method when your connector's main() method calls Application.build().

Check for an initialized configuration object

The Application object initializes the static Configuration object early during initialization so that parameter values are immediately retrievable. The isInitialized() method checks for this initialization. All methods in the SDK and in connector code should call the isInitialized() method before attempting to access any configuration parameter.

Return ConfigValue objects and values

Each of the Configuration object getter methods returns a ConfigValue object.

There are several getter methods that retrieve specific parameter value types (Boolean, Integer, String, and so on).

The ConfigValue objects store the retrievable parameter values. The Configuration object getters create the ConfigValue objects which cannot be directly instantiated by your connector code. Use ConfigValue.get() to retrieve the actual parameter value from the ConfigValue object.

The following example shows how to fetch ConfigValue objects and their parameter values from the following defined keys:

  • "config.key1" (Boolean value)
  • "config.key2" (Integer value)
  • "required.config.key3" (String value)

The first two objects have default values that are returned if the keys are not defined in the configuration file. The third value is a required parameter (denoted by a null default value) that causes an exception if it is missing from the configuration file.

// somewhere within your connector code
assertTrue(Configuration.isInitialized());
// optional parameters: the second argument is the default value
boolean myBoolParam = Configuration.getBoolean("config.key1", false).get();
int myIntParam = Configuration.getInteger("config.key2", 10).get();
// required parameter: a null default means the key must be defined
String myStringParam = Configuration.getString("required.config.key3", null)
    .get();
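
The difference between a defaulted parameter and a required one can be sketched without the SDK. The ConfigLookupSketch class below is a hypothetical stand-in, not the SDK's Configuration class; it assumes the configuration is already loaded into a map and uses IllegalStateException as a placeholder for whatever exception the SDK raises.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in (not the SDK's Configuration class) for default and required lookups.
final class ConfigLookupSketch {
  private ConfigLookupSketch() {}

  static <T> T get(
      Map<String, String> config, String key, T defaultValue, Function<String, T> parser) {
    String raw = config.get(key);
    if (raw != null) {
      return parser.apply(raw.trim()); // a defined key always wins over the default
    }
    if (defaultValue == null) {
      // A null default marks the key as required.
      throw new IllegalStateException("Missing required configuration key: " + key);
    }
    return defaultValue;
  }
}
```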

There is also a useful getter for a comma-delimited list. In the following example, the list's base type is String.

List<String> myListParam = Configuration.getMultiValue(
    "required.config.listKey4",
    null,
    Configuration.STRING_PARSER)
    .get();
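
To make the comma-delimited format concrete, the following sketch splits a raw value and applies a per-element parser. MultiValueSketch is a hypothetical helper; the trimming of whitespace and skipping of empty entries are assumptions, and the SDK's own handling may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper (not an SDK class): parses a comma-delimited configuration value.
final class MultiValueSketch {
  private MultiValueSketch() {}

  static <T> List<T> parseList(String raw, Function<String, T> parser) {
    List<T> values = new ArrayList<>();
    for (String part : raw.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) { // assumption: blank entries are ignored
        values.add(parser.apply(trimmed));
      }
    }
    return values;
  }
}
```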

Interpret data types of configuration parameter values

The Configuration class defines a general Parser interface that interprets the underlying data type of the parameter value.

The standard data types all have predefined Parser objects (Configuration.BOOLEAN_PARSER, Configuration.INTEGER_PARSER, Configuration.STRING_PARSER, and so on). If your connector requires an additional, specialized Parser, you can write your own and pass it to the Configuration.getValue() method.

The following example shows how to implement a basic URL field parser.

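// custom parser for URL values
Parser<URL> urlParser = value -> {
  try {
    return new URL(value);
  } catch (MalformedURLException e) {
    // Report the bad value; without a throw here the lambda would not
    // return a value on this path and would fail to compile.
    throw new InvalidConfigurationException("Invalid URL: " + value, e);
  }
};
URL myUrl = Configuration.getValue(
    "required.config.url", null, urlParser)
    .get();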

Communicating with the Cloud Search API

The SDK performs all communication between your connector and the Cloud Search API by using an IndexingServiceImpl instance of the IndexingService interface. The IndexingService interface defines all of the Cloud Search API request types available to your connector. The SDK and your connector use the majority of the request types to maintain document indexing within Cloud Search.

The following code snippet shows the methods of the IndexingService interface.

public interface IndexingService extends Service {
  ListenableFuture<Operation> deleteItem(
      String id, byte[] version, RequestMode requestMode) throws IOException;

  Item getItem(String id) throws IOException;

  Iterable<Item> listItem(boolean brief) throws IOException;

  ListenableFuture<Operation> indexItem(
      Item item, RequestMode requestMode) throws IOException;

  ListenableFuture<Operation> indexItemAndContent(
      Item item,
      AbstractInputStreamContent content,
      @Nullable String contentHash,
      ContentFormat contentFormat,
      RequestMode requestMode)
      throws IOException;

  List<Item> poll(PollItemsRequest pollQueueRequest)
      throws IOException;

  Iterable<Item> pollAll(PollItemsRequest pollQueueRequest)
      throws IOException;

  ListenableFuture<Item> push(PushItem pushItem)
      throws IOException;

  ListenableFuture<Operation> unreserve(String queue)
      throws IOException;

  UploadItemRef startUpload() throws IOException;

  Schema getSchema() throws IOException;
}

Add documents to the Cloud Search index

The connector adds every document in your data repository to the Cloud Search index by using a small subset of the IndexingService interface methods. The SDK represents each document as an Item object which you create using an SDK IndexingItemBuilder object.

Index documents without content

To index your documents in Cloud Search without content, use indexItem(). This method indexes your previously built Item object so that it is searchable on its metadata, structured data and other attributes.

Index documents with content

To index your documents into Cloud Search with content, use indexItemAndContent(). This method indexes your Item so that it is searchable on its content, as well as on its metadata, structured data and other attributes.

Inside this method, the SDK chooses how to index the content based on its size to maximize efficiency. The SDK indexes smaller content inline with the Item, while it uploads larger content separately from the Item. The separate upload uses the startUpload() method for coordination.

Maintain the indexing queue

The Cloud Search API provides a document queue that maintains state between data repository traversals. When the connector is traversing a data repository using a list or graph strategy, it pushes your document IDs to the queue and later polls them to upload them for indexing.

The push() and poll() methods perform queue-related requests. The SDK provides the following polling methods:

  • poll(), which has a settable limit on how many Item objects are returned
  • pollAll(), which does not have a settable limit

In both polling methods, you pass in a PollItemsRequest object that defines which queue name and which Item status values are targeted.

Reserve objects

When a thread polls Item objects from the Cloud Search API queue, the thread marks them internally as reserved. This prevents multiple threads from polling and acting on the same Item object in parallel.

The polled item is not accessible to another polling request until a push or index causes it to rejoin the queue or a timeout period elapses. A timeout might occur if the connector stops running during processing. You can manually make the queued Item objects available again for polling by using the unreserve() method.
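
The reservation behavior can be modeled with a small in-memory queue. The ReservationQueueSketch class below is a toy model under simplifying assumptions (no timeout, one queue, no item statuses); the real queue lives in the Cloud Search API and is not implemented this way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of reservation semantics; not an SDK class.
final class ReservationQueueSketch {
  // Maps each queued item ID to whether it is currently reserved.
  private final Map<String, Boolean> reserved = new LinkedHashMap<>();

  // Pushing an item queues it (or requeues it) as available.
  synchronized void push(String id) {
    reserved.put(id, false);
  }

  // Returns up to limit unreserved IDs and marks them reserved.
  synchronized List<String> poll(int limit) {
    List<String> polled = new ArrayList<>();
    for (Map.Entry<String, Boolean> entry : reserved.entrySet()) {
      if (polled.size() == limit) {
        break;
      }
      if (!entry.getValue()) {
        entry.setValue(true);
        polled.add(entry.getKey());
      }
    }
    return polled;
  }

  // Makes every reserved item available for polling again.
  synchronized void unreserve() {
    reserved.replaceAll((id, flag) -> false);
  }
}
```

In this model a second poll never returns an item that an earlier poll reserved, until the item is pushed again or unreserve() is called.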

Other item methods

Other useful Item object related methods include:

  • deleteItem(), for removing an Item from Cloud Search (and also from the API queue)
  • getItem(), for retrieving a single Item
  • listItem(), for retrieving all the Item objects from the Cloud Search data source

Retrieve the data source schema

A data source might contain a defined schema that supports searches on data repository structured data. If so, the SDK retrieves the data source's schema during an IndexingApplication.start() method call during connector initialization. The SDK initializes the structured data object mappings from the schema so that data repository indexes can use document values for structured data searches.

Performing connector traversals

The SDK uses the ConnectorTraverser object to schedule and execute all of the connector traversal method calls. The following table shows traversals and the methods that call them.

Traversal                          Method
full data repository traversal     Called by Connector.traverse()
incremental repository traversal   Called by IncrementalChangeHandler.handleIncrementalChanges()
queue polling traversal            Called by ItemRetriever.process()

The Application.build() method creates this object very early during connector initialization. The Application object passes your Connector instance to the ConnectorTraverser object to facilitate making traversal calls on your connector.

Although the ConnectorTraverser object is not publicly accessible, the connector configures the various traversal schedules by values contained in the configuration file.

Perform full data repository traversals

The ConnectorTraverser object schedules a single task to handle Connector.traverse() calls. This task is required for the connector to execute.

The scheduling option configuration parameters define the scheduling timeframe. There are also parameters that specify whether to perform the first traversal immediately at start-up and whether to run the traversal just once and then exit.

Perform incremental data repository traversals

The ConnectorTraverser object schedules a single task to handle IncrementalChangeHandler.handleIncrementalChanges() calls. This task is optional and is only present when your connector supports incremental traversals by implementing the IncrementalChangeHandler interface.

The configuration parameters define the frequency of performing an incremental traversal. There is also a parameter that you can use to define your checkpoint ID, if your connector uses checkpoints and should not use the default value.

Perform list or graph traversals

The ConnectorTraverser object schedules potentially multiple tasks to handle ItemRetriever.process() calls. These tasks are optional and are only present when your connector supports queue polling traversals by implementing the ItemRetriever interface. Use this interface in the standard implementation of a list or graph traversal strategy.

Register a traverser

Your connector must register a traverser for each scheduled queue polling traversal. The TraverserConfiguration object stores all the settings for a single queue polling traversal. The Connector object creates one or more of these objects typically in the Connector.init() method and then passes the TraverserConfiguration object to the ConnectorContext.registerTraverser() method.

The configuration parameters define various aspects of the polling traversal including:

  • The polling frequency
  • The queue(s) to query
  • The status of the documents to query

The TraverserConfiguration.Builder() object sets the TraverserConfiguration object parameters. You can either manually create and pass the API's PollItemsRequest object to the builder, or you can simply let the builder use the configuration parameters directly.

The following code snippet shows how to set the TraverserConfiguration object parameters using the configuration parameters. If only a single queue polling traverser is required, use the default parameters instead of myKeyName.

# My Configuration File
#
# Queue Polling Traversal parameters using key name "myKeyName"
traverser.myKeyName.pollRequest.queue=QueueName
traverser.myKeyName.pollRequest.statuses=new_item,modified
traverser.myKeyName.timeout=20
traverser.myKeyName.timeunit=SECONDS
traverser.myKeyName.hostload=10
# … {other parameters}

// My Connector Code
public class MyConnector implements IndexingConnector, ItemRetriever {
  // ...
  @Override // from the Connector interface
  public void init(IndexingConnectorContext context)
      throws Exception {
    // …
    TraverserConfiguration myTraverseConfig =
        new TraverserConfiguration.Builder("myKeyName")
            .itemRetriever(this)
            .build();

    context.registerTraverser(myTraverseConfig);
    // ...
  }

  // ...
  @Override // from the ItemRetriever interface
  public void process(Item item)
      throws IOException, InterruptedException {
    // ...
  }
}

Using traversal connector templates

The traversal connector templates work with the rest of the SDK to perform all the traversal scheduling required for your connector. By using a connector template, you can spend the majority of your development effort on the access functionality of your data repository by implementing the Repository interface.

Choose the traversal connector template that is appropriate for your traversal strategy:

  • ListingConnector, for a list or graph traversal
  • FullTraversalConnector, for a full traversal

The connector template you choose determines the subset of Repository interface methods that you are required to implement.

Use a template for a list or graph traversal

Use the ListingConnector object when implementing a connector using a list or graph traversal.

This traversal strategy pushes document IDs to the Cloud Search queue and retrieves them one at a time for indexing. During indexing, the document content is fetched from the data repository and any child document IDs are pushed to the queue.

The following code snippet shows the methods that are implemented.

public class ListingConnector
    implements IndexingConnector, ItemRetriever, IncrementalChangeHandler {
  // From Connector
  @Override
  void init(IndexingConnectorContext context) throws Exception;

  @Override
  void traverse() throws IOException, InterruptedException;

  @Override
  void destroy();

  // From ItemRetriever
  @Override
  void process(Item item)
      throws IOException, InterruptedException;

  // From IncrementalChangeHandler (optional)
  @Override
  void handleIncrementalChanges()
      throws IOException, InterruptedException;

  // Optional
  void handleAsyncOperation(AsyncApiOperation e);
}

The following table shows which Repository interface methods are called when using this connector template.

Method                    Summary
Repository.init()         Called once by ListingConnector.init() during initialization to enable the Repository to perform any set-up functions.
Repository.getIds()       Called once for each scheduled traversal by ListingConnector.traverse() to push document IDs to the Cloud Search queue. The Repository typically retrieves all "root" document IDs from the data repository. This begins the recursive traversal of a hierarchical repository.
Repository.getDoc()       Called by ListingConnector.process() during the scheduled polling of the Cloud Search queue. The Repository retrieves the document content for indexing and its child IDs to push to the queue.
Repository.getChanges()   Called once for each scheduled incremental traversal by ListingConnector.handleIncrementalChanges() to push modified document IDs to the Cloud Search queue. This is optional and is only used when the data repository supports detection of modified documents.
Repository.close()        Called once by ListingConnector.destroy() during connector shutdown to allow the Repository to perform any data repository cleanup functions.

Use a template for a full traversal

Use the FullTraversalConnector object when implementing a connector for a repository whose data set is small or static, or non-hierarchical repositories that have no change detection capability. Rather than using the Cloud Search queue, each traversal indexes every document in the data repository.

The connector assigns a new container to all of the indexed documents from a traversal. At the end of the traversal, the connector deletes the container used in the previous traversal from Cloud Search. This deletion propagates to all of the documents left in the old container because they were not discovered during the most recent traversal of the data repository. The sole purpose of this algorithm is to detect documents that have been deleted from the data repository.
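
The container algorithm can be sketched with a toy in-memory index. In the following example, FullTraversalSketch is hypothetical, containers are modeled as increasing integers, and the map stands in for the Cloud Search index; it is a simplified illustration of the idea rather than the SDK's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (no SDK types) of container-based delete detection in a full traversal.
final class FullTraversalSketch {
  // Maps each indexed document ID to the container of the traversal that last saw it.
  private final Map<String, Integer> index = new HashMap<>();
  private int currentContainer = 0;

  // Indexes every document found and returns the IDs removed as stale.
  Set<String> traverse(Set<String> repositoryDocIds) {
    int previousContainer = currentContainer;
    currentContainer++;
    for (String id : repositoryDocIds) {
      index.put(id, currentContainer); // re-indexing moves the document to the new container
    }
    // Anything still in the previous container was not seen during this traversal.
    Set<String> deleted = new TreeSet<>();
    for (Map.Entry<String, Integer> entry : index.entrySet()) {
      if (entry.getValue() == previousContainer) {
        deleted.add(entry.getKey());
      }
    }
    index.keySet().removeAll(deleted); // deleting the old container removes its documents
    return deleted;
  }

  Set<String> indexedIds() {
    return new TreeSet<>(index.keySet());
  }
}
```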

The following code snippet shows the methods that are implemented.

public class FullTraversalConnector
    implements IndexingConnector, IncrementalChangeHandler {
  // From Connector
  @Override
  void init(IndexingConnectorContext context) throws Exception;

  @Override
  void traverse() throws IOException, InterruptedException;

  @Override
  void destroy();

  // From IncrementalChangeHandler (optional)
  @Override
  void handleIncrementalChanges() throws IOException, InterruptedException;

  // Optional
  void handleAsyncOperation(AsyncApiOperation e);
}

The following table shows which Repository interface methods are called when using this connector template.

Method                    Summary
Repository.init()         Called once by FullTraversalConnector.init() during initialization to enable the Repository to perform any set-up functions.
Repository.getAllDocs()   Called once for each scheduled traversal by FullTraversalConnector.traverse(). The Repository retrieves all of the data repository documents with their content for indexing.
Repository.getChanges()   Called once for each scheduled incremental traversal by FullTraversalConnector.handleIncrementalChanges() to retrieve modified documents and index them to Cloud Search. Optional. Use only when the data repository supports detection of modified documents.
Repository.close()        Called once by FullTraversalConnector.destroy() during connector shutdown to allow the Repository to perform any data repository cleanup functions.

Interact with the connector template asynchronously

There is also an asynchronous method available to the Repository interface for interacting with the connector template outside of the normal scheduled traversal calls. The handleAsyncOperation() method available in both connector template objects allows the Repository to make ApiOperation object calls from Repository object monitored events such as document change detection. Use this optional method only for data repositories that support monitoring of events.

To use this asynchronous call, access the RepositoryContext object passed into the Repository.init() method. Your Repository performs a call back to the connector template by passing an AsyncApiOperation object to the RepositoryContext.postAsyncOperation() method.

The following code snippet shows an example of an asynchronous delete detection operation.

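public class MyRepository implements Repository {
  // Saved in init() so that monitoring tasks can call back to the template.
  private RepositoryContext context;

  @Override
  public void init(RepositoryContext context) {
    this.context = context;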

    // other initializations...
  }

  // The following method is called from a monitoring task
  // when the data repository detects a deleted document.
  private void onDocumentRemoved(String docId) {
    AsyncApiOperation operation =
        new AsyncApiOperation(ApiOperations.deleteItem(docId));
    this.context.postAsyncOperation(operation);
  }

  // other implemented methods
}

Encapsulating all repository access functionality

If you use the Cloud Search SDK traversal templates to their full effect, the majority of your efforts are focused on accessing the specific data repository. The Repository interface encapsulates all the specific repository access functionality that is required of the connector.

The following code snippet shows the Repository interface methods.

public interface Repository {
  void init(RepositoryContext context) throws RepositoryException;

  CheckpointCloseableIterable<ApiOperation> getIds(@Nullable byte[] checkpoint)
      throws RepositoryException;

  CheckpointCloseableIterable<ApiOperation> getChanges(byte[] checkpoint)
      throws RepositoryException;

  CheckpointCloseableIterable<ApiOperation> getAllDocs(
      @Nullable byte[] checkpoint) throws RepositoryException;

  ApiOperation getDoc(Item item) throws RepositoryException;

  boolean exists(Item item) throws RepositoryException;

  void close();
}

Depending on the traversal type(s) that your connector employs, you might need to implement only a subset of these methods. See the traversal template descriptions for details on the methods required for the various traversal implementations.

Create ApiOperation types

All of the getters of the Repository interface return some type of ApiOperation object. You create most of the different operation types by using the ApiOperations factory class. The IndexingService object calls the ApiOperation.execute() method on each ApiOperation object to perform a specific Cloud Search action.

Represent an index document action

The RepositoryDoc object represents an index document action.

Use a RepositoryDoc object to index a document into Cloud Search. The Repository methods build these objects from the data repository and pass them back to the traversal connector template for document indexing. This object is most often returned from the Repository.getAllDocs(), Repository.getDoc(), and Repository.getChanges() method calls.

Use the RepositoryDoc.setContent() method when indexing content. This is preferable to using the API's Item.setContent() method, because the SDK's IndexingServiceImpl.indexItemAndContent() method determines whether to use an in-line or a data reference content upload when creating the item's ItemContent instance. The RepositoryDoc.setContent() method requires setting the content type using a ContentFormat value (html, text, or raw). This method also allows you to set the content hash value for connectors that use a listing traversal and whose repository does not have built-in change detection.
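
For a repository without built-in change detection, a content hash lets the connector notice that a document's bytes changed between traversals. The following is a minimal sketch of computing and comparing such a hash with the JDK only. The class name ContentHashSketch, the choice of SHA-256, and the hex encoding are assumptions made for illustration, and any stable digest would serve the same purpose here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the SDK: derives a hash string from
// document content so that two traversals can be compared.
class ContentHashSketch {
  // Returns the SHA-256 digest of the content as a lowercase hex string.
  static String sha256Hex(byte[] content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(content);
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  // A document counts as changed if no hash was stored or the hashes differ.
  static boolean contentChanged(String previousHash, byte[] currentContent) {
    return previousHash == null || !previousHash.equals(sha256Hex(currentContent));
  }

  public static void main(String[] args) {
    byte[] v1 = "first version".getBytes(StandardCharsets.UTF_8);
    byte[] v2 = "second version".getBytes(StandardCharsets.UTF_8);
    String storedHash = sha256Hex(v1);
    System.out.println(contentChanged(storedHash, v1)); // prints false
    System.out.println(contentChanged(storedHash, v2)); // prints true
  }
}
```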

For non-hierarchical data repositories, updating a single document for indexing may be the only function of this object. For hierarchical data repositories, in addition to updating a single parent document, the RepositoryDoc object might also add PushItem objects to enable the connector template to push the IDs of the parent document's children for later processing. In this case, the SDK executes multiple IndexingService object actions for a single RepositoryDoc object.

Because this operation is more complex than the others, it has its own builder. Use the RepositoryDoc.Builder object instead of ApiOperations static factory methods to create an instance of this object.

Represent document IDs to push to the queue

The PushItems object represents a list of document IDs to push to the Cloud Search queue.

Use the PushItems object to return root-level documents from a hierarchical data repository during a list traversal. This object's action does not cause any Cloud Search indexing to occur, nor does any new searchable content become available. Pushing IDs only queues documents for later polling by the ItemRetriever interface. The ListingConnector template's Repository.getIds() and Repository.getChanges() method calls usually return PushItems objects.

Use PushItems.Builder() to create an instance.
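
The push-then-poll flow for a hierarchical repository can be sketched as follows. This is a simplified model under stated assumptions: PushQueueSketch is a hypothetical class, an in-memory deque stands in for the Cloud Search queue, and "indexing" is reduced to appending the ID to a list. It shows that pushing root IDs indexes nothing by itself, and that each polled document can in turn push the IDs of its children.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a listing traversal over a hierarchical repository.
// The deque stands in for the Cloud Search queue; it is not the SDK's queue.
class PushQueueSketch {
  // Returns document IDs in the order in which they would be indexed.
  static List<String> traverse(Map<String, List<String>> childrenByParent,
                               List<String> rootIds) {
    // Pushing the root IDs only queues them; nothing is indexed yet.
    Deque<String> queue = new ArrayDeque<>(rootIds);
    List<String> indexed = new ArrayList<>();
    while (!queue.isEmpty()) {
      String docId = queue.poll(); // later polling of a queued ID
      indexed.add(docId);          // the document is indexed when it is processed
      // Indexing a parent also pushes its child IDs for later processing.
      queue.addAll(childrenByParent.getOrDefault(docId, Collections.<String>emptyList()));
    }
    return indexed;
  }

  public static void main(String[] args) {
    Map<String, List<String>> children = new HashMap<>();
    children.put("root", Arrays.asList("a", "b"));
    children.put("a", Arrays.asList("a1"));
    System.out.println(traverse(children, Arrays.asList("root"))); // prints [root, a, b, a1]
  }
}
```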

Represent a delete document action

The DeleteItem object represents a delete document action.

Use the DeleteItem object any time the Repository instance detects a document's removal from the data repository. For data repositories that can detect deleted documents, the Repository.getAllDocs(), Repository.getDoc(), and Repository.getChanges() method calls might return this object.

Use ApiOperations.deleteItem() to create an instance.

Represent an action for a set of modified documents

The CheckpointCloseableIterable<ApiOperation> object represents the actions for detected modified documents.

Use the CheckpointCloseableIterable<ApiOperation> object to return the modified documents during a scheduled incremental traversal Repository.getChanges() method call. Depending on the detection capabilities of the data repository, this object might contain ApiOperation objects that correspond to added, modified, or deleted documents.

This object also holds a checkpoint that enables your traversal code to pick up where it left off during the previous incremental traversal. Your Repository implementation defines the contents of this checkpoint.

Use CheckpointCloseableIterableImpl<ApiOperation> to create an instance.
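
Because the checkpoint is an opaque byte array whose contents the Repository defines, one simple option is to store a position in the repository's change log. The sketch below assumes exactly that. The classes CheckpointSketch and ChangeBatch are hypothetical stand-ins (ChangeBatch plays the role of CheckpointCloseableIterable<ApiOperation>), and the encoding of an int with ByteBuffer is just one possible checkpoint format.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of checkpointed incremental traversal. ChangeBatch is a
// stand-in for CheckpointCloseableIterable and is not the SDK type.
class CheckpointSketch {
  static final class ChangeBatch {
    final List<String> changedDocIds;
    final byte[] checkpoint;

    ChangeBatch(List<String> changedDocIds, byte[] checkpoint) {
      this.changedDocIds = changedDocIds;
      this.checkpoint = checkpoint;
    }
  }

  // Change log of the simulated repository, in the order changes occurred.
  private final List<String> changeLog = new ArrayList<>();

  void recordChange(String docId) {
    changeLog.add(docId);
  }

  // Returns the changes recorded after the position stored in the checkpoint.
  // A null checkpoint means that no incremental traversal has run yet.
  ChangeBatch getChanges(byte[] checkpoint) {
    int start = (checkpoint == null) ? 0 : ByteBuffer.wrap(checkpoint).getInt();
    List<String> changes = new ArrayList<>(changeLog.subList(start, changeLog.size()));
    byte[] next = ByteBuffer.allocate(Integer.BYTES).putInt(changeLog.size()).array();
    return new ChangeBatch(changes, next);
  }

  public static void main(String[] args) {
    CheckpointSketch repository = new CheckpointSketch();
    repository.recordChange("a");
    repository.recordChange("b");
    ChangeBatch first = repository.getChanges(null);
    System.out.println(first.changedDocIds); // prints [a, b]

    repository.recordChange("c");
    ChangeBatch second = repository.getChanges(first.checkpoint);
    System.out.println(second.changedDocIds); // prints [c]
  }
}
```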

Representing a data repository document

The Cloud Search API uses an Item object to represent a data repository document. The Item object contains all the information related to the document within Cloud Search, including its:

  • Content
  • ACL
  • Metadata
  • Structured data
  • View URL
  • Indexing queue
  • Indexing status

The SDK provides the IndexingItemBuilder object to assist in creating the Item object before indexing it to Cloud Search.

The following code snippet shows an example of building an Item with just required attributes.

IndexingService indexingService = … // stored from Connector.init()

String documentName = … // create a unique ID for the document
Acl acl = new Acl.Builder() // general public domain ACL
    .setReaders(Collections.singletonList(
        Acl.getCustomerPrincipal()))
    .build();
String viewUrl = … // create unique view URL for the document
byte[] version = … // create a version for the document

IndexingItemBuilder indexingItemBuilder =
    new IndexingItemBuilder(documentName)
        .setItemType(ItemType.CONTENT_ITEM)
        .setAcl(acl)
        .setUrl(IndexingItemBuilder.FieldOrValue.withValue(viewUrl))
        .setVersion(version);

Item myItem = indexingItemBuilder.build();

// document content is optional
// indexingService.indexItem(myItem, IndexItemMode.SYNCHRONOUS);
ByteArrayContent myContent = … // create searchable content

// can use RepositoryDoc template object here instead of this call
indexingService.indexItemAndContent(
    myItem,
    myContent,
    ContentFormat.TEXT,
    IndexItemMode.SYNCHRONOUS);

Index document content

You may have noticed that the document content is optional for indexing, which usually means that the document is only searchable from its metadata and structured data (if present). This use case can arise with fixed-record data repositories, such as certain database and CSV repositories, or with card-based repositories, such as a sales management system.

The IndexingItemBuilder object does not provide a setter method for content. Use either the IndexingService.indexItemAndContent() method or a RepositoryDoc object to add content to a document for indexing. Both of these methods handle either in-line content or, for larger content sizes, separate content uploads using IndexingService.startUpload(). The in-line and separate content upload calls correspond to the Cloud Search API update and index calls respectively.

Designate structured data object and data mapping

In addition to the Item object's required fields, the IndexingItemBuilder object also assists with setting up your document's structured data. If your data source has a schema associated with it, you can use the IndexingItemBuilder object setter methods to designate the document's structured data object and data mapping.

The structured data mapping consists of key/value pairs where the key is a defined field name within your structured data object, and the value is specific to your document. When retrieving each document from your data repository, build a key/value map that contains all of your structured data pairs.

The following table shows the setter methods for specifying structured data for your document.

Method Summary
setObjectType() Selects the structured data object to use within the data source's schema. A schema might have multiple object types defined within it, so this method selects which one of those to use.
setValues() Passes the map of structured data key/value pairs to the IndexingItemBuilder object. Any key name that matches a field within the structured data object is automatically populated with the document's value from the map. This method is also used for metadata. Just one call with a single map is used for both structured data and metadata.

Populate document metadata fields

You can also use the IndexingItemBuilder object to assist with setting up your document's metadata fields. The metadata field values are searchable along with the document's content within Cloud Search.

The following table shows setter methods for explicitly populating the document metadata fields.

Method Summary
setTitle() The document's title field
setLastModified() The document's last modified date time field
setCreationTime() The document's created date time field
setLanguage() The document's native language field

Designate an argument as a field or value

All of the metadata setter methods (and also the view URL setter method) receive a FieldOrValue object as their argument.

This object allows you to designate the argument as either a field within your value map, or as the actual value to use.

  • To specify that the argument is a field within the value map, use the FieldOrValue.withField() method.
  • To specify that the argument is the actual metadata value, use the FieldOrValue.withValue() method.

The advantage of using the withField() method is that the FieldOrValue object automatically handles null values and date/time type (including long) value conversions.

The following code snippet shows a use case for setting up an Item object's metadata.

// Fetch field key/value from repository
String myLanguage = getLanguageValueFromRepository();
Multimap<String, Object> multiMapValues =
    ArrayListMultimap.create();
multiMapValues.put("myTitleField", getTitleFromRepository());
multiMapValues.put("myViewUrl", getViewUrlFromRepository());
multiMapValues.put("myCreationDate", getCreationAsDatetime());
multiMapValues.put("myModifiedDate", getModifiedAsLong());
//...

// create the Item for doc 1
IndexingItemBuilder doc1Builder =
    new IndexingItemBuilder("document name 1")
        .setValues(multiMapValues);
// title uses the value from the map
doc1Builder.setTitle(FieldOrValue.withField("myTitleField"));
// view URL uses the value from the map
doc1Builder.setUrl(FieldOrValue.withField("myViewUrl"));
// creation date uses the value from the map
doc1Builder.setCreationTime(
    FieldOrValue.withField("myCreationDate"));
// modified date uses the value from the map
doc1Builder.setLastModified(
    FieldOrValue.withField("myModifiedDate"));
// but language metadata uses a stand alone value
doc1Builder.setLanguage(FieldOrValue.withValue(myLanguage));

// many other setters later...
Item doc1 = doc1Builder.build();
// do something with the Item...
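
The difference between withField() and withValue() can be modelled in a few lines. The sketch below is a hypothetical re-implementation of the idea only. FieldOrValueSketch is not the SDK's FieldOrValue class, its resolve() method is invented for illustration, and the date/time conversions mentioned above are left out. It shows how a field reference is looked up in the value map and how a missing field resolves to null instead of failing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical re-implementation of the field-or-value idea. The SDK's
// FieldOrValue class may differ in structure and behavior.
class FieldOrValueSketch<T> {
  private final String fieldName; // set when created with withField()
  private final T value;          // set when created with withValue()

  private FieldOrValueSketch(String fieldName, T value) {
    this.fieldName = fieldName;
    this.value = value;
  }

  static <T> FieldOrValueSketch<T> withField(String fieldName) {
    return new FieldOrValueSketch<T>(fieldName, null);
  }

  static <T> FieldOrValueSketch<T> withValue(T value) {
    return new FieldOrValueSketch<T>(null, value);
  }

  // Resolves to the literal value, or to the first non-null map value for the
  // field. Returns null when the field is absent, so a caller can skip it.
  Object resolve(Map<String, List<Object>> values) {
    if (fieldName == null) {
      return value;
    }
    List<Object> candidates =
        values.getOrDefault(fieldName, Collections.<Object>emptyList());
    for (Object candidate : candidates) {
      if (candidate != null) {
        return candidate;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Map<String, List<Object>> values = new HashMap<>();
    values.put("myTitleField", Arrays.<Object>asList("Quarterly report"));

    System.out.println(FieldOrValueSketch.withField("myTitleField").resolve(values)); // prints Quarterly report
    System.out.println(FieldOrValueSketch.withField("missingField").resolve(values)); // prints null
    System.out.println(FieldOrValueSketch.withValue("en").resolve(values)); // prints en
  }
}
```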

Building HTML formatted content for your documents

The ContentTemplate object enables you to build HTML formatted content for your document based on its underlying repository data. Set up this helper object once during initialization and then use it to format the content of each document as the connector retrieves it from the data repository. This object is useful for data repositories that are record-based and whose data is easily represented as a map of key/value pairs that equates to the document's record field values.

To initialize the ContentTemplate object, either directly classify the fields from the data repository by priority, or simply allow all the field definitions to be specified in the configuration file. In either case, you must specify a title field from the data repository. For the remainder of the data repository fields, you can optionally designate a relative search priority of high, medium, or low.

Build your content template manually

Use the ContentTemplate.Builder object to build your content template manually. The following table shows all the ContentTemplate.Builder object's setter methods.

Method Summary
setTitle() Sets the HTML title which is also the highest search priority for the content. This field is required.
setHighContent() Sets all the fields passed to a high search priority.
setMediumContent() Sets all the fields passed to a medium search priority.
setLowContent() Sets all the fields passed to a low search priority.
setIncludeFieldName() Indicates whether the field name is part of the searchable content along with its value.
setUnmappedColumnMode() Indicates whether any fields not explicitly specified in the priority setters should be part of the searchable content.

The following code snippet shows an example of manually building a ContentTemplate object.

// call this once during initialization to create and save template
ContentTemplate myTemplate = new ContentTemplate.Builder()
    .setTitle("myTitleField")
    .setHighContent(Arrays.asList("highField1", "highField2"))
    .setIncludeFieldName(true) // default
    .setUnmappedColumnMode(UnmappedColumnsMode.APPEND) // default
    .build();

Build your content template from configuration file parameters

To simplify coding for most use cases, you can also create the ContentTemplate object directly from configuration file parameters.

The following examples show both a section of the configuration file defining the content template and a code snippet to create the ContentTemplate object.

# Configuration File
contentTemplate.myTemplateName.title = myTitleField
contentTemplate.myTemplateName.quality.high = highField1,highField2
contentTemplate.myTemplateName.includeFieldName = true
contentTemplate.myTemplateName.unmappedColumnsMode = APPEND
# other parameters...

// Code Snippet
// call this once during initialization to create and save template
ContentTemplate myTemplate =
    ContentTemplate.fromConfiguration("myTemplateName");

Build document content from a template

Regardless of how you create the ContentTemplate object, the connector must store it so it can be used during data repository traversals to index documents. The ContentTemplate.apply() method uses the passed document's key/value map to create the document's HTML formatted content before indexing.

The following code snippet shows how to create document content from the previously saved ContentTemplate object and the document's key/value map.

// while looping through the repository data
Map<String, Object> dataValues = getDataValueMapForThisDocument();
String htmlContent = myTemplate.apply(dataValues);
// upload the content with this item...
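
To make the effect of applying a template more concrete, the following sketch renders a key/value map into HTML. It is a rough approximation under stated assumptions: ContentTemplateSketch is a hypothetical class, and the HTML tags chosen for each priority (a title element, h1 for high-priority fields, p for unmapped fields) are illustrative guesses rather than the markup the SDK actually emits.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical approximation of a content template. The tags used for each
// priority are assumptions, not the markup produced by the SDK.
class ContentTemplateSketch {
  private final String titleField;
  private final List<String> highFields;
  private final boolean includeFieldName;

  ContentTemplateSketch(String titleField, List<String> highFields,
                        boolean includeFieldName) {
    this.titleField = titleField;
    this.highFields = highFields;
    this.includeFieldName = includeFieldName;
  }

  // Builds HTML from one document's key/value map.
  String apply(Map<String, Object> values) {
    Object title = values.get(titleField);
    if (title == null) {
      throw new IllegalArgumentException("Missing required title field: " + titleField);
    }
    StringBuilder html = new StringBuilder();
    html.append("<html><head><title>").append(escape(title))
        .append("</title></head><body>");
    for (String field : highFields) {
      appendField(html, "h1", field, values.get(field));
    }
    // APPEND-style handling: fields not named above follow at lowest priority.
    for (Map.Entry<String, Object> entry : values.entrySet()) {
      String field = entry.getKey();
      if (!field.equals(titleField) && !highFields.contains(field)) {
        appendField(html, "p", field, entry.getValue());
      }
    }
    return html.append("</body></html>").toString();
  }

  private void appendField(StringBuilder html, String tag, String name, Object value) {
    if (value == null) {
      return; // fields without a value are skipped
    }
    html.append('<').append(tag).append('>');
    if (includeFieldName) {
      html.append(escape(name)).append(": ");
    }
    html.append(escape(value)).append("</").append(tag).append('>');
  }

  private static String escape(Object value) {
    return String.valueOf(value)
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  public static void main(String[] args) {
    Map<String, Object> values = new LinkedHashMap<>();
    values.put("myTitleField", "Q3 <draft>");
    values.put("highField1", "alpha");
    values.put("note", "beta");
    ContentTemplateSketch template = new ContentTemplateSketch(
        "myTitleField", Arrays.asList("highField1", "highField2"), true);
    System.out.println(template.apply(values));
  }
}
```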

Building a default ACL for every document

Every document indexed by Cloud Search must have associated access permissions. The Access Control List (ACL) specifies a document's allowed and denied readers, which then determines whether any given user can see the document in their search results. In addition to using permissions supplied in the data repository, the connector can also supply a default ACL for documents.

Using a default ACL is especially useful if the data repository does not contain complete ACL information. In this case, your connector can use a DefaultAcl object to set a common default ACL for every document.

The SDK defines three modes available for applying a default ACL. The mode is set during default ACL creation described in the next section. The fallback mode indicates that the default ACL is applied to the document only if no other ACL information has been assigned to the document. The append mode indicates that the default ACL is merged or appended to any existing document ACL information. The override mode indicates that the default ACL replaces any existing document ACL information.
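
The three modes can be compared side by side with a small model. In the sketch below, an ACL is reduced to a set of reader names, and DefaultAclModeSketch with its apply() method is hypothetical. The SDK's own mechanism differs, so treat this only as an illustration of the semantics of each mode. The NONE value corresponds to default ACLs not being used at all.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the default ACL modes. An ACL is reduced to a set of
// reader names; this is not how the SDK represents or applies ACLs.
class DefaultAclModeSketch {
  enum Mode { NONE, FALLBACK, APPEND, OVERRIDE }

  // Returns the readers that result from applying the default ACL.
  static Set<String> apply(Mode mode, Set<String> documentReaders,
                           Set<String> defaultReaders) {
    Set<String> result = new LinkedHashSet<>(documentReaders);
    switch (mode) {
      case FALLBACK:
        // Use the default only if the document has no ACL information.
        if (result.isEmpty()) {
          result.addAll(defaultReaders);
        }
        break;
      case APPEND:
        // Merge the default into the existing ACL information.
        result.addAll(defaultReaders);
        break;
      case OVERRIDE:
        // Replace the existing ACL information with the default.
        result.clear();
        result.addAll(defaultReaders);
        break;
      default:
        // NONE: the default ACL is not used.
        break;
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> documentReaders = new LinkedHashSet<>(Arrays.asList("alice"));
    Set<String> defaultReaders = new LinkedHashSet<>(Arrays.asList("everyone"));
    Set<String> noReaders = Collections.emptySet();

    System.out.println(apply(Mode.FALLBACK, noReaders, defaultReaders));       // prints [everyone]
    System.out.println(apply(Mode.FALLBACK, documentReaders, defaultReaders)); // prints [alice]
    System.out.println(apply(Mode.APPEND, documentReaders, defaultReaders));   // prints [alice, everyone]
    System.out.println(apply(Mode.OVERRIDE, documentReaders, defaultReaders)); // prints [everyone]
  }
}
```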

The connector creates the DefaultAcl object during connector initialization and then stores it for use during repository traversals. To create the DefaultAcl object, the connector either manually assigns readers, or simply allows all the ACL parameters to be specified in the configuration file.

The DefaultAcl object creates a 'VIRTUAL_CONTAINER_ITEM' node with the configured readers and denied readers. Each item that uses DefaultAcl inherits its ACLs from the 'VIRTUAL_CONTAINER_ITEM' node. By default, DefaultAcl names the 'VIRTUAL_CONTAINER_ITEM' node 'DEFAULT_ACL_VIRTUAL_CONTAINER'. You can override the name of the 'VIRTUAL_CONTAINER_ITEM' node either programmatically or through configuration.

Build the default ACL manually

Use the DefaultAcl.Builder object to manually build your DefaultAcl object. The following table shows all the DefaultAcl.Builder object's setter methods.

Method Summary
setReaderUsers() Assigns all reader Principal objects to the ACL.
setReaderGroups() Assigns all reader Principal group objects to the ACL.
setDeniedReaderUsers() Assigns all denied reader Principal objects to the ACL.
setDeniedReaderGroups() Assigns all denied reader Principal group objects to the ACL.
setDefaultAclName() Assigns the name of the VIRTUAL_CONTAINER_ITEM that represents the ACLs defined by DefaultAcl.
setMode() Determines how the default ACL is applied. Default: none. The available modes are:
  • none: Default ACLs are not used.
  • fallback: Use the default ACL only if an ACL does not already exist for the document.
  • append: Add or append the default ACL information to the existing document's ACL.
  • override: Replace the existing document's ACL with the default ACL.
setIsPublic() Determines whether a generic public ACL should be used instead of an ACL specified explicitly. Default: true.

The following code snippet shows an example of manually building a DefaultAcl object.

// call this during initialization to create and save a default ACL
// Use mutable lists: Collections.emptyList() is immutable, so add() would throw.
List<Principal> readerUsers = new ArrayList<>(); // or getTheListOfReaderUsers();
readerUsers.add(Acl.getUserPrincipal("user1@acme.com"));
readerUsers.add(Acl.getUserPrincipal("user2@acme.com"));
readerUsers.add(Acl.getUserPrincipal("google:user3@acme.com"));

List<Principal> readerGroups = new ArrayList<>(); // or getTheListOfReaderGroups();
readerGroups.add(Acl.getGroupPrincipal("group1@acme.com"));
readerGroups.add(Acl.getGroupPrincipal("google:group2@acme.com"));

List<Principal> deniedUsers = getTheListOfDeniedUsers();
List<Principal> deniedGroups = getTheListOfDeniedGroups();
DefaultAcl defaultAcl = new DefaultAcl.Builder()
    .setMode(DefaultAcl.DefaultAclMode.FALLBACK)
    .setIsPublic(false)
    .setReaderUsers(readerUsers)
    .setReaderGroups(readerGroups)
    .setDeniedReaderUsers(deniedUsers)
    .setDeniedReaderGroups(deniedGroups)
    .setIndexingService(indexingService)
    .build();

When creating a user or group Principal, the default is an external ID. Prefix the ID with "google:" when using a Google user or group ID.
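
The prefix convention can be illustrated with a short parser. This sketch is hypothetical: PrincipalIdSketch and ParsedPrincipal are not SDK types, and the SDK's own handling of the prefix may differ in detail. It only demonstrates how a configured ID is read as external by default and as a Google ID when it starts with "google:".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for configured principal IDs; not part of the SDK.
class PrincipalIdSketch {
  static final String GOOGLE_PREFIX = "google:";

  static final class ParsedPrincipal {
    final String id;
    final boolean googleId;

    ParsedPrincipal(String id, boolean googleId) {
      this.id = id;
      this.googleId = googleId;
    }

    @Override
    public String toString() {
      return (googleId ? "google " : "external ") + id;
    }
  }

  // IDs are external by default; the "google:" prefix marks a Google ID.
  static ParsedPrincipal parse(String configured) {
    String trimmed = configured.trim();
    if (trimmed.startsWith(GOOGLE_PREFIX)) {
      return new ParsedPrincipal(trimmed.substring(GOOGLE_PREFIX.length()), true);
    }
    return new ParsedPrincipal(trimmed, false);
  }

  // Parses a comma-separated list, skipping empty entries.
  static List<ParsedPrincipal> parseList(String commaSeparated) {
    List<ParsedPrincipal> result = new ArrayList<>();
    for (String part : commaSeparated.split(",")) {
      if (!part.trim().isEmpty()) {
        result.add(parse(part));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parseList("user1@acme.com, google:user3@acme.com"));
    // prints [external user1@acme.com, google user3@acme.com]
  }
}
```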

Supply the default ACL in the configuration file

To simplify coding for most use cases, you can have the SDK automatically create the DefaultAcl object directly from configuration file parameters.

The following examples show both a section of the configuration file defining the default ACL parameters and a code snippet to create the DefaultAcl object.

# Configuration File
repository.defaultAcl.mode = fallback
repository.defaultAcl.public = false
repository.defaultAcl.readers.users = \
    user1@acme.com, user2@acme.com, google:user3@acme.com
repository.defaultAcl.readers.groups = \
    group1@acme.com, google:group2@acme.com
repository.defaultAcl.denied.users = ...
repository.defaultAcl.denied.groups = ...
repository.defaultAcl.name = DEFAULT_ACL_CONNECTOR_1
# other parameters...

// Code Snippet
// call this once during initialization to create and save the default ACL
DefaultAcl myDefaultAcl = DefaultAcl.fromConfiguration(indexingService);

Use the default ACL

Regardless of how the DefaultAcl object was created, the connector must store it for use during data repository traversals when indexing documents. When the DefaultAcl object is enabled, the applyToIfEnabled() method sets the document's ACL to the previously created default value, depending on the configured mode.

The following code snippet shows how to set a document's ACL from the previously saved DefaultAcl object.

// while looping through the repository data
Item item = ... // create the document
myDefaultAcl.applyToIfEnabled(item);
// index the item...
