Content Connector SDK Overview

This guide is intended for Google Cloud Search content connector developers. It provides an introduction to the Cloud Search SDK, including general concepts and specific programming objects and how they interact with each other. The guide assumes that you are familiar with software development concepts and Java programming.

For information about the tasks that the G Suite administrator must perform to set up the Cloud Search data source and create accounts, see Manage third-party data sources. For the specific steps to develop a connector using the SDK, see the Content Connector SDK Guide.

Introduction

The Cloud Search API provides the connector developer with a RESTful web interface for indexing data into Cloud Search. The API objects and methods provide all the capabilities needed to manage data for Cloud Search. These capabilities include Indexing API requests for indexing, retrieving, and deleting documents in the Cloud Search index.

Although it is possible to write a connector application using just the API, this can be time-consuming and cumbersome. To accelerate connector development and make it easier to build robust, high-performing connectors, Google provides the Cloud Search connector SDK, which implements common connector tasks in the following areas:

  • Google Cloud Search service communication
  • Multi-threading indexing API operations
  • Traversal strategies for the data repository
  • Traversal scheduling
  • Connector configuration
  • Content formatting for indexing
  • Error handling
  • Utility helper functions

The SDK acts as a software layer between the connector developer and the API. Because the SDK provides many of the capabilities needed by all connectors, connector developers can focus on just the part of the connector dedicated to accessing the data source, without having to write any redundant code.

A typical process for developing a connector involves using three major classes of SDK objects:

  • Primary objects
  • Template objects
  • Helper objects

This document gives you an overview of each class of object. For additional information, including the syntax specific to each object, see the Cloud Search Connector SDK Reference and the Javadocs for the entire Cloud Search SDK.

Primary objects

You can understand the main SDK framework by first studying a few important classes that are grouped together as primary objects. These objects interact with each other to execute your connector code. Everything else in the SDK is built on top of these objects. By understanding their function and relationship to each other, you will be able to begin basic connector development.

The following diagram presents the core objects that you should understand before working with the SDK.

[Diagram: primary SDK objects and their interactions]

  1. IndexingConnector.main() calls IndexingApplication.build(), passing the command-line args, which typically contain the configuration file path.
  2. IndexingApplication.build() calls Configuration.initConfig() to read the configuration file, creates an IndexingService instance, and creates an IndexingConnectorContext instance that holds the IndexingService instance.
  3. The IndexingConnector instance calls IndexingApplication.start(), which starts the IndexingService instance, calls IndexingConnector.init() with the IndexingConnectorContext instance, and starts the ConnectorTraverser instance with the IndexingConnector instance.
  4. The ConnectorTraverser instance makes traversal calls on the IndexingConnector instance based on schedule parameters from the configuration file. The IndexingConnector instance can reach the IndexingService instance through its IndexingConnectorContext instance to make API requests.
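
The four steps above can be sketched in plain Java. This is only an illustration of the call order: every type below is a simplified stand-in declared inside the example, not the real SDK class, and the build(connector, args) signature and the log strings are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-ins for the SDK types; not the real SDK classes or signatures. */
final class StartupSketch {
  /** Records each call so the order can be inspected. */
  static final List<String> LOG = new ArrayList<>();

  interface IndexingService { void start(); }
  interface IndexingConnectorContext { IndexingService getIndexingService(); }
  interface IndexingConnector {
    void init(IndexingConnectorContext context);
    void traverse();
  }

  /** Hypothetical application object that mirrors steps 1 to 4. */
  static final class IndexingApplication {
    private final IndexingConnector connector;
    private final IndexingService service = () -> LOG.add("service.start");
    private final IndexingConnectorContext context = () -> service;

    private IndexingApplication(IndexingConnector connector) { this.connector = connector; }

    static IndexingApplication build(IndexingConnector connector, String[] args) {
      // Steps 1 and 2: note which configuration file was named on the command line.
      LOG.add("config.read:" + (args.length > 0 ? args[0] : "<none>"));
      return new IndexingApplication(connector);
    }

    void start() {
      service.start();          // Step 3: start the service,
      connector.init(context);  // then hand the context to the connector.
      connector.traverse();     // Step 4: a real traverser would schedule this call.
    }
  }

  static final class DemoConnector implements IndexingConnector {
    private IndexingService service; // saved for later API requests

    @Override public void init(IndexingConnectorContext context) {
      service = context.getIndexingService();
      LOG.add("connector.init");
    }

    @Override public void traverse() {
      LOG.add(service == null ? "connector.traverse:no-service" : "connector.traverse");
    }
  }

  static List<String> run(String... args) {
    LOG.clear();
    IndexingApplication.build(new DemoConnector(), args).start();
    return new ArrayList<>(LOG);
  }

  public static void main(String[] args) {
    System.out.println(run(args));
  }
}
```

The connector only sees the context and the service it exposes; everything about configuration and scheduling stays on the application side.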

The following table summarizes the primary objects.

Class/Interface | Summary
IndexingConnector interface | Performs all data repository access and controls how documents are indexed by Cloud Search.
IndexingApplication class, Configuration class | Builds and manages internal instances, as shown in the diagram.
IndexingService interface, IndexingConnectorContext interface | Manages the indexing of repository documents.
ConnectorTraverser class | Manages document retrievals by scheduling traversals of the data repository.

IndexingConnector interface

The interface that you, as the developer, should first focus on is the IndexingConnector. This is the foundation of the connector execution and primary access point of the SDK.

The program's main() method begins connector execution by accessing the SDK through the IndexingApplication object. During initialization, the SDK calls the connector's init() method so that the connector can perform any connector-specific pre-traversal initialization.

After initialization, the connector receives scheduled IndexingConnector.traverse() calls from the SDK, during which the connector performs data repository access and retrieval.

In addition to performing entire repository traversals, the connector can optionally implement one or both of the following interfaces:

ItemRetriever interface

Depending on your connector's traversal strategy, you might need to implement the ItemRetriever interface. Use this interface when the connector is performing list or graph traversals, where you push document ids and have the SDK poll the pushed ids one at a time for processing.
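
The push-then-poll pattern described above can be sketched as follows. The IdQueue class is a local stand-in for the Cloud Search queue, and the single-method ItemRetriever shape is an assumption for illustration; the real interface and queue behavior may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class ListTraversalSketch {
  /** Local stand-in for the Cloud Search queue; the real queue lives in the service. */
  static final class IdQueue {
    private final Queue<String> ids = new ArrayDeque<>();
    void push(List<String> newIds) { ids.addAll(newIds); }
    String poll() { return ids.poll(); } // returns null when the queue is empty
  }

  /** Hypothetical single-method shape of an item retriever. */
  interface ItemRetriever { void process(String id); }

  static List<String> run() {
    IdQueue queue = new IdQueue();
    List<String> indexed = new ArrayList<>();

    // Traversal phase: push ids only, with no content or metadata.
    queue.push(List.of("doc-1", "doc-2", "doc-3"));

    // Polling phase: ids are handed to the retriever one at a time.
    ItemRetriever retriever = docId -> indexed.add("indexed:" + docId);
    for (String id = queue.poll(); id != null; id = queue.poll()) {
      retriever.process(id);
    }
    return indexed;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```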

IncrementalChangeHandler interface

The IncrementalChangeHandler interface is useful if your data repository supports change detection. This interface enables the SDK to schedule partial repository traversals on a configurable schedule, similar to configuring entire repository traversals.
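
One common way to support change detection is a checkpoint, such as the last-modified timestamp already processed. The sketch below assumes a repository that exposes such timestamps; the class, its fields, and the return type are hypothetical and are not taken from the SDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IncrementalSketch {
  /** Hypothetical repository state: document id mapped to last-modified time. */
  private final Map<String, Long> lastModified = new LinkedHashMap<>();
  /** Checkpoint: the newest timestamp that has already been handled. */
  private long checkpoint = 0L;

  void put(String id, long modifiedAt) { lastModified.put(id, modifiedAt); }

  /** Returns ids modified after the checkpoint, then advances the checkpoint. */
  List<String> handleIncrementalChanges() {
    List<String> changed = new ArrayList<>();
    long newest = checkpoint;
    for (Map.Entry<String, Long> entry : lastModified.entrySet()) {
      if (entry.getValue() > checkpoint) {
        changed.add(entry.getKey());
        newest = Math.max(newest, entry.getValue());
      }
    }
    checkpoint = newest;
    return changed;
  }
}
```

Each call returns only what changed since the previous call, which is what makes a partial traversal cheaper than a full one.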

IndexingApplication class

The SDK's point of entry for the connector developer is the IndexingApplication object. Your connector program's main() typically builds an instance of this object by passing in an IndexingConnector instance as an argument.

The execution begins when your connector's main() calls the IndexingApplication.start() method. At this point, your connector gives execution control over to the SDK, which begins initialization by:

  1. Reading the configuration file.
  2. Creating an IndexingService object.
  3. Scheduling traversals.
  4. Calling the Connector.init() method.

Configuration class

The administrator for your connector can define your connector's behavior by creating a configuration file. The configuration file consists of parameters defined by key/value pairs. The SDK reads and stores these parameters using a Configuration object. After initialization, your connector code can access these parameters by using getters from this static object.

There are standardized parameters that the SDK uses for all connectors, including parameters for identifying the data source and scheduling traversals. You can also define connector-specific parameters that only your connector uses. For example, a database connector would have parameters that define the database URL and the user name and password needed to access the repository.
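
As a rough illustration of key/value configuration with getters, the sketch below parses a properties-style file with the standard java.util.Properties class. It is not the SDK's Configuration class, and the key names (example.sourceId, example.traversalIntervalSecs, db.url) are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Stand-in for a configuration holder; not the SDK's Configuration class. */
final class ConfigSketch {
  private final Properties props = new Properties();

  /** Parses key=value lines, as found in a Java properties file. */
  ConfigSketch(String fileContents) {
    try {
      props.load(new StringReader(fileContents));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  String getString(String key, String defaultValue) {
    return props.getProperty(key, defaultValue);
  }

  int getInt(String key, int defaultValue) {
    String raw = props.getProperty(key);
    return raw == null ? defaultValue : Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    ConfigSketch config = new ConfigSketch(
        "example.sourceId=abc123\n"                 // hypothetical standard parameter
        + "example.traversalIntervalSecs=3600\n"    // hypothetical scheduling parameter
        + "db.url=jdbc:h2:mem:demo\n");             // hypothetical connector-specific parameter
    System.out.println(config.getString("db.url", ""));
    System.out.println(config.getInt("example.traversalIntervalSecs", 60));
  }
}
```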

IndexingService interface

The SDK creates an IndexingService instance during initialization that the connector uses to perform the actual API calls to interact with Cloud Search. All the methods for document pushing, polling, indexing, retrieving, deleting, and so on are contained within this object.

The IndexingService object is made available to the connector during the IndexingConnector.init() method call, through the IndexingConnectorContext argument. Your connector should save this instance for later use, when the SDK calls your connector's traversal methods.

IndexingConnectorContext interface

An IndexingConnectorContext object is passed to the IndexingConnector from the IndexingApplication in the IndexingConnector.init() method. The connector uses this instance to:

  • Retrieve the SDK-created IndexingService object for use during document traversals to interact with Cloud Search.
  • Define how to poll document ids from the Cloud Search queue when your connector uses the ItemRetriever interface.

ConnectorTraverser class

The ConnectorTraverser object performs connector traversal scheduling and execution. The SDK creates this object during initialization using the connector scheduling parameters defined in the configuration file.

The scheduled traversals run in parallel in separate threads, including:

  • IndexingConnector.traverse() — The main connector traversal
  • IncrementalChangeHandler.handleIncrementalChanges() — The optional incremental traversal
  • ItemRetriever.process() — The optional polling of pushed document ids used by a list traversal
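
Running several traversals on independent schedules in separate threads can be approximated with the standard ScheduledExecutorService. The sketch below is not how ConnectorTraverser is implemented; the task names reuse the method names from the list above, and the periods are arbitrary values chosen for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class TraverserSketch {
  /** Schedules three periodic tasks and returns the names that ran at least once. */
  static Set<String> runScheduled() {
    Set<String> ran = ConcurrentHashMap.newKeySet();
    CountDownLatch allRan = new CountDownLatch(3);
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(3);
    try {
      schedule(pool, "traverse", 50, ran, allRan);
      schedule(pool, "handleIncrementalChanges", 20, ran, allRan);
      schedule(pool, "process", 10, ran, allRan);
      if (!allRan.await(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("tasks did not run in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow(); // stop the periodic tasks
    }
    return Set.copyOf(ran);
  }

  private static void schedule(ScheduledExecutorService pool, String name,
      long periodMillis, Set<String> ran, CountDownLatch allRan) {
    pool.scheduleAtFixedRate(() -> {
      // Count each task only the first time it runs.
      if (ran.add(name)) {
        allRan.countDown();
      }
    }, 0, periodMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) {
    System.out.println(runScheduled());
  }
}
```

Because the tasks share state across threads, the sketch uses a concurrent set; connector code that is called from several traversal threads needs the same care.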

Template objects

The next set of SDK classes is grouped together as template objects. These objects are intended to reduce your connector code development to just accessing your data repository (be it a database, CRM system, or any other type of container of your data). The template objects interact closely with the primary objects, so a basic understanding of the primary objects is necessary. Though optional, the template objects are the recommended starting point for writing connector code.

The following diagram presents both the template objects and the primary SDK objects. As you can see, by using one of the existing SDK connector templates, your implementation of just the Repository class contains the majority of the code that you are required to write.

[Diagram: template objects and primary SDK objects]

  1. Your connector's main() method selects the appropriate connector template, injects its Repository implementation, and calls IndexingApplication.start().
  2. The IndexingApplication instance reads the configuration file, schedules traversals, and calls the repository initialization methods.
  3. The ConnectorTraverser calls the connector template's traversal methods on the configured schedule.
  4. The connector template calls Repository methods to fetch data.
  5. The connector template sends API operations to the indexing service, which converts them to API calls.

Though optional, these SDK template objects are recommended for most connector developers, where applicable, to speed development.

The following table summarizes the template objects.

Class/Interface | Summary
ListingConnector class | Performs a list or graph traversal of your repository.
FullTraversalConnector class, Repository interface | Performs a full traversal of your repository.
PushItems class | Represents a push API request for documents.
DeleteItem class | Represents a delete API request for a document.
CheckpointCloseableIterable<ApiOperation> class | Represents an incremental update API request for documents.

Traversal connector template classes

Traversing data repositories is the primary function of any connector. Depending on the type of repository, a specific traversal strategy is employed.

As the connector developer, your first decision is to select the SDK traversal connector template that is most appropriate for your repository:

ListingConnector class

The ListingConnector object implements the methods of the IndexingConnector interface and also the methods of both the ItemRetriever and IncrementalChangeHandler interfaces.

The following table provides an overview of these implementations.

Interface | Method | Calls
IndexingConnector | IndexingConnector.init() | Repository.init() method for initialization.
IndexingConnector | IndexingConnector.traverse() | Repository.getIds() method for pushing document ids that are polled later for updating.
ItemRetriever | ItemRetriever.process() | Repository.getDoc() method to update the documents that were pushed from the previous Connector.traverse() calls.
IncrementalChangeHandler | IncrementalChangeHandler.handleIncrementalChanges() | Repository.getChanges() method to traverse only modified documents from the data repository for indexing.

Note: This is an optional feature of this template connector that is only implemented in the Repository when the repository supports modified document detection.

FullTraversalConnector class

The FullTraversalConnector object implements the IndexingConnector and IncrementalChangeHandler interfaces.

The following table provides an overview of these implementations.

Interface | Method | Calls
IndexingConnector | IndexingConnector.init() | Repository.init() method for initialization.
IndexingConnector | IndexingConnector.traverse() | Repository.getAllDocs() method for indexing all the data repository's documents.
IncrementalChangeHandler | IncrementalChangeHandler.handleIncrementalChanges() | Repository.getChanges() method to traverse only modified documents from the data repository for indexing.

Note: This is an optional feature of this template connector that is only implemented in the Repository when the repository supports modified document detection.
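
The delegation in the table above can be sketched as a template class that forwards each connector call to a repository. All types here are simplified stand-ins declared in the example; the real Repository methods return ApiOperation objects and take SDK-specific arguments, so the signatures below are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class FullTraversalTemplateSketch {
  /** Simplified stand-in for the Repository interface; real signatures differ. */
  interface Repository {
    void init();
    List<String> getAllDocs();
    List<String> getChanges(); // optional in the SDK; required here for brevity
  }

  /** Hypothetical template that forwards connector calls to the repository. */
  static final class FullTraversalConnector {
    private final Repository repository;
    final List<String> indexed = new ArrayList<>();

    FullTraversalConnector(Repository repository) { this.repository = repository; }

    void init() { repository.init(); }
    void traverse() { indexed.addAll(repository.getAllDocs()); }
    void handleIncrementalChanges() { indexed.addAll(repository.getChanges()); }
  }

  /** The only class a connector developer would write in this sketch. */
  static final class InMemoryRepository implements Repository {
    private boolean initialized;

    @Override public void init() { initialized = true; }
    @Override public List<String> getAllDocs() { requireInit(); return List.of("doc-a", "doc-b"); }
    @Override public List<String> getChanges() { requireInit(); return List.of("doc-b"); }

    private void requireInit() {
      if (!initialized) throw new IllegalStateException("init() was not called");
    }
  }

  static List<String> run() {
    FullTraversalConnector connector = new FullTraversalConnector(new InMemoryRepository());
    connector.init();
    connector.traverse();                 // full traversal
    connector.handleIncrementalChanges(); // later, only modified documents
    return connector.indexed;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```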

Repository interface

When you use the full range of template objects described in this guide, the only significant coding required of the connector developer is the implementation of the Repository interface. Only a subset of the Repository interface methods is required, depending on the traversal connector template (ListingConnector or FullTraversalConnector) chosen for your repository.

The sole purpose of the Repository object is to perform the access and retrieval of the specific data repository's documents. Each of the methods of this object returns some type of ApiOperation object(s). An ApiOperation object performs an action in the form of a single or perhaps multiple IndexingService calls.

ApiOperation interface

The ApiOperation interface encapsulates a specific Cloud Search service request. Your Repository methods return ApiOperation objects to the Connector object, which then uses the ApiOperation.execute() method to make the corresponding IndexingService calls to the API.

Your Repository methods use the static helper class ApiOperations to create most of the individual ApiOperation objects as described in the following table.

Class | Summary
RepositoryDoc class | Use the RepositoryDoc object to index a document. This object contains all the document properties including content, ACLs, metadata, structured data, and so on. It may also contain the ids of its children if the data repository is hierarchical.
PushItems class | Use the PushItems object to push multiple document ids to the Cloud Search queue for future processing. This object contains only document id values (no content or metadata) and is only used by the ListingConnector template.
DeleteItem class | Use the DeleteItem object to remove a document from Cloud Search. This object is used by data repositories that support some form of document delete detection.
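
The idea of an operation object that is created by the repository and executed later against the service can be sketched as follows. The RecordingService class, the execute(service) signature, and the three factory methods are hypothetical stand-ins; they only loosely mirror the classes in the table above.

```java
import java.util.ArrayList;
import java.util.List;

final class ApiOperationSketch {
  /** Stand-in service that records requests instead of calling Cloud Search. */
  static final class RecordingService {
    final List<String> requests = new ArrayList<>();
    void index(String id, String content) { requests.add("index " + id + ": " + content); }
    void push(String id) { requests.add("push " + id); }
    void delete(String id) { requests.add("delete " + id); }
  }

  /** Hypothetical shape: one operation may issue one or several service calls. */
  interface ApiOperation { void execute(RecordingService service); }

  // Hypothetical factory methods, in the spirit of a static helper class.
  static ApiOperation indexDoc(String id, String content) {
    return service -> service.index(id, content);
  }

  static ApiOperation pushItems(List<String> ids) {
    return service -> ids.forEach(service::push); // one call per pushed id
  }

  static ApiOperation deleteItem(String id) {
    return service -> service.delete(id);
  }

  static List<String> run() {
    RecordingService service = new RecordingService();
    // A repository method would return operations like these to the connector.
    List<ApiOperation> operations = List.of(
        indexDoc("doc-1", "hello"),
        pushItems(List.of("doc-2", "doc-3")),
        deleteItem("doc-4"));
    for (ApiOperation operation : operations) {
      operation.execute(service);
    }
    return service.requests;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Keeping the repository free of direct service calls is what lets the template decide when and how each request is sent.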

Helper objects

The final set of SDK classes described in this overview are grouped together as helper objects. These objects assist with connector development by performing specific common tasks. Google recommends using these classes wherever appropriate to save development time, even if you do not use the template objects.

The following table summarizes the helper objects.

Class/Interface | Summary
IndexingItemBuilder class | Helps create fully formed documents ready for indexing.
ContentTemplate class | Helps in formatting searchable document content.
DefaultAcl class | Helps in defining a default Access Control List (ACL) that is used during document indexing when no specific ACL is assigned to the document.

IndexingItemBuilder class

The IndexingItemBuilder class allows you to create a completely formed Item object by using its setter methods. Each data repository document is represented internally as an API Item object. An Item contains all the information that Cloud Search needs to index a document, including its:

  • Document id
  • Metadata
  • Content
  • Structured data values
  • ACL information

Along with the fields already mentioned, there are many other fields contained within an Item object that you should understand when writing connector code. Consult the SDK reference and links within this guide for more detailed information.
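
The builder style described above can be illustrated with a small self-contained sketch. The Item and Builder classes below are reduced stand-ins with only four fields; the method names are assumptions and do not reproduce the IndexingItemBuilder API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ItemBuilderSketch {
  /** Simplified stand-in for an API Item; the real Item has many more fields. */
  static final class Item {
    final String id;
    final String content;
    final Map<String, String> metadata;
    final List<String> readers;

    private Item(String id, String content, Map<String, String> metadata, List<String> readers) {
      this.id = id;
      this.content = content;
      this.metadata = metadata;
      this.readers = readers;
    }
  }

  /** Hypothetical builder: collect values through setters, then build once. */
  static final class Builder {
    private final String id;
    private String content = "";
    private final Map<String, String> metadata = new LinkedHashMap<>();
    private List<String> readers = List.of();

    Builder(String id) {
      if (id == null || id.isEmpty()) {
        throw new IllegalArgumentException("a document id is required");
      }
      this.id = id;
    }

    Builder setContent(String content) { this.content = content; return this; }
    Builder putMetadata(String key, String value) { metadata.put(key, value); return this; }
    Builder setReaders(List<String> readers) { this.readers = List.copyOf(readers); return this; }

    /** Returns an Item holding immutable copies of the collected values. */
    Item build() { return new Item(id, content, Map.copyOf(metadata), readers); }
  }
}
```

A usage example under the same assumptions: `new ItemBuilderSketch.Builder("doc-1").setContent("Hello").putMetadata("title", "Greeting").build()`.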

ContentTemplate class

The ContentTemplate object enables you to specify a data format that can be applied to your document's data field values each time a document is retrieved from the data repository. Each document indexed within Cloud Search is searchable based on its structured data and content.

During traversals, the Repository object creates each document's content from the repository data. The format of the data is not restricted by the SDK, but if the data repository creates content from a map of data field names and values (for example, a database, CSV file, CRM system, or similar), you may find the SDK ContentTemplate helper class useful.

Your Repository.init() method typically creates the ContentTemplate instance by specifying the names and priority weights of each of your data's fields. During data repository traversals, the ContentTemplate object is applied to the data fields, creating a fully formatted string that can be used as content when creating a RepositoryDoc object.
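
To make the idea of a field template concrete, the sketch below declares field names with priorities once and then applies them to a map of values. It is a hypothetical plain-text formatter written for this example: the Priority levels, the add() and apply() methods, and the "name: value" output format are assumptions, not the behavior of the SDK's ContentTemplate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ContentTemplateSketch {
  /** Hypothetical priority levels; declaration order doubles as output order. */
  enum Priority { HIGH, MEDIUM, LOW }

  private static final class Field {
    final String name;
    final Priority priority;

    Field(String name, Priority priority) {
      this.name = name;
      this.priority = priority;
    }
  }

  private final List<Field> fields = new ArrayList<>();

  /** Registers a field name and its priority, typically once at startup. */
  ContentTemplateSketch add(String name, Priority priority) {
    fields.add(new Field(name, priority));
    return this;
  }

  /** Renders registered fields, highest priority first; missing values are skipped. */
  String apply(Map<String, String> values) {
    StringBuilder out = new StringBuilder();
    for (Priority priority : Priority.values()) {
      for (Field field : fields) {
        String value = values.get(field.name);
        if (field.priority == priority && value != null) {
          out.append(field.name).append(": ").append(value).append('\n');
        }
      }
    }
    return out.toString();
  }
}
```

The template is built once and reused for every document, so the per-document work during a traversal is only the call to apply().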

DefaultAcl class

The DefaultAcl object allows you to specify a common Access Control List (ACL) for Cloud Search to use for your entire repository. Every document has an ACL that specifies its accessibility within Cloud Search. Each document typically has an ACL object containing a set of allowed and denied readers and/or groups associated with it. The ACL object determines document accessibility at search time. If the repository does not support accessibility rules for some or all of its documents, you can use this helper class to specify defaults.

To use a DefaultAcl object, your Repository.init() method typically creates the instance at startup. During traversals, the Repository can then call the DefaultAcl.applyToIfEnabled() method to set the document's ACL to a default depending on the configured mode.
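
A mode-dependent default can be sketched as below. The three modes (NONE, FALLBACK, OVERRIDE), the reader strings, and the list-based signature are assumptions chosen for the illustration; only the method name applyToIfEnabled is borrowed from the text above, and the real method operates on SDK objects rather than lists.

```java
import java.util.List;

final class DefaultAclSketch {
  /** Hypothetical modes: do nothing, fill in only when missing, or always replace. */
  enum Mode { NONE, FALLBACK, OVERRIDE }

  private final Mode mode;
  private final List<String> defaultReaders;

  DefaultAclSketch(Mode mode, List<String> defaultReaders) {
    this.mode = mode;
    this.defaultReaders = List.copyOf(defaultReaders);
  }

  /** Returns the readers to index with, given the document's own readers (possibly empty). */
  List<String> applyToIfEnabled(List<String> documentReaders) {
    switch (mode) {
      case OVERRIDE:
        return defaultReaders;
      case FALLBACK:
        return documentReaders.isEmpty() ? defaultReaders : documentReaders;
      default:
        return documentReaders; // NONE: leave the document's ACL untouched
    }
  }
}
```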
