This page of the Cloud Search tutorial shows how to set up a data source and content connector for indexing data. To start from the beginning of this tutorial, refer to the Cloud Search getting started tutorial.
Build the connector
Change your working directory to the cloud-search-samples/end-to-end/connector
directory and run this command:
mvn package -DskipTests
The command downloads the dependencies needed to build the content connector and compiles the code.
Create service account credentials
The connector requires service account credentials to call the Cloud Search APIs. To create the credentials:
- Return to the Google Cloud console.
- In the left navigation, click Credentials. The "Credentials" page appears.
- Click the + CREATE CREDENTIALS drop-down list and select Service account. The "Create service account" page appears.
- In the Service account name field, enter "tutorial".
- Note the Service account ID value (right after the Service account name). This value is used later.
- Click CREATE. The "Service account permissions (optional)" dialog appears.
- Click CONTINUE. The "Grant users access to this service account (optional)" dialog appears.
- Click DONE. The "Credentials" screen appears.
- Under Service Accounts, click the service account email. The "Service account details" page appears.
- Under Keys, click the ADD KEY drop-down list and select Create new key. The "Create private key" dialog appears.
- Click CREATE.
- (Optional) If the "Do you want to allow downloads on console.cloud.google.com?" dialog appears, click Allow.
- A private key file is saved to your computer. Note the location of the downloaded file. This file is used to configure the content connector so it can authenticate itself when calling the Google Cloud Search APIs.
Initialize third-party support
Initialize third-party support for Google Cloud Search before you call any other Cloud Search APIs.
To initialize third-party support:
- Create web application credentials in your Cloud Search platform project. See Create credentials. You need the client ID and client secret.
- Obtain an access token using the OAuth 2.0 Playground:
  - Click OAuth 2.0 Configuration (settings icon) and check Use your own OAuth credentials.
  - Enter your client ID and client secret.
  - In the scopes field, enter https://www.googleapis.com/auth/cloud_search.settings and click Authorize APIs.
  - Click Exchange authorization code for tokens.
- Run this curl command, replacing [YOUR_ACCESS_TOKEN] with your token:
  curl --request POST \
    'https://cloudsearch.googleapis.com/v1:initializeCustomer' \
    --header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
    --header 'Accept: application/json' \
    --header 'Content-Type: application/json' \
    --data '{}' \
    --compressed
  If successful, the response body includes an operation. If it fails, contact Cloud Search support.
- Use operations.get to verify initialization:
  curl 'https://cloudsearch.googleapis.com/v1/operations/[OPERATION_NAME]?key=[YOUR_API_KEY]' \
    --header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
    --header 'Accept: application/json' \
    --compressed
  Initialization is complete when done is true.
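The verification step can also be scripted. The following Java sketch shows one possible way to poll until the operation reports that it is done; it is not part of the tutorial's sample code. The class OperationPoller, its method names, and the attempt and delay parameters are hypothetical. The regular-expression check is a deliberate simplification that a real client would replace with a JSON parser, and the fetchOperation supplier stands in for whatever code issues the operations.get request.

```java
import java.util.function.Supplier;
import java.util.regex.Pattern;

/** Hypothetical helper for waiting on a long-running operation; not part of the sample. */
class OperationPoller {
  // Simplified check for a top-level "done": true field (assumption: no JSON library available).
  private static final Pattern DONE_TRUE = Pattern.compile("\"done\"\\s*:\\s*true");

  /** Returns true if the operation JSON text reports done as true. */
  static boolean isDone(String operationJson) {
    return DONE_TRUE.matcher(operationJson).find();
  }

  /**
   * Calls fetchOperation up to maxAttempts times, sleeping delayMillis between attempts.
   * Returns true as soon as a response reports done, or false if none does.
   */
  static boolean waitUntilDone(Supplier<String> fetchOperation, int maxAttempts, long delayMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (isDone(fetchOperation.get())) {
        return true;
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
          return false;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Canned response used only for demonstration; a real supplier would call operations.get.
    boolean done = waitUntilDone(() -> "{\"done\": true}", 3, 1000L);
    System.out.println("Initialization complete: " + done);
  }
}
```

Passing the request as a supplier keeps the retry logic separate from the HTTP call, so the loop can be exercised without network access.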
Create the data source
Next, create a data source in the admin console. The data source provides a namespace for indexing content using the connector.
- Open the Google Admin console.
- Click the Apps icon. The "Apps administration" page appears.
- Click Google Workspace. The "Apps Google Workspace administration" page appears.
- Scroll down and click Cloud Search. The "Settings for Google Workspace" page appears.
- Click Third-party data sources. The "Data Sources" page appears.
- Click the round yellow +. The "Add new data source" dialog appears.
- In the Display name field, type "tutorial".
- In the Service account email addresses field, enter the email address of the service account you created in the previous section. If you do not know the email address of the service account, look up the value in the service accounts page.
- Click ADD. The "Successfully created data source" dialog appears.
- Click OK. Note the Source ID for the newly created data source. The Source ID is used to configure the content connector.
Generate a personal access token for the GitHub API
The connector requires authenticated access to the GitHub API to have sufficient quota. For simplicity, the connector uses personal access tokens instead of OAuth. Like OAuth, a personal access token lets the connector authenticate as a user with a limited set of permissions.
- Log in to GitHub.
- In the upper-right corner, click on your profile picture. A drop-down menu appears.
- Click Settings.
- Click Developer settings.
- Click Personal access tokens.
- Click Generate personal access token.
- In the Note field, enter "Cloud Search tutorial".
- Check the public_repo scope.
- Click Generate token.
- Note the generated token. It is used by the connector to call the GitHub APIs and provides API quota to perform the indexing.
Configure the connector
After creating the credentials and data source, update the connector configuration to include these values:
- From the command line, change directory to cloud-search-samples/end-to-end/connector/.
- Open the sample-config.properties file with a text editor.
- Set the api.serviceAccountPrivateKeyFile parameter to the file path of the service credentials you previously downloaded.
- Set the api.sourceId parameter to the ID of the data source you previously created.
- Set the github.user parameter to your GitHub username.
- Set the github.token parameter to the access token you previously created.
- Save the file.
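Before running the connector, it can help to confirm that all four parameters are set. The following Java sketch is a hypothetical check, not part of the sample connector: the class ConfigCheck and its method are invented for illustration, and every value in the example text is a placeholder. It uses only java.util.Properties to parse the file format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper (not part of the sample) that checks a connector configuration. */
class ConfigCheck {
  // The four parameters this tutorial asks you to set.
  static final List<String> REQUIRED_KEYS = List.of(
      "api.serviceAccountPrivateKeyFile",
      "api.sourceId",
      "github.user",
      "github.token");

  /** Returns the required keys that are missing or blank in the given properties text. */
  static List<String> missingKeys(String propertiesText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    List<String> missing = new ArrayList<>();
    for (String key : REQUIRED_KEYS) {
      String value = props.getProperty(key);
      if (value == null || value.trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Placeholder values only; substitute your own key path, source ID, user, and token.
    String example = String.join("\n",
        "api.serviceAccountPrivateKeyFile=/path/to/service-account-key.json",
        "api.sourceId=YOUR_SOURCE_ID",
        "github.user=your-github-username",
        "github.token=");
    // prints: Missing required settings: [github.token]
    System.out.println("Missing required settings: " + missingKeys(example));
  }
}
```

To check a real file, read its contents into a string (for example with java.nio.file.Files.readString) and pass that string to missingKeys.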
Update the schema
The connector indexes both structured and unstructured content. Before indexing data, you must update the schema for the data source. Run the following command to update the schema:
mvn exec:java -Dexec.mainClass=com.google.cloudsearch.tutorial.SchemaTool \
-Dexec.args="-Dconfig=sample-config.properties"
Run the connector
To run the connector and begin indexing, run the command:
mvn exec:java -Dexec.mainClass=com.google.cloudsearch.tutorial.GithubConnector \
-Dexec.args="-Dconfig=sample-config.properties"
The default configuration for the connector is to index a single repository
in the googleworkspace organization. Indexing the repository takes about 1 minute.
After initial indexing, the connector continues to poll for changes to the
repository that need to be reflected in the Cloud Search index.
Reviewing the code
The remaining sections examine how the connector is built.
Starting the application
The entry point to the connector is the GithubConnector class. The
main method instantiates the SDK's IndexingApplication
and starts it.
The ListingConnector
provided by the SDK implements a traversal strategy
that leverages Cloud Search queues
for tracking the state of items in the index. It delegates to GithubRepository,
implemented by the sample connector, for accessing content from GitHub.
Traversing the GitHub repositories
During full traversals, the getIds()
method is called to push items that may need to be indexed into the queue.
The connector can index multiple repositories or organizations. To minimize the
impact of a failure, one GitHub repository is traversed at a time. A checkpoint
containing the list of repositories to be indexed in subsequent calls to
getIds() is returned with the results of the traversal. If an error
occurs, indexing is resumed at the current repository instead of starting
from the beginning.
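The checkpointing idea can be illustrated independently of the SDK. The following Java sketch is an assumption-laden illustration, not the sample's getIds() implementation: the CheckpointedTraversal, Step, and RepoLister names are hypothetical, items are plain strings, and the checkpoint is simply the list of repositories not yet traversed.

```java
import java.util.List;

/** Hypothetical illustration of checkpointed traversal; not the sample's actual classes. */
class CheckpointedTraversal {
  /** Result of one traversal step: the items found plus the repositories still to visit. */
  static final class Step {
    final List<String> items;
    final List<String> remainingRepos; // serves as the checkpoint for the next call

    Step(List<String> items, List<String> remainingRepos) {
      this.items = List.copyOf(items);
      this.remainingRepos = List.copyOf(remainingRepos);
    }

    boolean hasMore() {
      return !remainingRepos.isEmpty();
    }
  }

  /** Stand-in for the code that lists one repository's items; may throw on failure. */
  interface RepoLister {
    List<String> list(String repo);
  }

  /** Traverses only the first repository in the checkpoint and returns the rest as the new checkpoint. */
  static Step traverseNext(List<String> checkpoint, RepoLister lister) {
    if (checkpoint.isEmpty()) {
      return new Step(List.of(), List.of());
    }
    String current = checkpoint.get(0);
    // If list() throws, no new checkpoint is produced, so a retry with the
    // same checkpoint resumes at 'current' rather than at the first repository.
    List<String> items = lister.list(current);
    return new Step(items, checkpoint.subList(1, checkpoint.size()));
  }

  public static void main(String[] args) {
    // Repository names and item paths here are made up for the demonstration.
    List<String> checkpoint = List.of("example-org/repo-a", "example-org/repo-b");
    RepoLister lister = repo -> List.of(repo + "/README.md");
    while (!checkpoint.isEmpty()) {
      Step step = traverseNext(checkpoint, lister);
      System.out.println("Queued " + step.items);
      checkpoint = step.remainingRepos;
    }
  }
}
```

Because the checkpoint only advances after a repository is listed successfully, a failure costs at most one repository's worth of repeated work.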
The method collectRepositoryItems() handles the traversal of a single
GitHub repo. This method returns a collection of ApiOperations
representing the items to be pushed into the queue. Items are pushed as a
resource name and a hash value representing the current state of the item.
The hash value is used in subsequent traversals of the GitHub repositories. This value provides a lightweight check to determine if the content has changed without having to upload additional content. The connector blindly queues all items. If the item is new or the hash value has changed, it is made available for polling in the queue. Otherwise the item is considered unmodified.
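The effect of the hash check can be sketched with an in-memory stand-in for the queue. This is an illustration only: in the real system the comparison is performed by Cloud Search when items are pushed, and the HashQueueSketch class, its methods, and the choice of SHA-256 are assumptions made for this example rather than details taken from the sample connector.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for hash-based change detection in a queue. */
class HashQueueSketch {
  // Last hash pushed for each resource name.
  private final Map<String, String> lastSeenHash = new HashMap<>();

  /** Computes a hex-encoded SHA-256 digest of the content (hash choice is an assumption). */
  static String contentHash(String content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] bytes = digest.digest(content.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : bytes) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /**
   * Records the pushed hash and returns true if the item should become available
   * for polling, meaning it is new or its hash differs from the previous push.
   */
  boolean push(String resourceName, String hash) {
    String previous = lastSeenHash.put(resourceName, hash);
    return previous == null || !previous.equals(hash);
  }

  public static void main(String[] args) {
    HashQueueSketch queue = new HashQueueSketch();
    // The resource name and content strings are made-up demonstration values.
    System.out.println(queue.push("example/issue-1", contentHash("open")));   // new item
    System.out.println(queue.push("example/issue-1", contentHash("open")));   // unchanged
    System.out.println(queue.push("example/issue-1", contentHash("closed"))); // changed
  }
}
```

The caller pushes every item on every traversal, as the connector does, and only the name and hash travel with each push; the content itself is fetched later, and only for items that turn out to be new or modified.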
Processing the queue
After the full traversal completes, the connector begins polling the
queue for items that need to be indexed. The getDoc()
method is called for each item pulled from the queue. The method reads
the item from GitHub and converts it into the proper representation
for indexing.
As the connector is running against live data that may be changed at any
time, getDoc() also verifies that the item in the queue is still valid
and deletes any items from the index that no longer exist.
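The validity check can be pictured as a lookup that produces either an index operation or a delete operation. The following Java sketch is a simplified illustration under stated assumptions: GetDocSketch, Action, and Operation are hypothetical names, a map stands in for live GitHub data, and the real getDoc() works with the SDK's own item and operation types instead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of verifying a queued item against live data. */
class GetDocSketch {
  enum Action { INDEX, DELETE }

  /** The decision for one queued item: index its current content or delete it. */
  static final class Operation {
    final Action action;
    final String name;
    final String content; // null for DELETE operations

    Operation(Action action, String name, String content) {
      this.action = action;
      this.name = name;
      this.content = content;
    }
  }

  /**
   * Looks up the queued item in the live source. If it no longer exists, the
   * result is a DELETE so that the stale entry is removed from the index.
   */
  static Operation getDoc(String queuedName, Map<String, String> liveSource) {
    String content = liveSource.get(queuedName);
    if (content == null) {
      return new Operation(Action.DELETE, queuedName, null);
    }
    return new Operation(Action.INDEX, queuedName, content);
  }

  public static void main(String[] args) {
    // Made-up item names; the second item was removed after it was queued.
    Map<String, String> live = new HashMap<>();
    live.put("example/issue-1", "Fix typo in README");
    for (String name : new String[] {"example/issue-1", "example/issue-2"}) {
      Operation op = getDoc(name, live);
      System.out.println(op.action + " " + op.name);
    }
  }
}
```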
For each of the GitHub objects the connector indexes, the corresponding
indexItem() method handles building the item representation for
Cloud Search, for example the representation for content items.
Next, deploy the search interface.