Deploy a CSV Connector

This guide is intended for Google Cloud Search CSV (comma-separated values) connector administrators, that is, anyone who is responsible for downloading, configuring, running, and monitoring the connector.

This guide includes instructions for performing key tasks related to CSV connector deployment:

  • Download the Google Cloud Search CSV connector software
  • Configure the connector for use with a specific CSV data source
  • Deploy and run the connector

To understand the concepts in this document, you should be familiar with the fundamentals of G Suite, CSV files, and Access Control Lists (ACLs).

Overview of the Google Cloud Search CSV connector

The Cloud Search CSV connector works with any comma-separated values (CSV) file, that is, a delimited text file that uses a comma to separate values. A CSV file stores tabular data, and each line of the file is a data record.

Google Cloud Search's CSV Connector extracts individual rows from a CSV file and indexes them into Cloud Search via Cloud Search's Indexing API. Once successfully indexed, individual rows from CSV files are searchable through Cloud Search's clients or Cloud Search's Query API. The CSV connector also supports controlling users' access to content in the search results, by using ACLs.

The Google Cloud Search CSV connector can be installed on Linux or Windows. Before you deploy the connector, ensure that you have the following required components:

  • Java JRE 1.8 installed on the computer that runs the Google Cloud Search CSV connector
  • G Suite information required to establish relationships between Google Cloud Search and the data source:

    Typically, the G Suite administrator for the domain can supply these credentials for you.

Deployment steps

To deploy the Google Cloud Search CSV connector, follow these steps:

  1. Install the Google Cloud Search CSV connector software
  2. Specify the CSV connector configuration
  3. Configure access to the Google Cloud Search data source
  4. Configure CSV file access
  5. Specify column names to index, unique key columns, and datetime columns
  6. Specify columns to use in clickable search result URLs
  7. Specify metadata information, column formats, search quality
  8. Schedule data traversal
  9. Specify Access Control List (ACL) options

1. Install the Google Cloud Search CSV connector

Google provides the installation software for the connector in the following file:

google-cloudsearch-csv-connector-v1-0.0.2.zip

Download and extract the CSV connector software, and save it to a local working directory where the connector runs. This directory can also contain the other files required for execution, such as the configuration file and the service account key file.

2. Specify the CSV connector configuration

As the connector administrator, you control the CSV connector's behavior and attributes by defining parameters in the connector's configuration file. Configurable parameters include:

  • Access to a data source
  • Location of the CSV file
  • CSV column definitions
  • Column(s) that define a unique id
  • Traversal options
  • ACL options to restrict data access

For the connector to properly access a CSV file and index the relevant content, you must first create its configuration file.

To create a configuration file:

  1. Open a text editor of your choice and create the configuration file.
    Add key=value pairs to the file contents as described in the following sections.
  2. Save and name the configuration file.
    Google recommends that you name the configuration file connector-config.properties so no additional command line parameters are required to run the connector.

Because you can specify the configuration file path on the command line, a standard file location is not necessary. However, keep the configuration file in the same directory as the connector to simplify tracking and running the connector.

If you use a different file name or location, specify the configuration file path on the command line. Otherwise, the connector looks for a file named connector-config.properties in your local directory. For information about specifying the configuration path on the command line, see Run the CSV connector.
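As a rough illustration of the key=value format, the following Python sketch reads such a file into a dictionary. The load_properties helper is hypothetical and is not part of the connector; it assumes simple key=value lines and # comments only, not the full Java properties syntax.

```python
import io


def load_properties(stream):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that are not key=value pairs
        props[key.strip()] = value.strip()
    return props


# An in-memory stand-in for connector-config.properties.
sample = io.StringIO(
    "# data source access\n"
    "api.sourceId=1234567890abcdef\n"
    "csv.filePath=./movie_content.csv\n"
)
config = load_properties(sample)
print(config["csv.filePath"])
```

A check like this can be handy for reviewing a configuration file before starting the connector, but the connector itself remains the authority on how the file is parsed.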

3. Configure access to the Google Cloud Search data source

The first parameters every configuration file must specify are the ones necessary to access the Cloud Search data source, as shown in the following table. Typically, you need the data source ID, the service account ID, and the path to the service account's private key file to configure the connector's access to Cloud Search. The steps required to set up a data source are described in Manage third-party data sources.

Setting Parameter
Data source ID api.sourceId=1234567890abcdef

Required. The Google Cloud Search source ID set up by the G Suite administrator, as described in Manage third-party data sources.

Path to the service account private key file api.serviceAccountPrivateKeyFile=./PrivateKey.json

Required. The Google Cloud Search service account key file that the CSV connector uses to access Cloud Search.

Identity source ID api.identitySourceId=x0987654321

Required if using external users and groups. The Cloud Search identity source ID set up by the G Suite administrator.
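The required access parameters above can be checked before the connector is started. The following Python sketch is a hypothetical pre-flight check, not connector functionality; the function name and the uses_external_identities flag are assumptions made for illustration.

```python
# Keys that the table above marks as always required.
REQUIRED_KEYS = ("api.sourceId", "api.serviceAccountPrivateKeyFile")


def missing_access_keys(config, uses_external_identities=False):
    """Return the required access keys that are absent or empty in config."""
    required = list(REQUIRED_KEYS)
    if uses_external_identities:
        # Only required when external users and groups are used.
        required.append("api.identitySourceId")
    return [key for key in required if not config.get(key)]


example = {"api.sourceId": "1234567890abcdef"}
print(missing_access_keys(example))
```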

4. Configure CSV file access

Before the connector can traverse a CSV file and extract data from it for indexing, you must identify the path to the file. Use the following parameter to add access information to the configuration file.

Setting Parameter
Path to the CSV file csv.filePath=./movie_content.csv

Required. The path to the CSV file from which the connector extracts content for indexing.

5. Specify column names to index, unique key columns, and datetime columns

The benefit of indexing CSV files into Cloud Search is that they can be made searchable. For the connector to access and index CSV files, you must provide information about column definitions in the configuration file. If the configuration file does not contain the parameters that specify the column names to index, unique key columns, and datetime columns, default values are used. The following table shows these parameters.

Setting Parameter
Columns to index csv.csvColumns=movieId,movieTitle,description,actors,releaseDate,year,userratings...

The column names to be indexed from the CSV file. If csv.csvColumns is not set, then the first row of the CSV file is used as the header. If csv.csvColumns is set, then it takes precedence over the first row of the CSV. If you have set csv.csvColumns and the first row of the CSV file is a list of column names, then you need to set csv.skipHeaderRecord=true to avoid trying to index the first row as data. Default values are the columns in the header row in the file.

Unique key columns csv.uniqueKeyColumns=movieId

The CSV column(s) whose values are used to generate each record's unique ID. If not specified, the hash of the CSV record is used as its unique key. Default value is the record's hashcode.

Datetime columns csv.dateTimeColumns=releaseDate

The column names in the CSV file that have a datetime value. Default value is an empty list.
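The interaction between csv.csvColumns, csv.skipHeaderRecord, and csv.uniqueKeyColumns can be sketched in Python as follows. This is an illustration under stated assumptions, not the connector's implementation: the function names are hypothetical, and the "|" separator and SHA-256 hash are stand-ins, because this guide does not describe how the connector actually builds the ID.

```python
import csv
import hashlib
import io


def read_records(text, csv_columns=None, skip_header_record=False):
    """Return rows as dicts, applying the header rules described above."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if csv_columns:
        # Configured column names take precedence over the first row.
        header = list(csv_columns)
        data = rows[1:] if skip_header_record else rows
    else:
        # Without configured columns, the first row is used as the header.
        header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def record_id(record, unique_key_columns=()):
    """Build an ID from the key columns, or hash the record if none are set."""
    if unique_key_columns:
        return "|".join(record[column] for column in unique_key_columns)
    # Assumption: any stable hash of the record serves for illustration.
    canonical = repr(sorted(record.items())).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


SAMPLE = "movieId,movieTitle\nm1,Gone with the Wind\n"
for record in read_records(SAMPLE):
    print(record_id(record, ["movieId"]), record["movieTitle"])
```

Note that calling read_records with configured columns but without skip_header_record treats the header line as a data row, which is the situation the csv.skipHeaderRecord setting is meant to prevent.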

6. Specify columns to use in clickable search result URLs

When a user searches using Google Cloud Search, it responds by showing a results page that includes clickable URLs for each result. To enable this feature, you must add the parameter shown in the following table to the configuration file.

Setting Parameter
Search result URL format url.format=https://mymoviesite.com/movies/{0}

Required. The format used to construct the view URL for CSV content.

Search results URL parameters url.columns=movieId

Required. The CSV column names whose values are used to generate the record's view URL.

Search results URL parameters to escape url.columnsToEscape=movieId

Optional. The CSV column names whose values are URL escaped to generate a valid view URL.
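To show how the three URL parameters fit together, here is a minimal Python sketch. It assumes that the placeholders in url.format are positional, with {0} referring to the first column listed in url.columns, and it uses percent-encoding from the Python standard library as a stand-in for whatever escaping the connector applies. The view_url function is hypothetical.

```python
from urllib.parse import quote


def view_url(url_format, record, url_columns, columns_to_escape=()):
    """Fill the positional placeholders in url_format from the record."""
    values = []
    for column in url_columns:
        value = record[column]
        if column in columns_to_escape:
            # Percent-encode every reserved character, including '/'.
            value = quote(value, safe="")
        values.append(value)
    return url_format.format(*values)


record = {"movieId": "m 1/a", "movieTitle": "Gone with the Wind"}
print(view_url("https://mymoviesite.com/movies/{0}", record, ["movieId"], ["movieId"]))
```

Escaping matters when key values contain spaces or slashes, because such characters would otherwise change the structure of the resulting URL.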

7. Specify metadata information, column formats, search quality

You can add parameters to the configuration file that specify:

  • Metadata configuration parameters
  • Datetime formats
  • Column formats
  • Search quality

Metadata Configuration Parameters

Metadata configuration parameters describe the CSV columns used for populating item metadata. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.

Setting Parameter
Title itemMetadata.title.field=movieTitle
itemMetadata.title.defaultValue=Gone with the Wind

The metadata attribute that contains the value corresponding to the document title. The default value is an empty string.

Created timestamp itemMetadata.createTime.field=releaseDate
itemMetadata.createTime.defaultValue=1940-01-17

The metadata attribute that contains the value for the document creation timestamp.

Last modified time itemMetadata.updateTime.field=releaseDate
itemMetadata.updateTime.defaultValue=1940-01-17

The metadata attribute that contains the value for the last modification timestamp for the document.

Document language itemMetadata.contentLanguage.field=languageCode
itemMetadata.contentLanguage.defaultValue=en-US

The content language for documents being indexed.

Schema object type itemMetadata.objectType=movie

The object type used by the connector, as defined in Create and register a schema. The connector won't index any structured data if this property is not specified.

Note: This configuration property points to a value rather than a metadata attribute. The .field and .defaultValue suffixes are not supported.

Datetime formats

Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.

Setting Parameter
Additional datetime formats structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX
A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported.
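The idea of trying a semicolon-separated list of patterns in order can be sketched in Python. This is only an analogy: the connector expects java.time.format.DateTimeFormatter patterns, whereas the sketch uses Python strptime directives, so "%m/%d/%Y %H:%M:%S%z" here stands in for MM/dd/uuuu HH:mm:ssXXX. The parse_datetime function and the PATTERNS value are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Python strptime equivalents, NOT the Java patterns the connector expects.
PATTERNS = "%m/%d/%Y %H:%M:%S%z;%Y-%m-%d"


def parse_datetime(value, patterns):
    """Try each semicolon-separated pattern in order; return the first match."""
    for pattern in filter(None, (p.strip() for p in patterns.split(";"))):
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue  # this pattern did not match; try the next one
    raise ValueError(f"no pattern matched {value!r}")


parsed = parse_datetime("01/17/1940 08:00:00+00:00", PATTERNS)
print(parsed.astimezone(timezone.utc).isoformat())
```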

Column formats

Column formats specify information about the column(s) that should be a part of the searchable content. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.

Setting Parameter
Skip header csv.skipHeaderRecord=true

Boolean. Ignore the header record (first line) in the CSV file. If you have set csv.csvColumns and the CSV file has a header row, then you must set csv.skipHeaderRecord=true. This prevents indexing the first row in the file as data. If the CSV file does not have a header row, set csv.skipHeaderRecord=false. The default value is false.

Multi-value columns csv.multiValueColumns=genre,actors

The column names in the CSV file that have multiple values. The default value is an empty string.

Delimiter for multi-value columns csv.multiValue.genre=;

The delimiter for the multi-value columns. The default delimiter is a comma.
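The multi-value settings can be illustrated with a short Python sketch. It assumes that each column listed in csv.multiValueColumns is split on its own csv.multiValue.<column> delimiter, falling back to a comma, and that surrounding whitespace is trimmed; the trimming and the split_multi_values helper are assumptions, not documented connector behavior.

```python
def split_multi_values(record, multi_value_columns, delimiters=None):
    """Return a copy of record with multi-value columns split into lists."""
    delimiters = delimiters or {}
    result = dict(record)
    for column in multi_value_columns:
        # The default delimiter is a comma unless one is set for the column.
        delimiter = delimiters.get(column, ",")
        parts = (part.strip() for part in record[column].split(delimiter))
        result[column] = [part for part in parts if part]
    return result


record = {
    "movieTitle": "Heat",
    "genre": "Crime;Drama",
    "actors": "Al Pacino, Robert De Niro",
}
# Mirrors csv.multiValueColumns=genre,actors and csv.multiValue.genre=;
print(split_multi_values(record, ["genre", "actors"], {"genre": ";"}))
```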

Search quality

The Cloud Search CSV connector allows automatic HTML formatting for data fields. Your connector defines the data fields at the beginning of connector execution, and then uses a content template to format each data record before uploading it to Cloud Search.

The content template defines the importance of each field value for searching. The title field is required and is defined as the highest priority. You can designate search quality importance levels for all the other content fields: high, medium or low. Any content field not defined in a specific category defaults to low priority. The following table shows these parameters.

Setting Parameter
Content title contentTemplate.csv.title=movieTitle

The content title is the highest search quality field.

High search quality for content fields contentTemplate.csv.quality.high=actors

Content fields given a high search quality value. The default is an empty string.

Low search quality for content fields contentTemplate.csv.quality.low=

Content fields given a low search quality value. The default is an empty string.

Medium search quality for content fields contentTemplate.csv.quality.medium=description

Content fields given a medium search quality value. The default is an empty string.

Unspecified content fields contentTemplate.csv.unmappedColumnsMode=IGNORE

How the connector handles unspecified content fields. Valid values are:

  • APPEND—append unspecified content fields to the template
  • IGNORE—ignore unspecified content fields

    The default value is APPEND.
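The way the search quality parameters order the content fields can be sketched as follows. This Python example models only the ordering and the APPEND/IGNORE choice; it does not reproduce the HTML that the connector generates, and the order_content_fields function is hypothetical.

```python
def order_content_fields(columns, title, high=(), medium=(), low=(),
                         unmapped_mode="APPEND"):
    """Order fields by search quality: title, then high, medium, and low."""
    mapped = [title, *high, *medium, *low]
    ordered = list(mapped)
    if unmapped_mode == "APPEND":
        # Fields without an explicit level are appended after the mapped ones.
        ordered += [column for column in columns if column not in mapped]
    elif unmapped_mode != "IGNORE":
        raise ValueError(f"unsupported unmapped columns mode: {unmapped_mode}")
    return ordered


columns = ["movieId", "movieTitle", "description", "actors"]
print(order_content_fields(columns, "movieTitle", high=["actors"],
                           medium=["description"], unmapped_mode="IGNORE"))
```

With IGNORE, the movieId column in this example is left out of the content; with APPEND, it would follow the explicitly ranked fields.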

8. Schedule data traversal

Traversal is the connector's process for discovering content from the data source, in this case, a CSV file. As the CSV connector runs, it will traverse the rows of a CSV file, and index each row to Cloud Search via the Indexing API.

Full traversal indexes all rows in the file. Incremental traversal indexes only rows that are added or modified since the previous traversal. The CSV connector performs only full traversals; it does not perform incremental traversals.

The scheduling parameters determine how long the connector waits between traversals. If the configuration file does not contain scheduling parameters, default values are used. The following table shows these parameters.

Setting Parameter
Full traversal after an interval schedule.traversalIntervalSecs=7200

The connector performs a full traversal after a specified interval. Specify the interval between traversals in seconds. The default value is 86400 (number of seconds in one day).

Full traversal at connector startup schedule.performTraversalOnStart=false

The connector performs a full traversal at connector startup, rather than waiting for the first interval to expire. The default value is true.
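The effect of the two scheduling parameters on timing can be sketched in Python. The traversal_delays helper is hypothetical; it only computes the delay before the first traversal and the interval between later ones from the documented defaults, and does not model how the connector schedules work internally.

```python
def traversal_delays(config):
    """Return (delay before first traversal, interval between traversals)."""
    interval = int(config.get("schedule.traversalIntervalSecs", "86400"))
    on_start = config.get("schedule.performTraversalOnStart", "true")
    # With traversal on start, the first run does not wait for the interval.
    first_delay = 0 if on_start.strip().lower() == "true" else interval
    return first_delay, interval


print(traversal_delays({}))
print(traversal_delays({
    "schedule.traversalIntervalSecs": "7200",
    "schedule.performTraversalOnStart": "false",
}))
```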

9. Specify Access Control List (ACL) options

The Google Cloud Search CSV connector supports permissions through ACLs to control access to the content of the CSV file in search results. Multiple ACL options are available to restrict user access to indexed records.

If your repository has individual ACL information associated with each document, upload all ACL information to control document access within Cloud Search. If your repository provides partial or no ACL information, you can supply default ACL information in the following parameters, which the SDK provides to the connector.

The connector relies on default ACLs being enabled in the configuration file. To enable default ACLs, set defaultAcl.mode to fallback, the only mode that the CSV connector supports, and configure the remaining defaultAcl.* parameters.

Setting Parameter
ACL mode defaultAcl.mode=fallback

Required. The CSV connector relies on default ACL functionality. The connector supports only fallback mode.

Default ACL Name defaultAcl.name=VIRTUAL_CONTAINER_FOR_CONNECTOR_1

Optional. Overrides the virtual container name that the connector uses to set up default ACLs. The default value is "DEFAULT_ACL_VIRTUAL_CONTAINER". You may want to override this value if multiple connectors index content in the same data source.

Default public ACL defaultAcl.public=true

The default ACL used for the entire repository is set to public domain access. The default value is false.

Common ACL group readers defaultAcl.readers.groups=google:group1, group2

Common ACL readers defaultAcl.readers.users=user1, user2, google:user3

Common ACL denied group readers defaultAcl.denied.groups=group3

Common ACL denied readers defaultAcl.denied.users=user4, user5

Entire domain access

To specify that every indexed record is publicly accessible by every user in the domain, set both of the following options:

  • defaultAcl.mode=fallback
  • defaultAcl.public=true

Common defined ACL

To specify one ACL for each record of the data repository, set all of the following parameter values:

  • defaultAcl.mode=fallback
  • defaultAcl.public=false
  • defaultAcl.readers.groups=google:group1, group2
  • defaultAcl.readers.users=user1, user2, google:user3
  • defaultAcl.denied.groups=group3
  • defaultAcl.denied.users=user4, user5

    Every specified user and group is assumed to be a local domain-defined user or group unless prefixed with "google:" (literal constant).

    The default user or group is an empty string. Supply user and group options only if defaultAcl.public is set to false. To list multiple groups and users, use a comma-delimited list.

    If defaultAcl.mode is set to none, records are unsearchable without defined individual ACLs.
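The comma-delimited user and group lists, including the "google:" prefix, can be read as in the following Python sketch. The parse_principals helper and the "local" and "google" labels are hypothetical names chosen for illustration; they are not connector or SDK identifiers.

```python
GOOGLE_PREFIX = "google:"


def parse_principals(value):
    """Split a comma-delimited list and classify each entry by its prefix."""
    principals = []
    for item in value.split(","):
        name = item.strip()
        if not name:
            continue  # tolerate empty entries and the empty default
        if name.startswith(GOOGLE_PREFIX):
            principals.append(("google", name[len(GOOGLE_PREFIX):]))
        else:
            # No prefix: treated as a local domain-defined user or group.
            principals.append(("local", name))
    return principals


# Mirrors defaultAcl.readers.users=user1, user2, google:user3
print(parse_principals("user1, user2, google:user3"))
```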

Schema Definition

Cloud Search allows indexing and serving of structured and unstructured content. To support structured data queries on your data, you need to set up a schema for your data source.

Once the schema is defined, the CSV connector can refer to it to build indexing requests. As an illustrative example, consider a CSV file containing information about movies.

Assume that the input CSV file has the following columns:

  1. movieId
  2. movieTitle
  3. description
  4. year
  5. releaseDate
  6. actors (multiple values separated by a comma (,))
  7. genre (multiple values separated by a semicolon (;))
  8. ratings

Based on this data structure, you can define a schema for the data source under which you want to index data from the CSV file.

{
  "objectDefinitions": [
    {
      "name": "movie",
      "propertyDefinitions": [
        {
          "name": "actors",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "textPropertyOptions": {
            "operatorOptions": {
              "operatorName": "actor"
            }
          }
        },
        {
          "name": "releaseDate",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "datePropertyOptions": {
            "operatorOptions": {
              "operatorName": "released",
              "lessThanOperatorName": "releasedbefore",
              "greaterThanOperatorName": "releasedafter"
            }
          }
        },
        {
          "name": "movieTitle",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "textPropertyOptions": {
            "retrievalImportance": "HIGHEST",
            "operatorOptions": {
              "operatorName": "title"
            }
          }
        },
        {
          "name": "genre",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "enumPropertyOptions": {
            "operatorOptions": {
              "operatorName": "genre"
            },
            "possibleValues": [
              {
                "stringValue": "Action"
              },
              {
                "stringValue": "Documentary"
              },
              {
                "stringValue": "Drama"
              },
              {
                "stringValue": "Crime"
              },
              {
                "stringValue": "Sci-fi"
              }
            ]
          }
        },
        {
          "name": "userRating",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": true,
          "integerPropertyOptions": {
            "orderedRanking": "ASCENDING",
            "maximumValue": "10",
            "operatorOptions": {
              "operatorName": "score",
              "lessThanOperatorName": "scorebelow",
              "greaterThanOperatorName": "scoreabove"
            }
          }
        }
      ]
    }
  ]
}
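One way to cross-check a schema like this against the connector configuration is to list the properties marked isRepeatable, since those are the natural candidates for csv.multiValueColumns. The Python sketch below uses a trimmed copy of the schema above; the repeatable_properties function is a hypothetical helper and is not part of Cloud Search or the connector.

```python
import json

# A trimmed copy of the schema above, keeping only the fields used here.
SCHEMA_JSON = """
{
  "objectDefinitions": [
    {
      "name": "movie",
      "propertyDefinitions": [
        {"name": "actors", "isRepeatable": true},
        {"name": "movieTitle", "isRepeatable": false},
        {"name": "genre", "isRepeatable": true}
      ]
    }
  ]
}
"""


def repeatable_properties(schema_json, object_name):
    """Return the sorted names of repeatable properties for one object type."""
    schema = json.loads(schema_json)
    for definition in schema["objectDefinitions"]:
        if definition["name"] == object_name:
            return sorted(
                prop["name"]
                for prop in definition["propertyDefinitions"]
                if prop.get("isRepeatable")
            )
    raise KeyError(object_name)


print(repeatable_properties(SCHEMA_JSON, "movie"))
```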

Example: Configuration file

The following example configuration file shows the parameter key=value pairs that define an example connector's behavior.

# data source access
api.sourceId=1234567890abcd
api.serviceAccountPrivateKeyFile=./PrivateKey.json

# CSV data structure
csv.filePath=./movie_content.csv
csv.csvColumns=movieId,movieTitle,description,releaseYear,genre,actors,ratings,releaseDate
csv.skipHeaderRecord=true
url.format=https://mymoviesite.com/movies/{0}
url.columns=movieId
csv.datetimeFormat.releaseDate=yyyy-MM-dd
csv.multiValueColumns=genre,actors
csv.multiValue.genre=;
contentTemplate.csv.title=movieTitle

# metadata structured data and content
itemMetadata.title.field=movieTitle
itemMetadata.createTime.field=releaseDate
itemMetadata.contentLanguage.defaultValue=en-US
itemMetadata.objectType=movie
contentTemplate.csv.quality.medium=description
contentTemplate.csv.unmappedColumnsMode=IGNORE

#ACLs
defaultAcl.mode=fallback
defaultAcl.public=true

For detailed descriptions of each parameter, see the Configuration parameters reference.

Run the Cloud Search CSV connector

After you install and set up the Cloud Search CSV connector, run it from the command line by typing the following command:

java -Djava.util.logging.config.file=logging.properties -cp google-cloudsearch-csv-connector-v1-0.0.1-withlib.jar com.google.enterprise.cloudsearch.csvconnector.CSVConnector

You can optionally specify the path to the configuration file with -Dconfig=path_to_configfile if the connector uses a configuration file other than connector-config.properties.

By default, connector logs are available on standard output. You can log to files by specifying logging.properties.
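If you launch the connector from a script, the command line described in this section can be assembled as in the following Python sketch. The connector_command helper is hypothetical; it assumes that the -Dconfig option, like other JVM system properties, is placed before the -cp option and the main class. The sketch only builds the argument list and does not start the connector.

```python
MAIN_CLASS = "com.google.enterprise.cloudsearch.csvconnector.CSVConnector"


def connector_command(jar_path, config_path=None,
                      logging_config="logging.properties"):
    """Build the java argument list used to run the CSV connector."""
    command = ["java", f"-Djava.util.logging.config.file={logging_config}"]
    if config_path:
        # Only needed when the file is not connector-config.properties.
        command.append(f"-Dconfig={config_path}")
    command += ["-cp", jar_path, MAIN_CLASS]
    return command


print(" ".join(connector_command(
    "google-cloudsearch-csv-connector-v1-0.0.1-withlib.jar",
    config_path="./movies.properties",
)))
```

The resulting list can be passed to a process launcher such as subprocess.run; the ./movies.properties path is an example value only.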
