Deploy a Norconex HTTP Collector Indexer Plugin

This guide is intended for Google Cloud Search Norconex HTTP Collector indexer plugin administrators, that is, anyone who is responsible for downloading, deploying, configuring, and maintaining the indexer plugin. The guide assumes that you are familiar with Linux operating systems, the fundamentals of web crawling, XML, and Norconex HTTP Collector.

This guide includes instructions for performing key tasks related to indexer plugin deployment:

  • Download the indexer plugin software
  • Configure Google Cloud Search
  • Configure Norconex HTTP Collector and web crawling
  • Start the web crawl and upload content

Information about the tasks that the Google Workspace administrator must perform to map Google Cloud Search to the Norconex HTTP Collector indexer plugin does not appear in this guide. For information on those tasks, see Manage third-party data sources.

Overview of the Cloud Search Norconex HTTP Collector indexer plugin

By default, Cloud Search can discover, index, and serve content from Google Workspace products, such as Google Docs and Gmail. You can extend the reach of Google Cloud Search to include serving web content to your users by deploying the indexer plugin for Norconex HTTP Collector, an open source enterprise web crawler.

Configuration properties files

To enable the indexer plugin to perform web crawls and upload content to the indexing API, you, as the indexer plugin administrator, provide specific information during the configuration steps described in Deployment steps.

To use the indexer plugin, you must set properties in two configuration files:

  • gcs-crawl-config.xml: contains settings for Norconex HTTP Collector.
  • sdk-configuration.properties: contains settings for Google Cloud Search.

Properties in each file enable the Google Cloud Search indexer plugin and Norconex HTTP Collector to communicate with each other.

Web crawl and content upload

After you have populated the configuration files, you have the necessary settings to start the web crawl. Norconex HTTP Collector crawls the web, discovers document content that matches its configuration, and uploads the original binary (or text) versions of that content to the Cloud Search indexing API, where it is indexed and ultimately served to your users.

Supported operating system

The Google Cloud Search Norconex HTTP Collector indexer plugin must be installed on Linux.

Supported Norconex HTTP Collector version

The Google Cloud Search Norconex HTTP Collector indexer plugin supports version 2.8.0.

ACL support

The indexer plugin supports controlling access to documents in the Google Workspace domain by using Access Control Lists (ACLs).

If default ACLs are enabled in the Google Cloud Search plugin configuration (defaultAcl.mode is set to a value other than none and configured with defaultAcl.*), the indexer plugin first tries to create and apply a default ACL.

If default ACLs are not enabled, the plugin falls back to giving read permission to the entire Google Workspace domain.

For detailed descriptions of ACL configuration parameters, see Google-supplied connector parameters.
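
The decision described above can be sketched as a small shell check against the Cloud Search configuration file. This is an illustration only, not the plugin's implementation: acl_behavior is a hypothetical helper, and treating a missing defaultAcl.mode as none is an assumption.

```shell
# Sketch: report which ACL behavior a properties file would select.
# acl_behavior is a hypothetical helper, not part of the indexer plugin.
acl_behavior() {
  config_file="$1"
  # Take the last defaultAcl.mode assignment and strip surrounding whitespace.
  mode="$(sed -n 's/^[[:space:]]*defaultAcl\.mode[[:space:]]*=[[:space:]]*//p' "$config_file" \
    | tail -n 1 | tr -d '[:space:]')"
  # Assumption: a missing key behaves like "none".
  if [ -z "$mode" ] || [ "$mode" = "none" ]; then
    echo "domain-wide read access"
  else
    echo "default ACL ($mode)"
  fi
}
```

For example, running acl_behavior sdk-configuration.properties prints which of the two behaviors applies to that file.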

Prerequisites

Before you deploy the indexer plugin, ensure that you have the following required components:

  • Java JRE 1.8 installed on the computer that runs the indexer plugin
  • Google Workspace information required to establish relationships between Cloud Search and Norconex HTTP Collector: the Cloud Search data source ID and the service account private key file, both of which are used in Step 2.

    Typically, the Google Workspace administrator for the domain can supply these credentials for you.

Deployment steps

To deploy the indexer plugin, follow these steps:

  1. Install Norconex HTTP Collector and the indexer plugin software
  2. Configure Google Cloud Search
  3. Configure Norconex HTTP Collector
  4. Configure web crawl
  5. Start a web crawl and content upload

Step 1: Install Norconex HTTP Collector and the indexer plugin software

  1. Download the Norconex committer software from this page.
  2. Unzip the downloaded software to the ~/norconex/ folder.
  3. Clone the committer plugin from GitHub, and then change to the plugin directory:
    $ git clone https://github.com/google-cloudsearch/norconex-committer-plugin.git
    $ cd norconex-committer-plugin
  4. Check out the desired version of the committer plugin and build the ZIP file:
    $ git checkout tags/v1-0.0.3
    $ mvn package
    To skip the tests when building the connector, use mvn package -DskipTests.
  5. Change to the target directory:
    $ cd target
  6. Copy the built plugin jar file into the Norconex lib directory:
    $ cp google-cloudsearch-norconex-committer-plugin-v1-0.0.3.jar ~/norconex/norconex-collector-http-{version}/lib
  7. Extract the ZIP file that you just built:
    $ unzip google-cloudsearch-norconex-committer-plugin-v1-0.0.3.zip
  8. Execute the install script to copy the plugin's .jar and all the required libraries into the HTTP Collector's directory:
    1. Change to the committer plugin directory that you extracted in the previous step:
      $ cd google-cloudsearch-norconex-committer-plugin-v1-0.0.3
    2. Run the install script, and provide the full path to norconex/norconex-collector-http-{version}/lib as the target directory when prompted:
      $ sh install.sh
    3. If duplicate jar files are found, select option 1 (Copy source Jar only if greater or same version as target Jar after renaming target Jar).
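
After the install steps, it can be useful to confirm that the plugin jar actually landed in the lib directory. The following is a minimal sketch, assuming the jar keeps the google-cloudsearch-norconex-committer-plugin-*.jar naming used in this step; check_plugin_installed is a hypothetical helper and is not shipped with the plugin.

```shell
# Sketch: report whether the committer plugin jar is present in a lib directory.
# check_plugin_installed is a hypothetical helper, not part of the plugin.
check_plugin_installed() {
  lib_dir="$1"
  # The glob follows the jar name used in Step 1; adjust it for other naming schemes.
  for jar in "$lib_dir"/google-cloudsearch-norconex-committer-plugin-*.jar; do
    if [ -f "$jar" ]; then
      echo "found: $(basename "$jar")"
      return 0
    fi
  done
  echo "plugin jar not found in $lib_dir" >&2
  return 1
}
```

A typical call would pass the same lib path that was given to install.sh, for example check_plugin_installed ~/norconex/norconex-collector-http-{version}/lib with {version} replaced by the installed version.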

Step 2: Configure Google Cloud Search

For the indexer plugin to connect to Norconex HTTP Collector and index the relevant content, you must create the Cloud Search configuration file in the directory where Norconex HTTP Collector is installed. Google recommends that you name the Cloud Search configuration file sdk-configuration.properties.

This configuration file must contain key/value pairs, each of which defines a parameter. The configuration file must specify at least the following parameters, which are necessary to access the Cloud Search data source.

  • Data source ID: api.sourceId = 1234567890abcdef
    Required. The Cloud Search source ID set up by the Google Workspace administrator.
  • Service account: api.serviceAccountPrivateKeyFile = ./PrivateKey.json
    Required. The Cloud Search service account key file that was created by the Google Workspace administrator for indexer plugin accessibility.

The following example shows an sdk-configuration.properties file.

#
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
#

The configuration file can also contain Google-supplied configuration parameters. These parameters can affect how this plugin pushes data into the Google Cloud Search API. For example, the batch.* set of parameters identifies how the connector combines requests.

If you do not define a parameter in the configuration file, the default value, if available, is used. For detailed descriptions of each parameter, see Google-supplied connector parameters.
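
Because the two data source parameters are required, a quick check before the first crawl can catch a missing or commented-out entry. The following is a minimal sketch; check_sdk_config is a hypothetical helper, not part of the Cloud Search SDK, and it only checks that each key has a non-empty value, not that the value is valid.

```shell
# Sketch: confirm that a properties file defines the two required parameters.
# check_sdk_config is a hypothetical helper, not part of the Cloud Search SDK.
check_sdk_config() {
  config_file="$1"
  status=0
  for key in api.sourceId api.serviceAccountPrivateKeyFile; do
    # Escape the dots so that they match literally in the regular expression.
    pattern="$(printf '%s' "$key" | sed 's/\./\\./g')"
    # Accept "key=value" or "key = value" with a non-empty value.
    # Lines that start with "#" do not match.
    if ! grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=[[:space:]]*[^[:space:]]" "$config_file"; then
      echo "missing required parameter: $key" >&2
      status=1
    fi
  done
  return "$status"
}
```

The function returns a non-zero status when either parameter is absent, so it can gate a start script, for example check_sdk_config sdk-configuration.properties followed by the crawl command.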

You can configure the indexer plugin to populate metadata and structured data for content being indexed. Values for metadata and structured data fields can be extracted from meta tags in the HTML content being indexed, or default values can be specified in the configuration file.

  • Title: itemMetadata.title.field=movieTitle
    itemMetadata.title.defaultValue=Gone with the Wind
    By default, the plugin uses the HTML title as the title of the document being indexed. If the title is missing, you can either refer to the metadata attribute that contains the document title or set a default value.
  • Created timestamp: itemMetadata.createTime.field=releaseDate
    itemMetadata.createTime.defaultValue=1940-01-17
    The metadata attribute that contains the value for the document creation timestamp.
  • Last modified time: itemMetadata.updateTime.field=releaseDate
    itemMetadata.updateTime.defaultValue=1940-01-17
    The metadata attribute that contains the value for the last modification timestamp for the document.
  • Document language: itemMetadata.contentLanguage.field=languageCode
    itemMetadata.contentLanguage.defaultValue=en-US
    The content language for documents being indexed.
  • Schema object type: itemMetadata.objectType=movie
    The object type used by the site, as defined in the data source schema object definitions. The connector won't index any structured data if this property is not specified.

    Note: This configuration property points to a value rather than a metadata attribute, and the .field and .defaultValue suffixes are not supported.

Datetime formats

Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The parameter is as follows:

  • Additional datetime patterns: structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX
    A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported.
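
To make the semicolon-separated format concrete, the sketch below prints each configured pattern on its own line. list_datetime_patterns is a hypothetical helper; it only splits the value and does not validate the patterns against java.time.format.DateTimeFormatter.

```shell
# Sketch: print each configured datetime pattern on its own line.
# list_datetime_patterns is a hypothetical helper; it does not validate patterns.
list_datetime_patterns() {
  config_file="$1"
  # Extract the value of structuredData.dateTimePatterns, then split it on ";".
  sed -n 's/^[[:space:]]*structuredData\.dateTimePatterns[[:space:]]*=[[:space:]]*//p' "$config_file" \
    | tr ';' '\n'
}
```

As an illustration, a hypothetical entry such as structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX;uuuu-MM-dd HH:mm:ss would be listed as two separate patterns. The second pattern is made up for this example.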

Step 3: Configure Norconex HTTP Collector

The zip archive norconex-committer-google-cloud-search-{version}.zip includes a sample configuration file, minimum-config.xml.

Google recommends that you begin the configuration by copying the sample file:

  1. Change to the Norconex HTTP Collector directory:
    $ cd ~/norconex/norconex-collector-http-{version}/
  2. Copy the configuration file:
    $ cp examples/minimum/minimum-config.xml gcs-crawl-config.xml
  3. Edit the newly created file (in this example, gcs-crawl-config.xml) and add or replace the existing <committer> and <tagger> nodes as described in the following list.

  • <committer> node: <committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter">
    Required. To enable the plugin, you must add a <committer> node as a child of the root <httpcollector> node.
  • <uploadFormat>: <uploadFormat>raw</uploadFormat>
    Optional. The format in which the indexer plugin pushes document content to the Google Cloud Search indexer API. Valid values are:
      • raw: the indexer plugin pushes original, unconverted document content.
      • text: the indexer plugin pushes extracted textual content.
    The default value is raw.
  • BinaryContentTagger <tagger> node: <tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/>
    Required if the value of <uploadFormat> is raw. In this case, the indexer plugin needs the binary content field of the document to be available.
    You must add the BinaryContentTagger <tagger> node as a child element of the <importer> / <preParseHandlers> node.

The following example shows the required modification to gcs-crawl-config.xml.

<committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter">
    <configFilePath>/full/path/to/sdk-configuration.properties</configFilePath>
    
    <uploadFormat>raw</uploadFormat>
</committer>
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/>
  </preParseHandlers>
</importer>
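
The requirements above (a committer node, plus the BinaryContentTagger when the upload format is raw) can be checked with a rough text search before starting a crawl. This is a minimal sketch under stated assumptions: check_crawl_config is a hypothetical helper, and it matches strings rather than parsing XML, so it does not verify where the nodes sit in the document.

```shell
# Sketch: text-based sanity check of the crawl configuration.
# check_crawl_config is a hypothetical helper; it does not parse the XML.
check_crawl_config() {
  config_file="$1"
  committer='com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter'
  tagger='com.norconex.committer.googlecloudsearch.BinaryContentTagger'
  if ! grep -Fq "$committer" "$config_file"; then
    echo "missing <committer> node for $committer" >&2
    return 1
  fi
  # raw is the default upload format, so the tagger is needed unless text is set.
  if ! grep -Fq '<uploadFormat>text</uploadFormat>' "$config_file" &&
     ! grep -Fq "$tagger" "$config_file"; then
    echo "uploadFormat is raw but BinaryContentTagger is not configured" >&2
    return 1
  fi
  return 0
}
```

For example, check_crawl_config gcs-crawl-config.xml returns a non-zero status and prints a message when one of the two nodes appears to be missing.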

Step 4: Configure web crawl

Before starting a web crawl, you must configure the crawl so that it only includes information that your organization wants to make available in search results. The most important settings for web crawl are part of the <crawler> node(s) and can include:

  • Start URLs
  • Maximum depth of the crawl
  • Number of threads

Change these configuration values according to your needs. For more detailed information about setting up a web crawl, as well as a full list of available configuration parameters, see the HTTP Collector's Configuration page.
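
As a starting point, the sketch below prints a hypothetical <crawler> fragment that sets the three values listed above. The element names (startURLs, maxDepth, numThreads) are given as recalled from the Norconex HTTP Collector 2.x configuration and should be verified against the HTTP Collector's Configuration page; the crawler id, URL, and numeric values are placeholders.

```shell
# Sketch: print a sample <crawler> fragment to adapt and paste into gcs-crawl-config.xml.
# The id, URL, and values are placeholders; verify element names for your version.
print_sample_crawler() {
  cat <<'EOF'
<crawler id="example-crawler">
  <!-- Start URLs: where the crawl begins. stayOnDomain keeps it on the same site. -->
  <startURLs stayOnDomain="true">
    <url>https://www.example.com/</url>
  </startURLs>
  <!-- Maximum depth of the crawl, counted in links from a start URL. -->
  <maxDepth>2</maxDepth>
  <!-- Number of threads that crawl in parallel. -->
  <numThreads>2</numThreads>
</crawler>
EOF
}
```

Starting with a small depth and thread count, then widening once the indexed results look right, keeps early test crawls short.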

Step 5: Start a web crawl and content upload

After you have installed and set up the indexer plugin, you can run it on its own in local mode.

The following example assumes the required components are located in the local directory on a Linux system. Run the following command:

$ ./collector-http[.bat|.sh] -a start -c gcs-crawl-config.xml
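
The start command can be wrapped in a small function that fails early when the launcher or the configuration file is missing. This is a minimal sketch for Linux, assuming the launcher is named collector-http.sh as in the command above; start_crawl is a hypothetical helper.

```shell
# Sketch: start the crawl only if the launcher and the configuration file exist.
# start_crawl is a hypothetical helper, not part of Norconex HTTP Collector.
start_crawl() {
  collector_dir="$1"
  config_file="$2"
  if [ ! -x "$collector_dir/collector-http.sh" ]; then
    echo "collector-http.sh not found or not executable in $collector_dir" >&2
    return 1
  fi
  if [ ! -f "$config_file" ]; then
    echo "crawl configuration not found: $config_file" >&2
    return 1
  fi
  # Same arguments as the command shown above.
  "$collector_dir/collector-http.sh" -a start -c "$config_file"
}
```

From the Norconex HTTP Collector directory, a call such as start_crawl . gcs-crawl-config.xml runs the same command as above after the two checks pass.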

Monitor the crawler with JEF Monitor

Norconex JEF (Job Execution Framework) Monitor is a graphical tool for monitoring the progress of the Norconex Web Crawler (HTTP Collector) processes and jobs. For a complete tutorial on how to set up this utility, visit Monitor your crawler's progress with JEF Monitor.