Deploy an Apache Nutch Indexer Plugin

This guide is intended for Google Cloud Search Apache Nutch Indexer Plugin administrators, that is, anyone who is responsible for downloading, deploying, configuring, and maintaining the indexer plugin. The guide assumes that you are familiar with Linux operating systems, the fundamentals of web crawling, and Apache Nutch.

This guide includes instructions for performing key tasks related to indexer plugin deployment:

  • Download the indexer plugin software
  • Configure Google Cloud Search
  • Configure Apache Nutch and web crawling
  • Start the web crawl and content upload

Information about the tasks that the G Suite administrator must perform to map Google Cloud Search to the Nutch indexer plugin does not appear in this guide. For information on those tasks, see Manage third-party data sources.

Overview of the Google Cloud Search Indexer Plugin for Apache Nutch

By default, Google Cloud Search can discover, index, and serve content from G Suite data such as Google Docs and Gmail. You can extend the reach of Google Cloud Search to include serving web content to your users by deploying the indexer plugin for Apache Nutch, an open source web crawler.

Configuration properties files

To enable the indexer plugin to perform web crawls and upload content to the indexing API, you, as the indexer plugin administrator, provide specific information to the plugin during the configuration steps described in the Deployment steps section of this document.

To use the indexer plugin, you must set properties in two configuration files:

  • nutch-site.xml -- settings for the Apache Nutch web crawler.
  • sdk-configuration.properties -- settings for Google Cloud Search.

Properties in each file enable the Google Cloud Search indexer plugin and Apache Nutch to communicate with each other.

Web crawl and content upload

After you have populated the configuration files, you have the necessary settings to start the web crawl. Apache Nutch crawls the web, discovering document content that pertains to its configuration. Using the indexer plugin, it uploads original binary (or text) versions of document content to the Google Cloud Search indexing API, where the content is indexed and ultimately served to your users.

Supported operating system

The Google Cloud Search Indexer Plugin for Apache Nutch must be installed on Linux.

Supported Apache Nutch version

The Google Cloud Search Indexer Plugin for Apache Nutch supports Nutch version 1.14. The indexer plugin software includes this version of Nutch.

Apache Tika Supported Document Types

Apache Nutch version 1.14 relies on Apache Tika version 1.17 for content parsing. For a list of document types indexable by the Apache Nutch indexer plugin, refer to Apache Tika Supported Document Formats.

ACL support

The indexer plugin supports controlling access to documents in the G Suite domain by using Access Control Lists (ACLs).

If default ACLs are enabled in the Google Cloud Search plugin configuration (defaultAcl.mode set to a value other than none and configured with defaultAcl.*), the indexer plugin first tries to create and apply a default ACL.

If default ACLs are not enabled, the plugin falls back to giving read permission to the entire G Suite domain.

For detailed descriptions of ACL configuration parameters, see Google-supplied connector parameters.

Prerequisites

Before you deploy the indexer plugin, ensure that you have the following required components:

  • Java JRE 1.8 installed on the computer that runs the indexer plugin
  • G Suite information required to establish relationships between Google Cloud Search and Apache Nutch: the data source ID and the service account private key file that you specify when you configure Google Cloud Search.

    Typically, the G Suite administrator for the domain can supply these credentials for you.

Deployment steps

To deploy the indexer plugin, follow these basic steps:

  1. Install Apache Nutch and the indexer plugin software
  2. Configure Google Cloud Search
  3. Configure Apache Nutch
  4. Configure web crawl
  5. Start a web crawl and content upload

Step 1: Install Apache Nutch and the indexer plugin software

The Google Cloud Search indexer plugin software must be installed on a host machine. Google provides the plugin software in the following pre-built binary distribution:

apache-nutch-1.14-v1-0.0.2-bin.tar.gz

The binary distribution also includes the Google Cloud Search Connector SDK.

To install Apache Nutch and the Google Cloud Search indexer plugin:

  1. Download the Apache Nutch indexer plugin.

  2. Extract apache-nutch-1.14-v1-0.0.2-bin.tar.gz.
    This action creates a folder, "apache-nutch-1.14-SNAPSHOT," that contains both Apache Nutch and the Google Cloud Search indexer plugin.
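
The extraction step can be sketched from the shell as follows. This is a minimal sketch, not part of the plugin distribution: the function name extract_plugin is hypothetical, and the destination directory ~/nutch in the usage comment is an assumption borrowed from the paths used later in this guide.

```shell
# Extract a .tar.gz archive into a destination directory and list
# the top-level entries that were created.
# extract_plugin is a hypothetical helper name used for illustration.
extract_plugin() {
  archive="$1"
  dest="$2"
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest" && ls "$dest"
}

# Usage (archive name from this guide; ~/nutch is an assumed location):
# extract_plugin apache-nutch-1.14-v1-0.0.2-bin.tar.gz ~/nutch
```

Listing the destination afterwards is a quick way to confirm the name of the folder that the archive created before you change into it.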

Step 2: Configure Google Cloud Search

To deploy an Apache Nutch Indexer Plugin, you need to create a Google Cloud Search configuration file called sdk-configuration.properties. This file must contain key/value pairs defining configuration information required by the connector.

The configuration file must specify the following parameters, which are necessary to access the Google Cloud Search data source.

Setting          Parameter
Data source ID   api.sourceId = 1234567890abcdef
                 Required. The Google Cloud Search source ID set up by the G Suite administrator.
Service account  api.serviceAccountPrivateKeyFile = ./PrivateKey.json
                 Required. The Google Cloud Search service account key file that was created by the G Suite administrator for indexer plugin accessibility.

The following example shows a Google Cloud Search configuration file.

#
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
#

The configuration file may also contain several other Google Cloud Search-specific configuration parameters, which may affect how the indexer plugin pushes data into the Google Cloud Search API. Examples of these parameters include defaultAcl.* and batch.*. For detailed descriptions of each parameter, see Google-supplied connector parameters.
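
Before moving on, it can help to confirm that both required parameters are present in the file. The following is a minimal sketch under stated assumptions: CONFIG_FILE and missing_keys are hypothetical names, and the sample file is written to a temporary location with the placeholder values from this guide so that the check can run anywhere. It only tests that each key has a non-empty value; it does not validate the source ID or the key file.

```shell
# Sample file for illustration; the values are the placeholders used in this guide.
CONFIG_FILE="$(mktemp)"
cat > "$CONFIG_FILE" <<'EOF'
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
EOF

# Print each required key that has no "key=value" line with a non-empty value.
# missing_keys is a hypothetical helper, not part of the plugin.
missing_keys() {
  for key in api.sourceId api.serviceAccountPrivateKeyFile; do
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$1" || echo "$key"
  done
}

missing="$(missing_keys "$CONFIG_FILE")"
if [ -z "$missing" ]; then
  echo "all required keys present"
else
  echo "missing: $missing"
fi
```

To check a real file, pass its path instead, for example missing_keys /path/to/sdk-configuration.properties.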

You can configure the indexer plugin to populate metadata and structured data for the content being indexed. Values for metadata and structured data fields can be extracted from meta tags in the HTML content being indexed, or default values can be specified in the configuration file.

Setting             Parameter
Title               itemMetadata.title.field=movieTitle
                    itemMetadata.title.defaultValue=Gone with the Wind
                    By default, the plugin uses the HTML title as the title of the document being indexed. If the title is missing, you can either refer to the metadata attribute that contains the document title or set a default value.
Created timestamp   itemMetadata.createTime.field=releaseDate
                    itemMetadata.createTime.defaultValue=1940-01-17
                    The metadata attribute that contains the value for the document creation timestamp.
Last modified time  itemMetadata.updateTime.field=releaseDate
                    itemMetadata.updateTime.defaultValue=1940-01-17
                    The metadata attribute that contains the value for the last modification timestamp for the document.
Document language   itemMetadata.contentLanguage.field=languageCode
                    itemMetadata.contentLanguage.defaultValue=en-US
                    The content language for documents being indexed.
Schema object type  itemMetadata.objectType=movie
                    The object type used by the site, as defined in the data source schema object definitions. The connector won't index any structured data if this property is not specified.

Note: The itemMetadata.objectType property points to a value rather than a metadata attribute, and the .field and .defaultValue suffixes are not supported.

Datetime formats

Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.

Setting                       Parameter
Additional datetime patterns  structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX
                              A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported.

Step 3: Configure Apache Nutch

The binary distribution that you extracted in Step 1 includes the Apache Nutch configuration file, nutch-site.xml.

This file contains the required values for the plugin.includes property:

index-basic
index-more
indexer-google-cloud-search

You must modify nutch-site.xml by adding the following parameters, which are necessary for interaction with Google Cloud Search.

Setting                                         Parameter
Path to Google Cloud Search configuration file  gcs.config.file = TBS
                                                Required. The full (absolute) path to the Google Cloud Search configuration file.
Upload format                                   gcs.uploadFormat = text
                                                Optional. The format in which the indexer plugin pushes document content to the Google Cloud Search indexer API. Valid values are:
                                                  • raw: the indexer plugin pushes original, unconverted document content.
                                                  • text: the indexer plugin pushes extracted textual content.
                                                The default value is raw.

The following example shows the required modification to nutch-site.xml.

<property>
  <name>gcs.config.file</name>
  <value>/path/to/sdk-configuration.properties</value>
  <description>Location of GCS Connector SDK configuration file.</description>
</property>
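
Because gcs.config.file must be an absolute path, a quick check of the value can catch a common mistake before the first crawl. The following is a rough sketch, assuming the value element sits on the line directly after the name element, as in the example above. NUTCH_SITE and config_path are hypothetical names, and the sample file is written to a temporary location so that the sketch can run anywhere. It is a line-based check, not an XML parser.

```shell
# Sample nutch-site.xml fragment for illustration (placeholder path from this guide).
NUTCH_SITE="$(mktemp)"
cat > "$NUTCH_SITE" <<'EOF'
<configuration>
<property>
  <name>gcs.config.file</name>
  <value>/path/to/sdk-configuration.properties</value>
</property>
</configuration>
EOF

# Print the <value> found on the line that follows <name>gcs.config.file</name>.
# config_path is a hypothetical helper, not part of Nutch or the plugin.
config_path() {
  sed -n '/<name>gcs\.config\.file<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$1"
}

path="$(config_path "$NUTCH_SITE")"
case "$path" in
  /*) echo "absolute path: $path" ;;
  '') echo "gcs.config.file is not set" ;;
  *)  echo "not an absolute path: $path" ;;
esac
```

To check a real installation, pass conf/nutch-site.xml from the Nutch installation directory instead of the sample file.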

Step 4: Configure web crawl

Before starting a web crawl, you must configure the crawl so that it only includes information that your organization wants to make available in search results. This section includes basic information on how to:

  • Set up start URLs
  • Set up follow and do-not-follow rules
  • Edit the crawl script

For more detailed information about setting up a web crawl, see the Nutch tutorial.

Set up start URLs

Start URLs control where the Apache Nutch web crawler begins crawling your content. The web crawler should be able to reach all content that you want to include in a particular crawl by following the links from one or more of the start URLs. Start URLs are required.

To set up start URLs:

  1. Change the working directory to the Nutch installation directory:
    $ cd ~/nutch/apache-nutch-X.Y/
  2. Create a directory for URLs:
    $ mkdir urls
  3. Create a file named seed.txt in the urls directory and write the start URLs into it (one per line):
    $ nano urls/seed.txt
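
If you prefer to create the seed file without an interactive editor, the same steps can be sketched as follows. NUTCH_HOME is an assumed variable name, not something the plugin defines. When it is unset, the sketch falls back to a temporary directory so that it can run anywhere. The start URL is the example prefix used elsewhere in this guide.

```shell
# NUTCH_HOME is an assumed variable; point it at your Nutch installation directory.
# A temporary directory is used when it is unset.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"

# Create the urls directory and write the start URLs, one per line.
mkdir -p "$NUTCH_HOME/urls"
cat > "$NUTCH_HOME/urls/seed.txt" <<'EOF'
https://support.google.com/gsa/
EOF

# Print the number of non-empty lines, that is, the number of start URLs.
grep -c . "$NUTCH_HOME/urls/seed.txt"
```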

Set up follow and do-not-follow rules

Follow URL rules control which URLs are crawled and included in the Google Cloud Search index. Before crawling any URLs, the web crawler checks them against follow URL rules. Only URLs that match these rules are crawled and indexed.

Do-not-follow rules exclude URLs from being crawled and included in the Google Cloud Search index. If a URL matches a do-not-follow pattern, the web crawler does not crawl it.

To set up follow and do-not-follow URL rules:

  1. Change the working directory to the Nutch installation directory:
    $ cd ~/nutch/apache-nutch-X.Y/
  2. Edit conf/regex-urlfilter.txt to change follow/do-not-follow rules:
    $ nano conf/regex-urlfilter.txt
  3. Add regular expressions (open-ended is fine), each prefixed with "+" to follow or "-" to not follow the matching URL patterns, extensions, and so on, as shown in the following examples.

Examples:

# skip file extensions
-\.(gif|GIF|jpg|JPG|png|PNG|ico)

# skip protocols (file: ftp: and mailto:)
-^(file|ftp|mailto):

# allow urls starting with https://support.google.com/gsa/
+^https://support.google.com/gsa/

# accept anything else
# (commented out due to the single url-prefix allowed above)
#+.
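
To preview how a rule set treats a few sample URLs before starting a crawl, the matching logic can be approximated from the shell. This is a rough sketch, not the filter that Nutch runs. It assumes that the first matching rule decides the outcome and that a URL matching no rule is skipped. It uses grep -E, whose extended regular expressions differ from Java regular expressions for more advanced patterns. RULES_FILE and check_url are hypothetical names.

```shell
# Rules copied from the examples above, written to a temporary file.
RULES_FILE="$(mktemp)"
cat > "$RULES_FILE" <<'EOF'
# skip file extensions
-\.(gif|GIF|jpg|JPG|png|PNG|ico)
# skip protocols (file: ftp: and mailto:)
-^(file|ftp|mailto):
# allow urls starting with https://support.google.com/gsa/
+^https://support.google.com/gsa/
EOF

# Print "follow" or "skip" for one URL.
# Assumption: the first matching rule wins; no match means skip.
check_url() {
  while IFS= read -r rule; do
    case "$rule" in
      '+'*) verdict="follow"; pattern="${rule#+}" ;;
      '-'*) verdict="skip";   pattern="${rule#-}" ;;
      *)    continue ;;  # comments, blank lines, and anything else
    esac
    if printf '%s\n' "$1" | grep -Eq -- "$pattern"; then
      echo "$verdict"
      return 0
    fi
  done < "$RULES_FILE"
  echo "skip"
}

for url in \
  'https://support.google.com/gsa/answer/1' \
  'https://support.google.com/gsa/logo.png' \
  'mailto:someone@example.com' \
  'https://example.com/'; do
  printf '%s  %s\n' "$(check_url "$url")" "$url"
done
```

Only the first sample URL is followed: the second matches the extension rule before it reaches the allow rule, the third matches the protocol rule, and the last matches no rule at all.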

Edit the crawl script

If the gcs.uploadFormat parameter is missing or set to "raw," you must add "-addBinaryContent -base64" arguments to be passed to the nutch index command. These arguments tell the Nutch Indexer module to include binary content in Base64 when invoking the indexer plugin. The ./bin/crawl script does not have these arguments by default.

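Edit the crawl script in apache-nutch-1.14-SNAPSHOT/bin and add the "-addBinaryContent -base64" arguments to the index command, as shown in the following section of the script: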

      if $INDEXFLAG; then
          echo "Indexing $SEGMENT to index"
          # -addBinaryContent -base64: pass the binary content, Base64-encoded, to the indexer plugin
          __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -addBinaryContent -base64 -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT

          echo "Cleaning up index if possible"
          __bin_nutch clean $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb
      else
          echo "Skipping indexing ..."
      fi
Step 5: Start a web crawl and content upload

After you have installed and set up the indexer plugin, you can run it on its own in local mode. Use the scripts from ./bin to execute a crawling job or individual nutch commands.

The following example assumes the required components are located in the local directory on a Linux system. Run nutch with the following command from the apache-nutch-1.14-SNAPSHOT folder:

bin/crawl -i -s urls/ crawl-test/ 5

Crawl logs are available on standard output (the terminal) or in the logs/ directory. To redirect the logging output or to enable more verbose logging, edit conf/log4j.properties.
