Deploy an Apache Nutch Indexer Plugin

You can set up Google Cloud Search to serve web content to your users by deploying the Google Cloud Search indexer plugin for Apache Nutch, an open source web crawler.

When you start the web crawl, Apache Nutch crawls the web and uses the indexer plugin to upload original binary (or text) versions of document content to the Google Cloud Search indexing API. The indexing API indexes the content and serves the results to your users.

Important considerations

System requirements

Operating system
  Linux only:
  • Ubuntu
  • Red Hat Enterprise Linux 5.0
  • SUSE Enterprise Linux 10 (64 bit)
Software
  • Apache Nutch version 1.15. The indexer plugin software includes this version of Nutch.
  • Java JRE 1.8 installed on the computer that will run the indexer plugin
Apache Tika document types
  Apache Tika 1.18 supported document formats
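
To verify that a suitable JRE is present on the machine that will run the indexer plugin, you can check the reported version. This is only a quick sanity check, not part of the official setup, and the exact output varies by Java vendor:

  $ java -version
  # Expect a 1.8.x version string, for example:
  # openjdk version "1.8.0_292"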

Deploy the indexer plugin

The following steps describe how to install the indexer plugin and configure its components to crawl the specified URLs and return the results to Cloud Search.

Prerequisites

Before you deploy the Cloud Search Apache Nutch indexer plugin, gather the information required to connect Google Cloud Search to the data source: the data source ID and the service account private key file that your Google Workspace administrator created for the indexer plugin.

Step 1: Build and install the plugin software and Apache Nutch

  1. Clone the indexer plugin repository from GitHub.

    $ git clone https://github.com/google-cloudsearch/apache-nutch-indexer-plugin.git
    $ cd apache-nutch-indexer-plugin
  2. Check out the desired version of the indexer plugin:

    $ git checkout tags/v1-0.0.5
  3. Build the indexer plugin.

    $ mvn package

    To skip the tests when building the indexer plugin, use mvn package -DskipTests.

  4. Download Apache Nutch 1.15 and follow the Apache Nutch installation instructions.

  5. Extract target/google-cloudsearch-apache-nutch-indexer-plugin-v1-0.0.5.zip (built in step 3) to a folder. Copy the plugins/indexer-google-cloudsearch folder to the Apache Nutch installation's plugins folder (apache-nutch-1.15/plugins), as shown in the sketch below.
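
    The following shell commands sketch this step. The temporary extraction directory and the Nutch installation path are assumptions, so adjust them to match your environment:

    $ unzip target/google-cloudsearch-apache-nutch-indexer-plugin-v1-0.0.5.zip -d /tmp/gcs-plugin
    $ cp -r /tmp/gcs-plugin/plugins/indexer-google-cloudsearch ~/apache-nutch-1.15/plugins/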

Step 2: Configure the indexer plugin

To configure the Apache Nutch Indexer Plugin, create a file called plugin-configuration.properties.

The configuration file must specify the following parameters, which are necessary to access the Google Cloud Search data source.

Data source ID
  Parameter: api.sourceId = 1234567890abcdef
  Required. The Google Cloud Search source ID that the Google Workspace admin set up for the indexer plugin.
Service account
  Parameter: api.serviceAccountPrivateKeyFile = ./PrivateKey.json
  Required. The Google Cloud Search service account key file that the Google Workspace admin created so the indexer plugin can access the data source.

The following example shows a sample configuration file with the required parameters.

#
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
#

The configuration file can also contain other parameters that control indexer plugin behavior. You can configure how the plugin pushes data to the Cloud Search API (the defaultAcl.* and batch.* parameters) and how the indexer plugin populates metadata and structured data.

For descriptions of these parameters, go to Google-supplied connector parameters.
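
As an illustration only, a configuration that also sets a public default ACL and a smaller upload batch size might look like the following. The optional parameter names are taken from the Google-supplied connector parameters reference; verify their exact names, values, and defaults there before relying on this sketch:

#
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
#
# optional behavior (verify against the connector parameters reference)
defaultAcl.mode=fallback
defaultAcl.public=true
batch.batchSize=10
#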

Step 3: Configure Apache Nutch

  1. Open conf/nutch-site.xml and add the following parameters:

    Plugin includes
      Parameter: plugin.includes = text

      Required. List of plugins to use. This must include at least:

      • index-basic
      • index-more
      • indexer-google-cloudsearch

      conf/nutch-default.xml provides a default value for this property, but you must also manually add indexer-google-cloudsearch to it.
    Metatags names
      Parameter: metatags.names = text

      Optional. Comma-separated list of tags that map to properties in the corresponding data source's schema. To learn more about how to set up Apache Nutch for metatags, go to Nutch-parse metatags. A sample metatags.names property appears after the nutch-site.xml snippet below.

    The following example shows the required modification to nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more|metadata)|query-(basic|site|url|lang)|indexer-google-cloudsearch|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
    </property>
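
    If you also use metatags, a metatags.names property along the following lines could be added to nutch-site.xml. The tag names description and keywords are placeholders; list the metatags that actually map to your data source's schema:

    <property>
      <name>metatags.names</name>
      <value>description,keywords</value>
    </property>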
    
  2. Open conf/index-writers.xml and add the following section:

    <writer id="indexer_google_cloud_search_1" class="org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter">
      <parameters>
        <param name="gcs.config.file" value="path/to/sdk-configuration.properties"/>
      </parameters>
      <mapping>
        <copy />
        <rename />
        <remove />
      </mapping>
    </writer>
    

    The <writer> section contains the following parameters:

    Path to Google Cloud Search configuration file
      Parameter: gcs.config.file = path

      Required. The full (absolute) path to the Google Cloud Search configuration file.

    Upload format
      Parameter: gcs.uploadFormat = text

      Optional. The format in which the indexer plugin pushes document content to the Google Cloud Search indexing API. Valid values are:

      • raw: the indexer plugin pushes original, unconverted document content.
      • text: the indexer plugin pushes extracted textual content.

      The default value is raw. A sample <parameters> block that sets this value follows the table.
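
    For example, to push extracted text instead of raw content, you could add gcs.uploadFormat to the writer's <parameters> block in conf/index-writers.xml, keeping your own gcs.config.file path. This is a sketch, not a required change:

    <parameters>
      <param name="gcs.config.file" value="path/to/sdk-configuration.properties"/>
      <param name="gcs.uploadFormat" value="text"/>
    </parameters>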

Step 4: Configure web crawl

Before you start a web crawl, configure the crawl so that it only includes information that your organization wants to make available in search results. This section provides an overview; for more information about how to set up a web crawl, go to the Nutch tutorial.

  1. Set up start URLs.

    Start URLs control where the Apache Nutch web crawler begins crawling your content. The start URLs should enable the web crawler to reach all content that you want to include in a particular crawl by following the links. Start URLs are required.

    To set up start URLs:

    1. Change the working directory to the Nutch installation directory:

      $ cd ~/nutch/apache-nutch-X.Y/
    2. Create a directory for urls:

      $ mkdir urls
    3. Create a file named seed.txt in the urls directory and list the URLs to crawl, one URL per line, as in the following example.
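
      A minimal urls/seed.txt might look like this; the URLs are placeholders, so list the sites that you actually want to crawl:

      https://support.google.com/gsa/
      https://www.example.com/docs/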

  2. Set up follow and do-not-follow rules.

    Follow URL rules control which URLs are crawled and included in the Google Cloud Search index. The web crawler checks URLs against the follow URL rules. Only URLs that match these rules are crawled and indexed.

    Do-not-follow rules exclude URLs from being crawled and included in the Google Cloud Search index. If a URL matches a do-not-follow rule, the web crawler does not crawl it.

    To set up follow and do-not-follow URL rules:

    1. Change the working directory to the Nutch installation directory:

      $ cd ~/nutch/apache-nutch-X.Y/
    2. Edit conf/regex-urlfilter.txt to change follow/do-not-follow rules:

      $ nano conf/regex-urlfilter.txt
    3. Enter regular expressions with a "+" or "-" prefix to follow / do-not-follow URL patterns and extensions, as shown in the following examples. Open-ended expressions are allowed.

      # skip file extensions
      -\.(gif|GIF|jpg|JPG|png|PNG|ico)
      
      # skip protocols (file: ftp: and mailto:)
      -^(file|ftp|mailto):
      
      # allow urls starting with https://support.google.com/gsa/
      +^https://support.google.com/gsa/
      
      # accept anything else
      # (commented out due to the single url-prefix allowed above)
      #+.
      
  3. Edit the crawl script.

    If the gcs.uploadFormat parameter is missing or set to "raw", you must add the "-addBinaryContent -base64" arguments to the nutch index command. These arguments tell the Nutch Indexer module to include binary content, encoded in Base64, when it invokes the indexer plugin. The ./bin/crawl script doesn't include these arguments by default.

    1. Open the crawl script in apache-nutch-1.15/bin.
    2. Add the -addBinaryContent -base64 options to the script, as in the following example:

            if $INDEXFLAG; then
                echo "Indexing $SEGMENT to index"
                __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -addBinaryContent -base64 -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
      
                echo "Cleaning up index if possible"
                __bin_nutch clean $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb
            else
                echo "Skipping indexing ..."
      

Step 5: Start a web crawl and content upload

After you install and set up the indexer plugin, you can run it on its own in local mode. Use the scripts from ./bin to execute a crawling job or individual Nutch commands.

The following example assumes that the required components are located in the local directory. Run Nutch from the apache-nutch-1.15 directory with the following command, where -i tells the crawl script to index the fetched content, -s urls/ points to the seed URL directory, crawl-test/ is the crawl directory, and 5 is the number of crawl rounds:

$ bin/crawl -i -s urls/ crawl-test/ 5

Crawl logs are available on standard output (the terminal) and in the logs/ directory. To redirect the logging output or to enable more verbose logging, edit conf/log4j.properties, as in the sketch below.
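
For example, to get more detail from the indexer plugin itself, you could raise the log level for its package. This sketch assumes the logger and appender names that ship in the stock Nutch 1.15 conf/log4j.properties; check your copy of the file and adjust accordingly:

# Log the Cloud Search index writer at DEBUG level.
# The package name comes from the writer class org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter;
# the cmdstdout appender is assumed from the default Nutch log4j.properties.
log4j.logger.org.apache.nutch.indexwriter.gcs=DEBUG,cmdstdout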