You can set up Google Cloud Search to serve web content to your users by deploying the Cloud Search indexer plugin for Apache Nutch, an open source web crawler.
When you start the web crawl, Apache Nutch crawls the web and uses the indexer plugin to upload original binary (or text) versions of document content to the Google Cloud Search API. The Cloud Search API indexes the content and serves the results to your users.
Important considerations
Before you deploy the indexer plugin, be aware of the following considerations.
System requirements
| System requirements | |
|---|---|
| Operating system | Linux only |
| Software | |
| Apache Tika document types | Apache Tika 1.18 supported document formats |
Deploy the indexer plugin
These steps describe how to install the indexer plugin and configure its components to crawl URLs and return results to Cloud Search.
Prerequisites
Before you deploy the indexer plugin, gather the information required to connect Cloud Search and the data source:
- Google Workspace private key (which contains the service account ID). For information on obtaining a private key, go to Configure access to the Cloud Search API.
- Google Workspace data source ID. For information on obtaining a data source ID, go to Add a data source to search.
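Before you continue, it can help to confirm that the private key file looks complete. The following is a minimal sketch, assuming the key is a JSON service account key file that contains client_email and private_key fields. The check_key_file function, the file name, and the sample contents are hypothetical placeholders, not part of the plugin.

```shell
# Sketch: sanity-check a service account key file before configuring the plugin.
# Assumption: the key is a JSON key file with "client_email" and "private_key" fields.
check_key_file() {
  key_file="$1"
  [ -r "$key_file" ] || { echo "cannot read $key_file"; return 1; }
  for field in client_email private_key; do
    grep -q "\"$field\"" "$key_file" || { echo "missing field: $field"; return 1; }
  done
  echo "key file looks complete"
}

# Demo with a placeholder file (not a real key).
demo_dir="$(mktemp -d)"
printf '{"type": "service_account", "client_email": "indexer@example.iam.gserviceaccount.com", "private_key": "PLACEHOLDER"}\n' > "$demo_dir/PrivateKey.json"
check_key_file "$demo_dir/PrivateKey.json"
```

The check only looks for field names, so it catches a truncated or wrong file but does not prove that the key is valid.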
Step 1: Build and install the plugin software and Apache Nutch
Clone the indexer plugin repository from GitHub.
$ git clone https://github.com/google-cloudsearch/apache-nutch-indexer-plugin.git
$ cd apache-nutch-indexer-plugin
Check out the version of the indexer plugin you want:
$ git checkout tags/v1-0.0.5
Build the indexer plugin.
$ mvn package
To skip tests when building the plugin, use mvn package -DskipTests.
Download Apache Nutch 1.15 and follow the Apache Nutch installation instructions.
Extract target/google-cloudsearch-apache-nutch-indexer-plugin-v1.0.0.5.zip to a folder. Copy the plugins/indexer-google-cloudsearch folder to the Apache Nutch plugins folder (apache-nutch-1.15/plugins).
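After the copy, a quick check can confirm that the plugin folder ended up where Nutch looks for plugins. The following sketch assumes only the folder names given in this step. NUTCH_HOME and plugin_installed are hypothetical names, and the demo runs against a throwaway directory rather than a real installation.

```shell
# Sketch: confirm the plugin folder was copied into the Nutch plugins folder.
# NUTCH_HOME is a hypothetical variable pointing at the apache-nutch-1.15 folder.
plugin_installed() {
  nutch_home="$1"
  if [ -d "$nutch_home/plugins/indexer-google-cloudsearch" ]; then
    echo "indexer-google-cloudsearch found in $nutch_home/plugins"
  else
    echo "indexer-google-cloudsearch missing from $nutch_home/plugins"
    return 1
  fi
}

# Demo against a throwaway directory tree that mimics the layout after the copy.
NUTCH_HOME="$(mktemp -d)/apache-nutch-1.15"
mkdir -p "$NUTCH_HOME/plugins/indexer-google-cloudsearch"
plugin_installed "$NUTCH_HOME"
```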
Step 2: Configure the indexer plugin
To configure the plugin, create a file named plugin-configuration.properties.
The configuration file must specify the following parameters to access the
Cloud Search data source.
| Setting | Parameter |
|---|---|
| Data source ID | api.sourceId = 1234567890abcdef (required): the Cloud Search source ID that the Google Workspace administrator set up for the indexer plugin. |
| Service account | api.serviceAccountPrivateKeyFile = ./PrivateKey.json (required): the Cloud Search service account key file that the Google Workspace administrator created so that the indexer plugin can access Cloud Search. |
The following example shows a sample configuration file:
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
The configuration file can also contain parameters that control plugin behavior, such as how the plugin pushes data into the Cloud Search API, and how it populates metadata and structured data. For descriptions of these parameters, see Google-supplied connector parameters.
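One way to create the file and confirm that both required parameters are set is sketched below. It uses the placeholder ID and key path from this step. The config_dir location and the grep-based check are illustrative assumptions, not something the plugin provides.

```shell
# Sketch: write plugin-configuration.properties and confirm both required keys are set.
# The ID and key path are the placeholder values used in this step.
config_dir="$(mktemp -d)"   # stand-in for the folder where you keep the configuration
config_file="$config_dir/plugin-configuration.properties"

cat > "$config_file" <<'EOF'
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
EOF

missing=""
for key in api.sourceId api.serviceAccountPrivateKeyFile; do
  # A key counts as set when a line starts with the key, then "=", then a value.
  grep -Eq "^${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$config_file" || missing="$missing $key"
done

if [ -z "$missing" ]; then
  echo "all required parameters present"
else
  echo "missing required parameters:$missing"
fi
```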
Step 3: Configure Apache Nutch
Open conf/nutch-site.xml and add the following parameters:
| Setting | Parameter |
|---|---|
| Plugin includes | plugin.includes = text (required): list of plugins to use. This must include at least index-basic, index-more, and indexer-google-cloudsearch. conf/nutch-default.xml provides a default value, but you must manually add indexer-google-cloudsearch to it. |
| Metatags names | metatags.names = text (optional): comma-separated list of tags that map to properties in the corresponding data source schema. To learn more, see Nutch-parse metatags. |
The following example shows the required modification to nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more|metadata)|query-(basic|site|url|lang)|indexer-google-cloudsearch|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf|metatags)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
Open conf/index-writers.xml and add the following section:
<writer id="indexer_google_cloud_search_1" class="org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter">
  <parameters>
    <param name="gcs.config.file" value="path/to/sdk-configuration.properties"/>
  </parameters>
  <mapping>
    <copy />
    <rename />
    <remove />
  </mapping>
</writer>
The <writer> section contains the following parameters:
| Setting | Parameter |
|---|---|
| Path to Cloud Search configuration file | gcs.config.file = path (required): the full (absolute) path to the Cloud Search configuration file. |
| Upload format | gcs.uploadFormat = text (optional): the format the plugin uses to push document content to the Cloud Search API. Valid values are raw, which pushes original, unconverted content, and text, which pushes extracted textual content. The default is raw. |
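Two mistakes are easy to make in this step: leaving indexer-google-cloudsearch out of plugin.includes, and giving gcs.config.file a relative path. The sketch below checks for both with grep and sed. It assumes each value and param element sits on a single line. The demo files are trimmed stand-ins, and the /opt/nutch path is a hypothetical location.

```shell
# Sketch: check two common mistakes in the Nutch configuration.
# Assumes each <value> and <param> element sits on a single line.
conf_dir="$(mktemp -d)"    # stand-in for apache-nutch-1.15/conf

# Trimmed demo files; a real plugin.includes value lists more plugins.
cat > "$conf_dir/nutch-site.xml" <<'EOF'
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|more)|indexer-google-cloudsearch</value>
  </property>
</configuration>
EOF

cat > "$conf_dir/index-writers.xml" <<'EOF'
<writer id="indexer_google_cloud_search_1" class="org.apache.nutch.indexwriter.gcs.GoogleCloudSearchIndexWriter">
  <parameters>
    <param name="gcs.config.file" value="/opt/nutch/plugin-configuration.properties"/>
  </parameters>
</writer>
EOF

# Check 1: the plugin must appear in the plugin.includes value.
if grep -q '<value>.*indexer-google-cloudsearch' "$conf_dir/nutch-site.xml"; then
  plugin_check="ok"
else
  plugin_check="indexer-google-cloudsearch missing from plugin.includes"
fi

# Check 2: gcs.config.file must be an absolute path (starts with "/").
gcs_path="$(sed -n 's/.*name="gcs\.config\.file" value="\([^"]*\)".*/\1/p' "$conf_dir/index-writers.xml")"
case "$gcs_path" in
  /*) path_check="ok" ;;
  *)  path_check="gcs.config.file is not an absolute path: $gcs_path" ;;
esac

echo "plugin.includes: $plugin_check"
echo "gcs.config.file: $path_check"
```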
Step 4: Configure web crawl
Before you start a web crawl, configure it to only include information that your organization wants to make available. For more information, see the Nutch tutorial.
Set up start URLs.
Start URLs control where the web crawler begins crawling your content. The crawler must be able to reach all content you want to include by following the links.
To set up start URLs:
- Change to the Nutch installation directory:
$ cd ~/nutch/apache-nutch-X.Y/
- Create a directory for URLs:
$ mkdir urls
- Create a file named seed.txt in the urls directory and list one URL per line.
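The steps above can be sketched as a short script. It assumes that seed.txt belongs inside the urls directory, which matches the urls/ argument passed to the crawl script in Step 5. The work_dir variable stands in for the Nutch installation directory, and the URL is a placeholder borrowed from the URL filter example in this guide.

```shell
# Sketch: create urls/seed.txt with one start URL per line.
# work_dir stands in for the Nutch installation directory; the URL is a placeholder.
work_dir="$(mktemp -d)"
mkdir -p "$work_dir/urls"

cat > "$work_dir/urls/seed.txt" <<'EOF'
https://support.google.com/gsa/
EOF

# Count non-empty lines to confirm how many start URLs the crawler will receive.
seed_count="$(grep -c . "$work_dir/urls/seed.txt")"
echo "seed.txt contains $seed_count start URL(s)"
```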
Set up follow and do-not-follow rules.
Follow URL rules control which URLs the crawler indexes. Do-not-follow rules exclude URLs from being crawled.
To set up these rules:
- Change to the Nutch installation directory.
- Edit conf/regex-urlfilter.txt:
$ nano conf/regex-urlfilter.txt
- Enter regular expressions with a "+" prefix (follow) or a "-" prefix (do not follow):
# skip file extensions
-\.(gif|GIF|jpg|JPG|png|PNG|ico)
# skip protocols (file: ftp: and mailto:)
-^(file|ftp|mailto):
# allow urls starting with https://support.google.com/gsa/
+^https://support.google.com/gsa/
# accept anything else
#+.
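To see how a rule set like this treats a given URL, the following sketch emulates the filter in plain shell. It assumes that rules are checked from top to bottom, that the first matching rule decides, and that a URL matching no rule is rejected. The check_url function is hypothetical, and grep -E stands in for the Java regular expression engine that Nutch uses, so results may differ for more advanced patterns.

```shell
# Sketch: emulate how the rules in regex-urlfilter.txt are applied to one URL.
# Assumptions: first matching rule wins; a URL that matches no rule is rejected.
rules_file="$(mktemp)"
cat > "$rules_file" <<'EOF'
# skip file extensions
-\.(gif|GIF|jpg|JPG|png|PNG|ico)
# skip protocols (file: ftp: and mailto:)
-^(file|ftp|mailto):
# allow urls starting with https://support.google.com/gsa/
+^https://support.google.com/gsa/
EOF

check_url() {
  url="$1"
  while IFS= read -r rule; do
    case "$rule" in
      ''|'#'*) continue ;;          # skip blank lines and comments
      '+'*) verdict="accept" ;;     # "+" prefix: follow matching URLs
      '-'*) verdict="reject" ;;     # "-" prefix: do not follow matching URLs
      *) continue ;;                # ignore anything without a prefix
    esac
    pattern="${rule#?}"             # drop the prefix; the rest is the regex
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      echo "$verdict"
      return 0
    fi
  done < "$rules_file"
  echo "reject"                     # no rule matched
}

check_url "https://support.google.com/gsa/answer/123"
check_url "https://support.google.com/gsa/logo.png"
check_url "https://example.com/page"
```

With these rules, the first URL is accepted, the second is rejected by the file extension rule, and the third is rejected because no rule matches it. Rule order matters: the image under the allowed prefix is still skipped because the extension rule comes first.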
Edit the crawl script.
If the gcs.uploadFormat parameter is missing or set to "raw," you must add the -addBinaryContent -base64 arguments to the nutch index command. These arguments tell the Nutch Indexer module to include binary content, encoded in Base64.
- Open the crawl script in apache-nutch-1.15/bin.
- Add the options as shown in this example:
if $INDEXFLAG; then
    echo "Indexing $SEGMENT to index"
    __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -addBinaryContent -base64 -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
    echo "Cleaning up index if possible"
    __bin_nutch clean $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb
else
    echo "Skipping indexing ..."
fi
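Whether the extra arguments are needed depends only on the gcs.uploadFormat value, so the decision can be sketched as a small check against conf/index-writers.xml. The sketch assumes the parameter, when present, is a single-line param element in the writer section and that the value is compared exactly. The needs_binary_flags function and the demo files are hypothetical.

```shell
# Sketch: decide whether the crawl script needs -addBinaryContent -base64.
# Assumes gcs.uploadFormat, when set, is a single-line <param> element.
needs_binary_flags() {
  writers_xml="$1"
  format="$(sed -n 's/.*name="gcs\.uploadFormat" value="\([^"]*\)".*/\1/p' "$writers_xml")"
  # A missing parameter falls back to the default, raw.
  if [ -z "$format" ]; then
    format="raw"
  fi
  [ "$format" = "raw" ]
}

# Demo with two trimmed stand-ins for conf/index-writers.xml.
demo_dir="$(mktemp -d)"
printf '<param name="gcs.config.file" value="/path/to/config"/>\n' > "$demo_dir/raw-writers.xml"
printf '<param name="gcs.uploadFormat" value="text"/>\n' > "$demo_dir/text-writers.xml"

for f in raw-writers.xml text-writers.xml; do
  if needs_binary_flags "$demo_dir/$f"; then
    echo "$f: add -addBinaryContent -base64 to the nutch index command"
  else
    echo "$f: no extra arguments needed"
  fi
done
```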
Step 5: Start a web crawl and content upload
After you set up the indexer plugin, you can run it in local mode. Use scripts
from ./bin to execute a crawling job.
The following example assumes components are in the local directory. Run Nutch
from the apache-nutch-1.15 directory:
$ bin/crawl -i -s urls/ crawl-test/ 5
Crawl logs are available in the terminal or the logs/ directory. To direct
logging output, edit conf/log4j.properties.
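The crawl command packs several choices into one line, so it can help to spell out each argument. The following sketch rebuilds the example command from named parts and prints it rather than running it. The variable names are illustrative, and the comments reflect how the Nutch 1.15 crawl script is understood to interpret its arguments.

```shell
# Sketch: assemble the crawl command from named parts so each argument is explicit.
# The values mirror the example command in this step; adjust them for your own crawl.
SEED_DIR="urls/"          # -s: directory that holds seed.txt
CRAWL_DIR="crawl-test/"   # directory where Nutch stores crawldb, linkdb, and segments
NUM_ROUNDS=5              # number of crawl rounds to run

# -i asks the script to index each round's segment, which is the step that
# hands content to the indexer plugin.
CRAWL_CMD="bin/crawl -i -s $SEED_DIR $CRAWL_DIR $NUM_ROUNDS"

# Print the command instead of running it; run it from the apache-nutch-1.15 directory.
echo "$CRAWL_CMD"
```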