This guide is for administrators responsible for downloading, deploying, and maintaining the Google Cloud Search Norconex HTTP Collector indexer plugin. You should be familiar with Linux, web crawling fundamentals, XML, and Norconex HTTP Collector.
This guide includes instructions to:
- Download the indexer plugin software.
- Configure Cloud Search.
- Configure Norconex HTTP Collector and web crawling.
- Start the web crawl and upload content.
Information about the tasks the Google Workspace administrator must perform doesn't appear in this guide. For information on those tasks, see Manage third-party data sources.
Overview of the Norconex HTTP Collector indexer plugin
By default, Cloud Search can discover, index, and serve content from Google Workspace products, such as Google Docs and Gmail. You can extend this to include web content by deploying the indexer plugin for Norconex HTTP Collector, an open source enterprise web crawler.
Configuration properties files
To enable the plugin to crawl and upload content, you must provide specific information in two configuration files:
- gcs-crawl-config.xml: settings for Norconex HTTP Collector.
- sdk-configuration.properties: settings for Cloud Search.
Web crawl and content upload
After you populate the configuration files, you can start the web crawl. Norconex HTTP Collector crawls the web and uploads original binary or text document content to the Cloud Search indexing API.
System requirements
- Operating system: Linux only.
- Norconex version: Version 2.8.0.
- Software: Java JRE 1.8.
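As a rough pre-flight check of the requirements above, the following shell sketch tests the operating system name and the Java version string. The function names (is_java8, parse_java_version, check_host) are hypothetical helpers, not part of the plugin, and the sketch assumes that java -version prints the version in double quotes on its first line.

```shell
# Returns 0 when the version string denotes Java 8, which reports itself as "1.8.x".
is_java8() {
  case "$1" in
    1.8|1.8.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads `java -version` output on stdin and prints the quoted version
# found on the first line, e.g. 'openjdk version "1.8.0_292"' -> 1.8.0_292
parse_java_version() {
  sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Prints "ok" when both values match the stated requirements.
# Both arguments are passed in so the check can be exercised without Java installed.
check_host() {
  os_name=$1       # e.g. the output of: uname -s
  java_version=$2  # e.g. the output of: java -version 2>&1 | parse_java_version
  [ "$os_name" = "Linux" ] || { echo "unsupported OS: $os_name"; return 1; }
  is_java8 "$java_version" || { echo "unsupported Java: $java_version"; return 1; }
  echo "ok"
}
```

On a real host you might call it as: check_host "$(uname -s)" "$(java -version 2>&1 | parse_java_version)". Note that java -version writes to standard error, hence the 2>&1.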
ACL support
The indexer plugin supports Access Control Lists (ACLs) to control access to documents in the Google Workspace domain.
If you enable default ACLs in the plugin configuration (defaultAcl.mode set to
a value other than none), the plugin applies these defaults. Otherwise, the
plugin grants read permission to the entire domain. See
Google-supplied connector parameters.
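To illustrate the first case, the sketch below appends a default-ACL stanza to a connector properties file. Only defaultAcl.mode is named in this guide. The value fallback and the defaultAcl.public parameter are assumptions to confirm against Google-supplied connector parameters, and append_default_acl is a hypothetical helper name.

```shell
# Appends a default-ACL stanza to the properties file given as $1.
# The mode value and the defaultAcl.public line are assumed, not taken from this guide.
append_default_acl() {
  cat >> "$1" <<'EOF'
# default ACLs (assumed values; confirm against the connector parameter reference)
defaultAcl.mode=fallback
defaultAcl.public=true
EOF
}
```

For example: append_default_acl sdk-configuration.properties. Because the function appends, running it twice duplicates the stanza, so check the file before rerunning it.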
Prerequisites
Before you deploy the indexer plugin, gather these components:
- Google Workspace private key (containing the service account ID). See Configure access to the Cloud Search API.
- Google Workspace data source ID. See Manage third-party data sources.
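Before continuing, it can help to confirm that the private key file is in place. The sketch below is one possible check. It assumes the key is a JSON file that carries the service account ID in a "client_email" field, which is the usual layout of Google service account keys but is not stated in this guide. The function name check_key_file is hypothetical.

```shell
# Prints "ok" when the key file is readable and mentions a "client_email" field.
# This is a plain text match, not a JSON parse, so it only catches obvious mistakes.
check_key_file() {
  key_file=$1
  [ -r "$key_file" ] || { echo "missing or unreadable: $key_file"; return 1; }
  grep -q '"client_email"' "$key_file" \
    || { echo "no client_email field in: $key_file"; return 1; }
  echo "ok"
}
```

For example: check_key_file ./PrivateKey.json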
Deployment steps
- Install Norconex HTTP Collector and the plugin software
- Configure Cloud Search
- Configure Norconex HTTP Collector
- Configure web crawl
- Start a web crawl and content upload
Step 1: Install Norconex HTTP Collector and the plugin software
- Download the Norconex committer software from the Norconex download page.
- Extract the software to ~/norconex/.
- Clone the committer plugin:
  git clone https://github.com/google-cloudsearch/norconex-committer-plugin.git
  cd norconex-committer-plugin
- Check out your selected version and build the plugin:
  git checkout tags/v1-0.0.3
  mvn package
  To skip tests, use mvn package -DskipTests.
- Copy the JAR file to the Norconex lib directory:
  cp target/google-cloudsearch-norconex-committer-plugin-v1-0.0.3.jar ~/norconex/norconex-collector-http-VERSION/lib
- Extract the built ZIP file:
  unzip target/google-cloudsearch-norconex-committer-plugin-v1-0.0.3.zip
  cd google-cloudsearch-norconex-committer-plugin-v1-0.0.3
- Run the install script and provide the full path to the Norconex lib directory:
  sh install.sh
- If prompted for duplicate files, select option 1.
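After the installation, you may want to confirm that a plugin JAR actually landed in the Norconex lib directory. The following is a minimal sketch of such a check. The function name plugin_jar_installed is hypothetical, and the file name pattern is inferred from the JAR name used in the copy command above.

```shell
# Prints the first matching plugin JAR in the given lib directory,
# or returns 1 when none is present.
plugin_jar_installed() {
  lib_dir=$1
  for jar in "$lib_dir"/google-cloudsearch-norconex-committer-plugin-*.jar; do
    # When the glob matches nothing, $jar is the literal pattern and -f fails.
    [ -f "$jar" ] && { echo "found: ${jar##*/}"; return 0; }
  done
  echo "plugin JAR not found in $lib_dir"
  return 1
}
```

For example: plugin_jar_installed ~/norconex/norconex-collector-http-VERSION/lib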
Step 2: Configure Cloud Search
Create sdk-configuration.properties in the Norconex directory. The file must
specify these parameters:
| Setting | Parameter |
| Data source ID | api.sourceId = 1234567890abcdef. Required. The source ID from your Google Workspace administrator. |
| Service account | api.serviceAccountPrivateKeyFile = ./PrivateKey.json. Required. The service account key file. |
Example sdk-configuration.properties:
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
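Because both parameters are required, a small check before the first crawl can catch a missing entry early. The sketch below reads key=value lines from the properties file and reports the first required key that is absent. The helper names prop_value and validate_sdk_config are hypothetical, and the parsing is deliberately simplified: it handles "key=value" and "key = value" lines only, not the full Java properties syntax.

```shell
# Prints the value of a key from a properties file.
# If the key appears more than once, the last occurrence wins.
prop_value() {
  file=$1
  # Escape dots so "api.sourceId" is matched literally, not as a regex wildcard.
  key_re=$(printf '%s' "$2" | sed 's/\./\\./g')
  sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Prints "ok" when both required parameters have a non-empty value.
validate_sdk_config() {
  file=$1
  for key in api.sourceId api.serviceAccountPrivateKeyFile; do
    if [ -z "$(prop_value "$file" "$key")" ]; then
      echo "missing required parameter: $key"
      return 1
    fi
  done
  echo "ok"
}
```

For example: validate_sdk_config sdk-configuration.properties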
You can also include parameters like batch.* to control how the plugin pushes
data. See
Google-supplied connector parameters.
To populate metadata, configure these optional parameters:
| Setting | Parameter |
| Title | itemMetadata.title.field=movieTitle |
| Schema object type | itemMetadata.objectType=movie |
Step 3: Configure Norconex HTTP Collector
The plugin includes a sample file, minimum-config.xml.
Change to the Norconex directory and copy the sample:
cd ~/norconex/norconex-collector-http-VERSION/
cp examples/minimum/minimum-config.xml gcs-crawl-config.xml
Edit gcs-crawl-config.xml to add or replace <committer> and <tagger> nodes:
| Setting | Parameter |
| <committer> node | <committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter"> Required. Add this under the <httpcollector> node. |
| <uploadFormat> | <uploadFormat>raw</uploadFormat> Optional. raw or text. Default is raw. |
Example gcs-crawl-config.xml:
<committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter">
<configFilePath>/full/path/to/sdk-configuration.properties</configFilePath>
<uploadFormat>raw</uploadFormat>
</committer>
<importer>
<preParseHandlers>
<tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/>
</preParseHandlers>
</importer>
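A common mistake at this step is a <configFilePath> that points to a file that does not exist. The following sketch checks that the crawl configuration names the committer class and that the referenced properties file is present. The function name check_crawl_config is hypothetical, and the sketch assumes the <configFilePath> element sits on a single line, as in the example above. It does not validate the XML itself.

```shell
# Prints "ok" when the config names the Cloud Search committer class
# and its <configFilePath> points to an existing file.
check_crawl_config() {
  config=$1
  grep -qF 'com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter' "$config" \
    || { echo "committer class not configured"; return 1; }
  # Pull the text between the <configFilePath> tags; ":" is the sed delimiter
  # because the closing tag contains "/".
  sdk_path=$(sed -n 's:.*<configFilePath>\(.*\)</configFilePath>.*:\1:p' "$config" | head -n 1)
  [ -f "$sdk_path" ] || { echo "configFilePath does not exist: $sdk_path"; return 1; }
  echo "ok"
}
```

For example: check_crawl_config gcs-crawl-config.xml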
Step 4: Configure web crawl
Configure the <crawler> nodes for your needs, including:
- Start URLs
- Maximum crawl depth
- Number of threads
See the Norconex configuration page.
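To make the three settings concrete, the sketch below prints a <crawler> fragment for a given start URL, depth, and thread count. The element names (startURLs, url, maxDepth, numThreads) and the stayOnDomain attribute are assumptions based on the Norconex HTTP Collector 2.x configuration format and should be checked against the Norconex configuration page. The crawler id and the function name print_crawler_fragment are hypothetical.

```shell
# Prints a <crawler> fragment to stdout.
# Arguments: start URL, maximum crawl depth, number of threads.
print_crawler_fragment() {
  start_url=$1 max_depth=$2 num_threads=$3
  cat <<EOF
<crawler id="example-crawler">
  <startURLs stayOnDomain="true">
    <url>${start_url}</url>
  </startURLs>
  <maxDepth>${max_depth}</maxDepth>
  <numThreads>${num_threads}</numThreads>
</crawler>
EOF
}
```

For example, print_crawler_fragment https://www.example.com/ 2 4 prints a fragment that starts at one URL, follows links two levels deep, and crawls with four threads. Merge the values into the existing <crawler> node of gcs-crawl-config.xml rather than pasting the fragment as a second crawler.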
Step 5: Start a web crawl and content upload
Run the collector in local mode:
./collector-http[.bat|.sh] -a start -c gcs-crawl-config.xml
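For unattended runs, you might wrap the command above so that output goes to a log file and failures are reported. The following is a minimal sketch. The function name run_crawl and the log path are hypothetical, and the sketch assumes the start script returns a nonzero exit status on failure, which this guide does not state.

```shell
# Runs the collector start script with "-a start -c <config>" and
# captures all of its output in a log file.
# Arguments: path to the collector start script, config file, log file.
run_crawl() {
  collector=$1
  config=$2
  log=$3
  [ -x "$collector" ] || { echo "collector script not executable: $collector"; return 1; }
  [ -f "$config" ] || { echo "config not found: $config"; return 1; }
  if "$collector" -a start -c "$config" > "$log" 2>&1; then
    echo "crawl finished; log: $log"
  else
    status=$?
    echo "crawl exited with status $status; log: $log"
    return "$status"
  fi
}
```

For example, from the Norconex directory: run_crawl ./collector-http.sh gcs-crawl-config.xml crawl.log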
Monitor the crawler with JEF Monitor
Norconex JEF (Job Execution Framework) Monitor provides a graphical view of progress. See Monitor your crawler with JEF Monitor.