Google Search Appliance software version 6.0
Posted June 2009
- About this Document
- Supported Databases
- Overview of Database Crawling and Serving
- Configuring Database Crawling and Serving
- Starting Database Synchronization
- Troubleshooting
- Frequently Asked Questions
This document describes how the Google Search Appliance crawls databases and serves results from them. This document is intended for search appliance administrators who need to understand:
- How to configure a database crawl
- How to start a database crawl
- How to troubleshoot a database crawl
The following table lists the major sections in this document.
| Section | Description |
|---------|-------------|
| Supported Databases | The relational databases that the Google Search Appliance can crawl |
| Overview of Database Crawling and Serving | How the Google Search Appliance crawls a database, uploads database content for inclusion in the index, and serves and displays search results from a database |
| Configuring Database Crawling and Serving | How to configure a Google Search Appliance for database crawling and serving |
| Starting Database Synchronization | How to start the process of crawling a database and uploading content for inclusion in the search index |
| Troubleshooting | How to troubleshoot a database crawl |
| Frequently Asked Questions | Questions about database crawling and serving |
The Google Search Appliance provides access to data stored in relational databases by crawling the content directly from the database and serving the content. To access content in a database, the Google Search Appliance sends SQL (Structured Query Language) queries using JDBC (Java Database Connectivity) adapters provided by each database company.
The following table lists databases and JDBC adapter versions that the Google Search Appliance supports.
| Database | Supported JDBC adapter version |
|----------|--------------------------------|
| DB2® | IBM® DB2 Universal Database (UDB) 22.214.171.124 |
| MySQL® | MySQL Connector/J 3.1.13 |
| Oracle® | Oracle Database 10g Release 2, 10.1.0.2.0 driver |
| SQL Server™ | Microsoft® SQL Server™ 2005 JDBC™ Driver 1.0 |
| Sybase® | jConnect™ for JDBC™ 5.5 Build 25137 |
The Google Search Appliance has two built-in components for crawling and serving content from databases:
- TableCrawler--A custom connector used for pushing records from a database into the appliance’s index using feeds.
- TableServer--A component used for serving search results.
TableServer connects to the database when a serve query is defined and the user clicks on a search result from the database.
The following diagram provides an overview of the major database crawl and serve processes:
See the following sections in this document for descriptions of each major process. For an explanation of the symbols used in the diagram, refer to About the Diagrams in this Section.
The process of crawling a database is called synchronizing a database. Full database synchronizations are always manual and you must start one by using the Crawl and Index > Databases page. The Google Search Appliance does not currently support scheduled database synchronization.
After you start database synchronization, the TableCrawler and TableServer use JDBC adapters to connect to the specified database. They connect by using information that you provide when you configure database crawling and serving.
When you start database synchronization, the TableCrawler component of the Google Search Appliance performs the following steps:
- Connects to a relational database.
- Crawls the contents of the database.
- Pushes records from a database into the appliance’s index using feeds.
Specifically, the TableCrawler sends the database the SQL query that you entered in the Crawl Query field, using the JDBC database client libraries.
The results are wrapped in Feed XML (eXtensible Markup Language) syntax, and include a record for each row of the database crawl query results. This database feed file is presented to the Feeder system as soon as the crawl is complete.
To prevent the Google Search Appliance from deleting database content from the index when its PageRank™ is low and the index has reached its license limit, all database data is locked.
Once synchronization is complete, the database feed is automatically uploaded to the search appliance for inclusion in the index. The crawl query is used to produce a feed description. All feeds, including database feeds, share the same namespace. Database source names should not match existing feed names.
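The wrapping described above can be sketched in Python. This is an illustrative sketch only: the helper name and its inputs are assumptions, not the appliance's actual TableCrawler code, though the record structure mirrors the gsafeed example shown later in this document.

```python
import xml.sax.saxutils

def rows_to_feed(datasource, urls_and_content):
    # Illustrative sketch: wrap one <record> per crawled database row
    # in a full-feed gsafeed document. Not the appliance's real code.
    records = []
    for url, html in urls_and_content:
        records.append(
            '<record url=%s action="add" mimetype="text/html" lock="true">'
            '<content><![CDATA[%s]]></content></record>'
            % (xml.sax.saxutils.quoteattr(url), html))
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<gsafeed><header><datasource>%s</datasource>'
            '<feedtype>full</feedtype></header>'
            '<group>%s</group></gsafeed>'
            % (datasource, ''.join(records)))

feed = rows_to_feed(
    'DB_SOURCE_NAME',
    [('googledb://mydb.mycompany.com/DB_SOURCE_NAME/abc123',
      '<html><body>Row one content</body></html>')])
```

Note the `lock="true"` attribute, which corresponds to the locking behavior described above.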
Unique Generated URLs
The database synchronization process generates the URL attribute for each record. Note that the URL uses the IP address of the Google Search Appliance rather than its hostname.
Pages created from the database being indexed all have the form shown in the following example.
Therefore, you need to enter this pattern in Follow and Crawl Only URLs with the Following Patterns on the Crawl and Index > Crawl URLs page in the Admin Console. For more details see Setting the URL Patterns to Enable Database Crawl.
The Feeds connector generates a URL from either the Primary Key columns or from the URL column as specified in Serve URL field on the Crawl and Index > Databases page. A unique hash value is generated from the primary key to form part of the URL.
These generated URLs have the form shown in the following example.
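The appliance's actual hash algorithm is internal; the following Python sketch only illustrates the idea that the same primary key values always map to the same generated URL. The function name and the SHA-1/Base64 choice are assumptions for illustration, not the appliance's implementation.

```python
import base64
import hashlib

def generated_url(host, source, primary_key_values):
    # Illustrative only: derive a stable URL path segment from the
    # primary key values, so identical keys always yield the same URL.
    key = ','.join('%s=%s' % (i + 1, v)
                   for i, v in enumerate(primary_key_values))
    digest = base64.urlsafe_b64encode(
        hashlib.sha1(key.encode('utf-8')).digest()).decode('ascii').rstrip('=')
    return 'googledb://%s/%s/%s' % (host, source, digest)

u1 = generated_url('mydb.mycompany.com', 'DB_SOURCE_NAME', ['1', 'eng'])
u2 = generated_url('mydb.mycompany.com', 'DB_SOURCE_NAME', ['1', 'eng'])
```

Because the URL is a pure function of the primary key values, re-crawling the same row updates the existing indexed document rather than creating a duplicate.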
Once the search appliance index has been updated with the Feed XML file, the content is available for serving in approximately 30 minutes. The following sections describe how the Google Search Appliance performs the following actions:
Generating Search Results
The linked content from search results is generated by the serve query at serve time. If a user’s query returns results that were originally retrieved from a database query, each result links to content that is queried from the database at serve time. The associated snippets are generated from the index.
If the database has changed since the last database synchronization, the resulting page may not relate to the original user search.
If the database serve query has changed since the database was last synchronized, the search results may produce pages that do not match the user’s query.
Displaying Search Results
The TableServer displays search results for a query from the index. The search is made over the indexed content and shows URLs and snippet information from the crawl query results. The result links direct to URLs that are either:
- Generated on the search appliance (when Serve Query is used), or
- Obtained from the database (when Serve URL Field is used)
The TableServer is not used if Serve URL Field is selected. Serve URL Field indicates the column in the database that contains a URL to display for each row, when the user clicks on the result of a search.
When Serve URL Field is selected, the database stylesheet is only used to format the database data for indexing and for the snippets. It is not used by the referenced URL from Serve URL Field.
Before you can start database synchronization, you must configure database crawling and serving by performing the following tasks:
Providing Database Data Source Information
This information enables the crawler to access content stored in the database and to format search results. Database data source information includes the following items:
- Source name--Name of the data source
- Database type--Choose from IBM DB2, Oracle, MySQL, SQL Server, or Sybase
- Hostname--Name of the database server in fully-qualified domain name format (for example, db.mycompany.com) resolvable by the DNS server on the appliance (an IP address can also be used)
- Port--The port number on which the database listens and to which the JDBC adapter connects
- Database Name--The name given to the database
- Username--User name to access the database
- Password--Password for the database
- Crawl query--A SQL query for the database that returns all rows to be indexed
- Serve query--A SQL statement that returns the row from a table or joined tables that matches a search result
- Data Display/Usage--The stylesheet used to format the content of the Feed XML document
- Advanced Settings--Incremental crawl query, BLOB fields
You provide database data source information by using the Create New Database Source section on the Crawl and Index > Databases page in the Admin Console.
For complete information about the Create New Database Source section, click Help Center > Crawl and Index > Databases in the Admin Console.
An SQL crawl query must be in the form shown in the following example.
SELECT <table.column> [, <table.column>, ...] FROM <table> [, <table>, ...] [WHERE some_condition_or_join]
Each row result corresponds to a separate document. The information retrieved from the crawl query provides the data for the indexing.
An SQL serve query is used when a user clicks on a search result link, to retrieve and display the desired document data from the database. A serve query selects the particular row to display by using the ’?’ placeholder in the WHERE clause. The Primary Key Fields must provide the column names for the fields that are substituted for the ’?’. Primary Key Fields are column heading names, separated by commas, such as employee_id.
Crawl and Serve Query Examples
This section shows example crawl and serve queries for an employee database with these fields:
employee_id, first_name, last_name, email, dept
The following example shows the crawl query.
SELECT employee_id, first_name, last_name, email, dept FROM employee
The following example shows the serve query.
SELECT employee_id, first_name, last_name, email, dept
FROM employee
WHERE employee_id = ?
The Primary Key field for this case must be employee_id. The ’?’ signifies that this value is provided at serve time, from the search result that the user clicks.
For a table with a multiple-column Primary Key, if the combination of employee_id and dept is unique, you can use multiple bind variables. The crawl query for this example is the same as shown in this section. The following example shows the serve query.
SELECT employee_id, first_name, last_name, email, dept
FROM employee
WHERE employee_id = ? AND dept = ?
- SQL keywords are in uppercase by convention. Uppercase is not required.
- The ’?’ is substituted with a real column value to identify a particular record to be displayed when a user clicks on a database search result.
- The URL accessed by the user is partly generated from the Primary Keys; the database query is made based on the serve query and the substituted Primary Key values. The possible values for this column are obtained from those listed in the results from the crawl query.
- The column names specified in the WHERE clause must be included in the same order in the Primary Key Fields.
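The positional pairing described in these notes can be sketched as follows. The bind_serve_query helper is hypothetical, not an appliance API, and a real JDBC driver performs the substitution with proper escaping; the sketch only shows how each ’?’ pairs, in order, with a Primary Key Field.

```python
def bind_serve_query(serve_query, primary_key_fields, row):
    # Illustrative sketch: substitute each '?' with the value of the
    # corresponding Primary Key Field, in the order listed in the
    # Admin Console. Not how a JDBC driver actually binds parameters.
    values = [row[field.strip()] for field in primary_key_fields.split(',')]
    parts = serve_query.split('?')
    if len(parts) != len(values) + 1:
        raise ValueError('placeholder count does not match key fields')
    out = parts[0]
    for value, rest in zip(values, parts[1:]):
        out += repr(value) + rest
    return out

q = bind_serve_query(
    'SELECT * FROM employee WHERE employee_id = ? AND dept = ?',
    'employee_id, dept', {'employee_id': 7, 'dept': 'eng'})
```

If the placeholder count and the key field list disagree, the serve query cannot select a unique row, which is why the ordering rule above matters.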
The Feed XML and Stylesheet
You specify a stylesheet for formatting the content of the Feed XML document in the Data Display/Usage section of the Crawl and Index > Databases page in the Admin Console. This stylesheet defines the formatting used between each record.
You can use the default database stylesheet, or upload your own. To view the default database stylesheet, download it from this link: dbdefault.xsl. You can make changes to it, and then upload it by using the Upload stylesheet field on the Crawl and Index > Databases page.
Each row returned by the database is represented by a unique URL in the appliance index.
The following example shows internally stored output of a database sync, using the default database stylesheet. In this example, the database hostname is mydb.mycompany.com and the database source name is DB_SOURCE_NAME.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed SYSTEM "http://ent1:7800/gsafeed.dtd">
<gsafeed>
  <header>
    <datasource>DB_SOURCE_NAME</datasource>
    <feedtype>full</feedtype>
  </header>
  <group>
    <record url="googledb://mydb.mycompany.com/DB_SOURCE_NAME/azE9MSwwOTk4T0U3NTAwNisrKysrKysrKyZrMj0yLDA" action="add" mimetype="text/html" lock="true">
      <content><![CDATA[<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body style="font-family: arial, helvetica, sans-serif;">
<H2 align="center">Database Result</H2>
<table cellpadding="3" cellspacing="0" align="center">
<tr>
<th style="border-bottom: #ffcc00 1px solid;" align="left"></th>
<th style="border-bottom: #ffcc00 1px solid;" align="left"></th>
<th style="border-bottom: #ffcc00 1px solid;" align="left"></th>
</tr>
<tr valign="top" bgcolor="white">
<td><font size="-1">First column content data.</font></td>
<td><font size="-1">Second column content data.</font></td>
<td><font size="-1">Third column content data.</font></td>
</tr>
</table>
</body>
</html>]]></content>
    </record>
  </group>
</gsafeed>
The Advanced Settings section of the Database Datasource Information contains options for configuring an incremental crawl query and BLOB type and content. The following table describes the advanced settings.
|Incremental Crawl Query||Provides a means for the appliance to update the index of database data, without having to retrieve the entire contents of an unconstrained query.|
The Incremental Crawl Query requires a modified version of the Crawl Query: it must include a last_modified_time condition. One of the joined tables must have a modification-time column with a date data type. The time format used for modification times is YYYY-MM-DD HH:MM:SS, in GMT.
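As an illustration of the constraint described above, the following sketch appends a modification-time condition to a crawl query, using the YYYY-MM-DD HH:MM:SS GMT format noted above. The exact WHERE-clause syntax the appliance requires is not reproduced in this document, so treat the query shape here as an assumption.

```python
import datetime

def incremental_query(crawl_query, time_column, last_sync_utc):
    # Illustrative sketch: constrain the crawl query to rows modified
    # after the last successful sync. The WHERE-clause shape is an
    # assumption, not the appliance's required syntax.
    stamp = last_sync_utc.strftime('%Y-%m-%d %H:%M:%S')
    joiner = 'AND' if 'WHERE' in crawl_query.upper() else 'WHERE'
    return "%s %s %s > '%s'" % (crawl_query, joiner, time_column, stamp)

q = incremental_query('SELECT employee_id, last_modified FROM employee',
                      'last_modified',
                      datetime.datetime(2009, 6, 1, 12, 0, 0))
```

The point of the constraint is that only rows changed since the last sync are retrieved, instead of the entire result set of the unconstrained crawl query.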
Incremental feeds and full feeds allow for deletion and addition of data. These take the following form:
|Action Field||The Action column must specify either "add" or "delete".|
The database administrator should populate the ’action’ column using database triggers. The ’action’ column need not be part of the source table, but instead part of a separate logging table which is joined with the source table holding the content by means of primary keys. The database administrator should purge the logging information of all entries dated before the last successful incremental crawl.
|BLOB MIME Type Field||The name of the column that contains the standard Internet MIME type values of Binary Large Objects, such as text/plain.|
Database feeds do support content in BLOB columns. The MIME type information must be supplied as a column. BLOBs use Base64 binary encoding, and the XSL transformation from the specified stylesheet is not applied to BLOB data or to its associated row. BLOBs are automatically Base64-encoded when they are crawled by the TableCrawler. BLOB results display HTML snippets, but their links point to the original binary format (for example, MS Word or PDF). The cache link for the snippet provides an HTML representation of the binary data.
Multiple BLOBs in a single query are not supported. A CLOB can be treated as a BLOB column or as text.
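The Base64 handling described above can be sketched as follows. The record dictionary shape here is an assumption for illustration only, not the appliance's feed format; it just shows binary content being carried as Base64 alongside its MIME type column value.

```python
import base64

def encode_blob(blob_bytes, mime_type):
    # Illustrative sketch: carry binary column content as Base64,
    # paired with the MIME type supplied by the BLOB MIME Type column.
    encoded = base64.b64encode(blob_bytes).decode('ascii')
    return {'mimetype': mime_type,
            'encoding': 'base64binary',
            'content': encoded}

record = encode_blob(b'%PDF-1.4 minimal', 'application/pdf')
```

Base64 makes arbitrary binary data safe to embed in an XML feed, at the cost of roughly a one-third size increase.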
The search appliance usually transforms data from crawled pages, which protects against security vulnerabilities. If you cause the search appliance to crawl BLOB content by filling in these advanced settings, certain conditions could open a vulnerability. The vulnerability exists only if both of these conditions are true:
|BLOB Content Field||The name of the column that contains the BLOB content, of type described in the BLOB MIME Type Field.|
Setting URL Patterns to Enable Database Crawl
When you set up a database crawl you need to include entries in the Follow and Crawl Only URLs with the Following Patterns fields on the Crawl and Index > Crawl URLs page of the Admin Console.
To include all database feeds, use the following crawl pattern:
To include a specific database feed, use the following crawl pattern:
URLs and URL patterns are case sensitive. If you use uppercase for the database source name, you must use the same uppercase in the crawl start URLs and crawl patterns.
If your data source contains a URL column with URLs that point to your own website, add those URL patterns under Follow and Crawl Only URLs with the Following Patterns on the Crawl and Index > Crawl URLs page.
For complete information about the Crawl and Index > Crawl URLs page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.
For more information about URL patterns, see Constructing URL Patterns.
After you configure database crawling and serving, you can start synchronizing a database by using the Crawl and Index > Databases page in the Admin Console.
To synchronize a database:
- Click Crawl and Index > Databases.
- In the Current Databases section of the page, click the Sync link next to the database that you want to synchronize.
The database synchronization runs until it is complete.
After you click Sync, the link label changes to Sync’ing, which indicates that the database crawl is in progress. When the crawl completes, Sync’ing no longer appears. However, to see the updated status for a database synchronization, you must refresh the Crawl and Index > Databases page. To refresh the page, navigate to another page in the Admin Console and then navigate back to the Crawl and Index > Databases page.
You can synchronize several databases at the same time. The resulting feeds are also processed concurrently. For any given database, the search appliance must finish synchronization before it can begin to process the database feed.
You can also use the Current Databases section of the Crawl and Index > Databases page to edit database data source information for a database, delete a database, or view log information for a database.
For complete information about the Current Databases section, click Help Center > Crawl and Index > Databases in the Admin Console.
Monitoring a Feed
You can monitor the progress of a database feed by using the Crawl and Index > Feeds page in the Admin Console. This page shows all feed status, including the automatically-entered database feed. When a feed process successfully finishes, it is marked completed on the page.
For complete information about the Crawl and Index > Feeds page, click Help Center > Crawl and Index > Feeds in the Admin Console.
For more information about feeds, refer to the Feeds Protocol Developers Guide.
This section contains recommended actions for troubleshooting problems when database synchronization does not appear to result in a feed.
Verify the hostname and port
Verify that a database process is listening on the host and port that has been specified on the Crawl and Index > Databases page. For example, the default port for MySQL is 3306. Run the following command and verify that you get a connection. Be sure to test this from a computer on the same subnet as the appliance.
telnet <dbhostname> 3306
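If telnet is unavailable, an equivalent TCP reachability check can be scripted. This sketch is a generic connectivity test, not an appliance tool; the hostname and port in the example call are placeholders.

```python
import socket

def port_open(host, port, timeout=5.0):
    # Equivalent of the telnet check above: attempt a TCP connection
    # to the database host and port, returning True on success.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder host and port; substitute your database server's values.
print(port_open('localhost', 3306))
```

As with the telnet check, run this from a computer on the same subnet as the appliance, since firewall rules may differ by network segment.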
Verify the remote database connection
Check that you are able to connect to the database from a remote computer using the connection parameters. For example, run the following command line to connect to a remote MySQL database:
mysql -u<username> -p<password> -h<dbhostname>
In this example, the MySQL client must be available on the computer used.
Check the database logs
Look at the logs on the database server to see whether there was a successful connection.
Check for a database URL in the Follow and Crawl Patterns
Ensure that a suitable follow and crawl URL pattern for the database is entered on the Crawl and Index > Crawl URLs page in the Admin Console. If there is no suitable pattern:
- Enter the following pattern in Follow and Crawl Only URLs with the Following Patterns:
- Click the Save URLs to Crawl button to save your changes.
Check the Serve Query or Serve URL
On the Crawl and Index > Databases page, make sure there is a valid entry in either Serve Query or Serve URL Field.
Check the SQL query
Make sure the SQL queries are valid given primary key substitutions for the ’?’ value. Test the query by using the SQL client for the database being used.
Verify a known-working crawl query
Test a crawl query that is known to work and that produces a small number of results, and verify that it completes correctly.
Check database networking
Run tcpdump on the traffic between the appliance and the database server on the specified port while you do a sync. Compare the capture to a tcpdump from a successful connection.
Check the feed
If the sync has completed successfully, troubleshoot the feed. For information about troubleshooting feeds, refer to Feeds Protocol Developer Guide.
Q: Can multiple databases be synchronized at one time?
A: Yes. You can synchronize several databases at the same time; the resulting feeds are also processed concurrently.
Q: What would happen if a new database synchronization were started and completed while a previous one is still being processed by the feeder? Is the feeder restarted with the new feed or the new feed just queued up behind the one running?
A: Database feeds are processed in queued order. If a feed is being processed, a new feed of the same description is queued up behind the one running.
Q: When will the database be resynchronized?
A: A database is only synchronized fully when you click the Sync link on the Crawl and Index > Databases page in the Admin Console. Incremental synchronizations are a more efficient mechanism for updating the index for large query result sets which only change partially. Neither incremental nor full database syncs are scheduled automatically.
Q: Is there a way to schedule a database crawl?
A: There is currently no way to schedule a database crawl. It can be synchronized manually from the Crawl and Index > Databases page.
Q: Can a sync be stopped once it’s in progress?
A: There is currently no way to stop a database synchronization once it is started.
Q: How can I tell the status of the sync?
A: On the Crawl and Index > Databases page, the Sync link under Current Databases reads Sync’ing while the synchronization is in progress. Go to a different page in the Admin Console and return to the Crawl and Index > Databases page to refresh and update the status. After a successful database synchronization, a feed appears on the Crawl and Index > Feeds page.
Q: Sun defines several types of JDBC adapters. Which ones can be used?
A: The Google Search Appliance supports the Java to DB direct adapter types.
Q: What SQL is supported in the database configuration page? SQL99? ANSI SQL?
A: This is dependent on the JDBC adapters used to connect with your database.
Q: Can the list of JDBC adapters used be updated?
A: Currently the Google Search Appliance does not support modification of the JDBC adapter list.
Q: Is the database connection secure?
A: The Google Search Appliance does not provide an encrypted connection channel to the database unless the JDBC adapter supports one. It is recommended that this connection be hosted over a private network link and that JDBC communication be assumed to be insecure.
Q: Can the Google Search Appliance crawl the access information from the database?
A: The Google Search Appliance does not support evaluation of database ACLs.
Q: Can the Fully Qualified Domain Name be used for the crawl patterns?
A: No. Currently, the crawl patterns must specify the appliance using its IP address.
Q: Can the result of a database synchronization be translated by using a customized XSLT to reduce the number of documents counted?
A: The Google Search Appliance does not support translation of the database feed itself. Each row from the crawl query counts as a crawled document. The number of rows produced can only be reduced by optimizing the crawl query.
Q: Is there a DTD for the XML produced from the database for providing a custom stylesheet?
A: There is no DTD for the database XML, as the structure is dependent on the SELECT clause of the crawl query. An identity transformation stylesheet is available, which allows you to see the raw database XML structure, as it appears before it is transformed by a stylesheet.
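The canonical XSLT identity transformation looks like the following; the actual contents of identity_transform.xsl may differ in detail, but this standard pattern copies every element, attribute, and text node through unchanged, which is what exposes the raw database XML.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Match every attribute and node, and copy it verbatim,
       recursing into children so the whole tree passes through. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because nothing is reformatted, viewing the page source of a result (as described in the steps below) shows exactly the XML structure that your custom stylesheet will receive.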
To use the identity_transform.xsl and see the raw XML:
- Download identity_transform.xsl.
- Provide information about the database using the Create New Database Source section of the Crawl and Index > Databases page.
- Enter identity_transform.xsl in the Upload Stylesheet field.
- Make sure the Serve Query and Primary Key Fields are completed.
- Click Create Database Data Source.
- Click Sync for the database.
- When the database synchronization and feed are completed, perform a test search of the database content. Click a result that you know should be provided by the database. The displayed page should appear as unformatted HTML.
- To view the XML, view the source of the page in your web browser.
Q: How does the search appliance invalidate the database crawled contents of the index for a particular collection?
A: When you click the Sync link on the Crawl and Index > Databases page in the Admin Console for a database source name, the old contents in the index are invalidated.