Google Search Appliance software version 4.6
Google Mini software version 4.6
Posted July 2007
This document provides an overview of how the Google Search Appliance and the Google Mini crawl and index enterprise content.
For the Google Search Appliance, information about continuous crawl applies to software version 4.2, and information about full crawl and file system crawl applies to software version 4.6 and later.
For the Google Mini, all information applies to software version 4.4 and later.
Before the Google Search Appliance or Google Mini crawls your enterprise content, people in various roles may want to prepare the content to meet the objectives described in the following table.
| Objective | Role |
|---|---|
| Control access to a content server | Content server administrator, webmaster |
| Control access to a Web page | Search appliance administrator, webmaster, content owner, and/or content server administrator |
| Control indexing of parts of a Web page | |
| Control access to files and subdirectories | |
| Ensure that the search appliance can crawl a file system | |
The Google Search Appliance and Google Mini always obey the rules in robots.txt; it is not possible to override this behavior. However, a robots.txt file is not mandatory. When a robots.txt file is present, it is located in the Web server's root directory.
Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content.
If any hosts require authentication before serving robots.txt, you must configure authentication credentials using the Crawl and Index > Crawler Access page in the Admin Console.
In Google Search Appliance software versions 4.6.4.G.44 and later, the search appliance user agent (gsa-crawler) obeys an extension to the robots.txt standard called "Allow." Not all search engine crawlers recognize this extension, so check with any other search engines whose crawlers visit your content. The Allow directive works exactly like the Disallow directive: simply list a directory or page that you want to allow.
You may want to use Disallow and Allow together. For example, to block access to all pages in a subdirectory except one, use the following entries:
User-Agent: gsa-crawler
Disallow: /folder1/
Allow: /folder1/myfile.html
This blocks all pages inside the folder1 directory except for myfile.html.
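You can check the interplay of Allow and Disallow rules locally with Python's standard `urllib.robotparser` module. This is a generic robots.txt parser, not the appliance's own; it applies rules in file order, so the more specific Allow line is listed before the broader Disallow line:

```python
import urllib.robotparser

# Rules equivalent to the example above. Python's robotparser applies
# rules in the order they appear, so Allow is listed first.
rules = """\
User-agent: gsa-crawler
Allow: /folder1/myfile.html
Disallow: /folder1/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("gsa-crawler", "/folder1/myfile.html"))  # True
print(parser.can_fetch("gsa-crawler", "/folder1/other.html"))   # False
```

This confirms that only myfile.html inside folder1 is fetchable by the gsa-crawler user agent.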
To prevent the search appliance crawler (as well as other crawlers) from indexing or following links in a specific HTML document, embed a Robots META tag in the head of the document. The search appliance crawler obeys the noindex, nofollow, and noarchive META tags. Refer to the following table for details about Robots META tags, including examples.
| Tag | Description | Example |
|---|---|---|
| noindex | The search appliance crawler retrieves and archives the document in the search appliance cache, but does not index it. The document is counted as part of the license limit. | `<META NAME="robots" CONTENT="noindex">` |
| nofollow | The search appliance crawler retrieves and archives the document in the search appliance cache, but does not follow links on the Web page to other documents. The document is counted as part of the license limit. | `<META NAME="robots" CONTENT="nofollow">` |
| noarchive | The search appliance crawler retrieves and indexes the document, but does not archive it in its cache. The document is counted as part of the license limit. | `<META NAME="robots" CONTENT="noarchive">` |
You can combine any or all of the Robots META tags into a single META tag, for example:
<META NAME="robots" CONTENT="noarchive, nofollow">
Currently, it is not possible to set NAME="gsa-crawler" to limit these restrictions to the search appliance.
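A crawler that honors these tags must first extract the combined directives from the document head. The following is an illustrative sketch, using Python's standard `html.parser` module rather than the appliance's actual implementation, of how the comma-separated directives might be collected:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <META NAME="robots"> tags (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() == "robots":
                content = attrs.get("content", "")
                self.directives.update(
                    token.strip().lower() for token in content.split(",") if token.strip()
                )

parser = RobotsMetaParser()
parser.feed('<head><META NAME="robots" CONTENT="noarchive, nofollow"></head>')
print(parser.directives)  # {'noarchive', 'nofollow'} (set order may vary)
```

Note that the directives are case-insensitive and may be combined in a single CONTENT attribute, as described above.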
If the search appliance encounters a robots META tag when fetching a URL, it schedules a retry after a certain time interval. For URLs excluded by robots META tags, the maximum retry interval is one month.
There may be Web pages that you want to suppress from search results when users search on certain words or phrases. For example, if a Web page consists of the text "the user conference page will be completed as soon as Jim returns from medical leave," you might not want this page to appear in the results of a search on the terms "user conference."
You can prevent such text from being indexed by using googleon/googleoff tags. By embedding googleon/googleoff tags with their flags in HTML documents, you can disable the indexing of a portion of a Web page, the indexing of anchor text, or the use of text in snippets.
For details about each googleon/googleoff flag, refer to the following table.
| Flag | Description | Example | Result |
|---|---|---|---|
| index | Words between the tags are not indexed as occurring on the current page. Hyperlinks that appear between the tags are still followed. | `fish <!--googleoff: index-->shark<!--googleon: index--> mackerel` | The words fish and mackerel are indexed for this page, but the occurrence of shark is not indexed. This page could appear in search results for the term shark only if the word appears elsewhere on the page or in anchor text for links to the page. |
| anchor | Anchor text that appears between the tags in links to other pages is not indexed. This prevents the index from using the hyperlink to associate the link text with the target page in search results. | `<!--googleoff: anchor--><A href=sharks_rugby.html>shark</A><!--googleon: anchor-->` | The word shark is not associated with the page sharks_rugby.html. Otherwise this hyperlink would cause the page sharks_rugby.html to appear in the search results for the term shark. |
| snippet | Text between the tags is not used to create snippets for search results. | `<!--googleoff: snippet-->Come to the fair!<!--googleon: snippet-->` | The text Come to the fair! does not appear in snippets with the search results. |
| all | Applies all of the above restrictions. Text between the tags is not indexed, not followed to another linked-to page, and not used for a snippet. | `<!--googleoff: all-->Come to the fair!<!--googleon: all-->` | The text Come to the fair! is not indexed, is not associated with anchor text, and does not appear in snippets with the search results. |
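Conceptually, an indexer honoring these tags excludes everything between a googleoff comment and its matching googleon comment for the given flag. A minimal sketch of that exclusion for the index flag, using Python's standard `re` module (not the appliance's actual implementation):

```python
import re

def strip_googleoff(html, flag="index"):
    """Remove text between googleoff/googleon comments for one flag (sketch)."""
    pattern = re.compile(
        r"<!--googleoff:\s*{0}-->.*?<!--googleon:\s*{0}-->".format(flag),
        re.DOTALL,  # spans may cross line breaks
    )
    return pattern.sub("", html)

page = "fish <!--googleoff: index-->shark<!--googleon: index--> mackerel"
print(strip_googleoff(page))  # "fish  mackerel"
```

After stripping, only fish and mackerel remain as indexable words, matching the behavior described for the index flag.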
If a URL appears within googleoff and googleon tags, the search appliance crawls the URL.
You can prevent the search appliance from crawling files and directories by placing them in a directory named "no_crawl"; the search appliance does not crawl any directory with that name. This method blocks the search appliance from crawling everything in the no_crawl directory, but it does not provide directory security or block people from accessing the directory.
End users can also use no_crawl directories on their local computers to prevent personal files and directories from being crawled.
In a Windows network file system, folders and drives can be shared. A shared folder or drive is available for any person, device, or process on the network to use. To enable the search appliance to crawl your file system, do the following:
The search appliance crawls content by following newly discovered links in pages that it crawls. If your enterprise content includes unlinked URLs that are not listed in the follow and crawl patterns, the search appliance crawler will not find them on its own. In addition to adding unlinked URLs to follow and crawl patterns, you can force unlinked URLs into a crawl using one or both of the following types of pages:
Both of these types of pages allow users or crawlers to navigate all the pages within a Web site. To include a jump page or site map in the crawl, add the URL for the page or map to the crawl path.
Before starting a crawl, you must configure the crawl path so that it only includes information that your organization wants to make available in search results. To configure the crawl, use the Crawl and Index > Crawl URLs page in the Admin Console to enter URLs and URL patterns in the following boxes:
Note: URLs are case-sensitive.
For complete information about the Crawl URLs page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.
Start URLs control where the search appliance begins crawling your content. The search appliance should be able to reach all content that you want to include in a particular crawl by following the links from one or more of the start URLs. Start URLs are required.
Start URLs must be fully qualified URLs in the following format:
The information in the curly brackets is optional.
Typically, start URLs include your company's home site, as shown in the following example:
Enter start URLs in the Start Crawling from the Following URLs section on the Crawl and Index > Crawl URLs page in the Admin Console. To crawl content from multiple websites, add a start URL for each one.
Follow and crawl URL patterns control which URLs are crawled and included in the index. Before crawling any URLs, the search appliance checks them against follow and crawl URL patterns. Only URLs that match these URL patterns are crawled and indexed. You must include all start URLs in follow and crawl URL patterns.
The following example shows a follow and crawl URL pattern:
Given this follow and crawl URL pattern, the search appliance crawls the following URLs because each one matches it:
However, the search appliance does not crawl the following URL because it does not match the follow and crawl pattern:
The following table provides examples of how to use follow and crawl URL patterns to match sites, directories, and specific URLs.
| To Match | Expression Format | Example |
|---|---|---|
| URLs from all sites in the same domain | `<domain>/` | mycompany.com/ |
| URLs that are in a specific directory or in one of its subdirectories | `<site>/<directory>/` | sales.mycompany.com/products/ |
| A specific file | `<site>/<directory>/<file>` | www.mycompany.com/products/index.html |
For more information about writing URL patterns, see Constructing URL Patterns.
Enter follow and crawl URL patterns in the Follow and Crawl Only URLs with the Following Patterns section on the Crawl and Index > Crawl URLs page in the Admin Console.
Do not crawl URL patterns exclude URLs from being crawled and included in the index. If a URL contains a do not crawl pattern, the search appliance does not crawl it. Do not crawl patterns are optional.
Enter do not crawl URL patterns in the Do Not Crawl URLs with the Following Patterns section on the Crawl and Index > Crawl URLs page in the Admin Console.
To prevent specific file types, directories, or other sets of pages from being crawled, enter the appropriate URLs in this section. Using this section, you can:
For your convenience, this section is prepopulated with many URL patterns and file types, some of which you may not want the search appliance to index. To make a pattern or file type unavailable to the search appliance crawler, remove the # (comment) mark in the line containing the file type. For example, to make Excel files on your servers unavailable to the crawler, change the line
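The combined effect of the two pattern lists can be modeled simply: a URL is crawled only if it matches a follow and crawl pattern and does not contain a do not crawl pattern. The sketch below assumes plain patterns behave as substring matches, which is a simplification; the appliance's full URL pattern syntax is richer. The example patterns are illustrative:

```python
def crawl_allowed(url, follow_patterns, do_not_crawl_patterns):
    """Simplified model: plain patterns match as substrings; do-not-crawl wins.
    Matching is case-sensitive, as URLs are on the search appliance."""
    if any(p in url for p in do_not_crawl_patterns):
        return False
    return any(p in url for p in follow_patterns)

follow = ["mycompany.com/"]
do_not = ["mycompany.com/private/"]

print(crawl_allowed("http://sales.mycompany.com/products/index.html", follow, do_not))  # True
print(crawl_allowed("http://sales.mycompany.com/private/plan.html", follow, do_not))    # False
```

The Pattern Tester Utility described below performs this kind of check against the patterns you have actually configured.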
To confirm that URLs can be crawled, you can use the Pattern Tester Utility page. This page shows which URLs are matched by the follow and crawl patterns and the do not crawl patterns that you have entered.
To use the Pattern Tester Utility page, click Test these patterns on the Crawl and Index > Crawl URLs page. For complete information about the Pattern Tester Utility page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.
As when crawling HTTP or HTTPS Web content, the search appliance uses URLs to refer to the individual objects that are available on SMB-based file systems, including files, directories, shares, and hosts.
Use the following format for an SMB URL:

smb://string1/string2/path

When the crawler sees a URL in this format, it treats string1 as the hostname and string2 as the share name, with the remainder as the path within the share. Do not enter a workgroup in an SMB URL.
The following example shows a valid SMB URL for crawl:
The following table describes all of the required parts of a URL that are used to identify an SMB-based document.
| Part | Description | Example |
|---|---|---|
| Protocol | Indicates the network protocol that is used to access the object. | smb:// |
| Hostname | Specifies the DNS hostname or WINS name of the SMB server. A hostname can be a fully qualified domain name, an unqualified hostname, or an IP address. | fileserver.mycompany.com, fileserver, or 10.0.0.100 |
| Share name | Specifies the name of the share to use. A share is tied to a particular host, so two shares with the same name on different hosts do not necessarily contain the same content. | myshare |
| File path | Specifies the path to the document, relative to the share's root. | If myshare on myhost.mycompany.com shares all the documents under the C:\myshare directory, the file C:\myshare\mydir\mydoc.txt is retrieved by smb://myhost.mycompany.com/myshare/mydir/mydoc.txt |
| Forward slash | SMB URLs use forward slashes only. Some environments, such as Microsoft Windows systems, use backslashes ("\") to separate file path components. Even when you refer to documents in such an environment, use forward slashes. | Microsoft Windows style: C:\myshare\ becomes SMB URL: smb://myhost.mycompany.com/myshare/ |
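The hostname/share/path interpretation described above can be sketched with Python's standard `urllib.parse` module. This is an illustrative decomposition, not the appliance's own parser:

```python
from urllib.parse import urlparse

def split_smb_url(url):
    """Split an SMB URL into (hostname, share, path-within-share)."""
    parts = urlparse(url)
    if parts.scheme != "smb":
        raise ValueError("not an SMB URL: " + url)
    # First path segment is the share name; the rest is the path within it.
    segments = parts.path.lstrip("/").split("/", 1)
    share = segments[0]
    path = segments[1] if len(segments) > 1 else ""
    return parts.netloc, share, path

print(split_smb_url("smb://myhost.mycompany.com/myshare/mydir/mydoc.txt"))
# ('myhost.mycompany.com', 'myshare', 'mydir/mydoc.txt')
```

Note the function expects forward slashes throughout, consistent with the rule in the table above.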
Some SMB file share implementations allow:
The file system crawler does not support these URL schemes.
SMB URLs can refer to objects other than files, including directories, shares, and hosts. The file system gateway, which interacts with the network file shares, treats these non-document objects like documents that do not have any content, but do have links to certain other objects. The following table describes the correspondence between objects that the URLs can refer to and what they actually link to.
| URL Refers To | URL Links To | Example |
|---|---|---|
| Directory | Files and subdirectories contained within the directory | smb://fileserver.mycompany.com/myshare/mydir/ |
| Share | Files and subdirectories contained within the share's top-level directory | smb://fileserver.mycompany.com/myshare/ |
| Host | Each share on the host. See also "Share name" in the previous table. | smb://fileserver.mycompany.com/ |
Hostname resolution is the process of associating a symbolic hostname with a numeric address that is used for network routing. For example, the symbolic hostname www.google.com resolves to the numeric address 10.0.0.100.
File system crawling supports two methods of resolving hostnames: DNS and WINS.
During setup, the search appliance requires that at least one DNS server be specified. If a WINS server is available, you may specify it using the Administration > Network Settings page in the Admin Console.
If both DNS and WINS are configured, the file system gateway first attempts to resolve hostnames used in SMB file shares using WINS. If the hostname is not resolvable (or a WINS server is not configured), the appliance attempts to use DNS to look up the hostname.
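The WINS-first, DNS-fallback behavior can be sketched as follows. Here a simple in-memory dictionary stands in for a WINS server query (a hypothetical stand-in, since WINS lookup is not in the Python standard library), and DNS resolution uses the stdlib `socket` module:

```python
import socket

def resolve_hostname(name, wins_table=None):
    """Resolve a hostname WINS-first, then fall back to DNS.
    wins_table is a hypothetical stand-in for a WINS server query."""
    if wins_table and name in wins_table:
        return wins_table[name]
    # WINS did not resolve the name (or no WINS server configured): use DNS.
    return socket.gethostbyname(name)

print(resolve_hostname("fileserver", wins_table={"fileserver": "10.0.0.100"}))  # 10.0.0.100
print(resolve_hostname("localhost"))  # typically 127.0.0.1
```

This mirrors the order described above: WINS is consulted first for SMB file share hostnames, and DNS is used when WINS cannot resolve the name.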
WINS is not used to resolve hostnames for content other than file shares. Do not specify a WINS server if the search appliance will not crawl an SMB file share or if your network does not have a WINS server.
The information in this document describes crawling public content.
For dates to be properly indexed and searchable by date range, they must be in ISO 8601 format:

YYYY-MM-DD

The following example shows a date in ISO 8601 format:

2007-07-11
For a date in a META tag to be indexed, not only must it be in ISO 8601 format, it must also be the only value in the content. For example, the date in the following META tag can be indexed:
<META name="date" content="2007-07-11">
The date in the following META tag cannot be indexed because there is additional content:
<META name="date" content="2007-07-11 is a date">
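The "only value in the content" rule can be checked with a simple validation: the META content must be exactly one bare ISO 8601 date. A sketch of that check, not the appliance's own validator:

```python
import re

# A bare ISO 8601 calendar date: four-digit year, two-digit month and day.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def meta_date_indexable(content):
    """True if a META date value is a bare ISO 8601 date and nothing else."""
    return bool(ISO_DATE.match(content.strip()))

print(meta_date_indexable("2007-07-11"))            # True
print(meta_date_indexable("2007-07-11 is a date"))  # False
```

The second value fails because the content contains text beyond the date itself, matching the rule stated above.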
Documents can have dates explicitly stated in these places:
To define a rule that the search appliance crawler should use to locate document dates in documents for a particular URL, use the Crawl and Index > Document Dates page in the Admin Console. If you define more than one document date rule for a URL, the search appliance applies the rules in the order in which you enter them.
To configure document dates:
For complete information about the Document Dates page, click Help Center > Crawl and Index > Document Dates in the Admin Console.