Custom Search

Annotations: Defining Sites to Search

This page describes how to define the coverage of your search engine using a TSV file or XML annotations file.

Overview

Adding sites one at a time in the Custom Search Control Panel is tedious if you're building a large search engine, and managing a large collection of sites there is just as cumbersome. Instead, you can add and manage many sites by listing them in an annotations file and uploading it. Annotations files, particularly XML ones, also give you far greater control over the ranking of search results.

An annotations file is simply a list of annotations. Each annotation has two components: the site and its associated labels. The label tells Custom Search how to handle a site; that is, whether a site should be included, excluded, promoted, or demoted. In the context file, you define labels; in the annotations file, you tag sites with the appropriate labels.

Annotations files can be in any of the following formats: OPML, TSV, or Custom Search XML.

When you start editing your annotations file, start out with a small number of annotations, and then test some search queries in the Preview tab of the Control Panel. It's easier to test and troubleshoot your search engine with a handful of annotations. When you get the results that you expect, incrementally add more annotations.

You can either upload the annotations file to the Control Panel or host it in your own website. For details about file limits, see the Annotations Limits section.

Choosing the Right Format

Before you start creating annotations, determine which file format best suits your needs. If your search engine increases in complexity, you can consider using multiple annotations files, even files of different formats. For example, you can upload OPML annotations files generated by other sites and XML annotations files you created. Custom Search combines all the annotations files in all your search engines into a single XML annotations file.

Use the following table to pick the appropriate format:

To create: A search engine with an existing OPML file (feed-based search engine)

  • Use: OPML format
  • Because: You do not need to recreate annotations if you already have OPML files with URL patterns. You can upload the existing file directly to the Control Panel.
  • Limitations: You cannot directly fine-tune the ranking of search results.
  • More information: Using the OPML format

To create: A search engine that does not need all the advanced features

  • Use: TSV format
  • Because: You can create and manage the annotations in a more readable format. You can use a spreadsheet editor. You can take advantage of many advanced features, such as applying labels, associating scores, and adding comments. You can create your own attributes. However, they are mostly for your own use; Custom Search does not do anything with them.
  • Limitations: You cannot refer to another Custom Search file, and this is not the best option for programmatically created search engines.
  • More information: Using the TSV format

To create: A complex and heavily customized search engine

  • Use: Custom Search XML format
  • Because: It's the most powerful format. It is appropriate for developers who want to create advanced search engines with bells and whistles. It gives you more flexibility and greater control over the ranking of your search results. If you programmatically generate custom search engines or if you use third-party tools to generate custom search engines, and you want to host the specifications in your own website, you have to use this format.
  • Limitations: It is the most complex format.
  • More information: Using the XML format

Using the OPML Format

OPML is a type of XML format that was originally developed for defining ordered lists of elements or outlines, but it is now also commonly used for web feeds. For details, see the OPML specification.

If you have OPML files from feed aggregators, you can upload them without typing each site by hand. Custom Search grabs the value of the OPML attribute htmlUrl and adds it to the list of sites to search. You can upload multiple OPML files for each of your search engines.

Here's an example of an OPML file:

<opml version="1.0">
  <head>
    <title>Bicycles</title>
    <dateCreated>Fri Mar 14 23:21:11 PDT 2008</dateCreated>
    <dateModified>Fri Mar 14 23:21:11 PDT 2008</dateModified>
  </head>
  <body>
    <outline type="rss" text="Road Bikes" xmlUrl="http://www.google.com/exampleurl.opml" 
         htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes" xmlUrl="http://www.google.com/exampleurl2.opml"
         htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>

When you upload an OPML file in the Control Panel, Custom Search automatically converts OPML to Custom Search XML. It adds search engine labels (<Label name="_cse_example"/>) and scores (score="1"). For more information about scores, see the Ranking Search Results page.

The following is an example of an OPML file that has been converted to Custom Search XML:

<GoogleCustomizations>
   <Annotations>
     <Annotation about="www.google.com/sampleurl1.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
     <Annotation about="www.google.com/sampleurl2.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
   </Annotations>
</GoogleCustomizations>
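
If you want to preview what such a conversion might produce before uploading, the following Python sketch approximates it. It is not the converter that Custom Search runs: the function name opml_to_annotations is hypothetical, and stripping the URL scheme is an assumption made only so the output resembles the converted example.

```python
import xml.etree.ElementTree as ET

def opml_to_annotations(opml_text, label, score="1"):
    """Build a GoogleCustomizations tree from the htmlUrl of each OPML outline."""
    root = ET.Element("GoogleCustomizations")
    annotations = ET.SubElement(root, "Annotations")
    for outline in ET.fromstring(opml_text).iter("outline"):
        html_url = outline.get("htmlUrl")
        if not html_url:
            continue  # outlines without htmlUrl contribute no site
        # Assumption: drop the scheme so the pattern resembles the converted example.
        pattern = html_url.split("://", 1)[-1]
        annotation = ET.SubElement(annotations, "Annotation", about=pattern, score=score)
        ET.SubElement(annotation, "Label", name=label)
    return root

OPML = """<opml version="1.0">
  <body>
    <outline type="rss" text="Road Bikes" htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes" htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>"""

print(ET.tostring(opml_to_annotations(OPML, "_cse_example"), encoding="unicode"))
```

Replace _cse_example with the search engine label from your own context file.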

Using the TSV Format

If you don't plan to host files in your own website and you don't have OPML files, you can create annotations using a text file with tab-separated values (TSV).

You can use a plain text editor or a spreadsheet editor to create the file. It does not matter what you name the file, so long as you save it with the file extension .tsv (for example, cse_bicycles.tsv). If you are using a plain text editor, separate each element by a single tab character. Do not try to prettify and align the lines with multiple tab characters. If you are using a spreadsheet editor, allocate a column for each of the fields.

Each line of text in your TSV file can list a site and its associated labels.

Elements of a Custom Search TSV

Your TSV files must begin with a heading that enumerates the fields that you will be using in the subsequent annotation lines. The headings are case-sensitive, so follow the capitalization in this guide. The order of the heading elements doesn't really matter, but the annotation lines that follow the heading must follow the order of the headings. When you create the headings, you are essentially creating columns of data, so you can't just plug the annotation data any which way.

A heading has the following fields:

  • URL - The URL pattern of the site.

  • Label - The search engine label or refinement label that should be applied to the site. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find at least two search engine or background labels: one for adding sites to your custom search engine and one for excluding sites from it. If you have not changed the search engine label, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
  • Comment - Optional. Notes about each annotation.
  • Score - Optional. Discussed in detail in the Ranking Search Results page.
  • Custom Field - Optional. Your own attributes. To create an attribute, just prefix it with "A=". For example, to create a date attribute, use "A=Date". Custom Search does not process these fields.

Each subsequent line corresponds to an annotation. It provides the values for the fields that were defined in the headings.
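
The heading rules above can be checked before you upload a file. The following Python sketch reads TSV text, verifies that the case-sensitive URL and Label fields are present, and maps each annotation line onto the heading. The function name read_annotations and the www.example.com pattern are hypothetical, and padding short lines with empty values is an assumption about how missing optional fields should be treated.

```python
import csv
import io

REQUIRED_FIELDS = ("URL", "Label")  # case-sensitive, as the guide requires

def read_annotations(tsv_text):
    """Parse TSV annotations into dicts keyed by the heading fields."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    heading = next(reader)
    missing = [field for field in REQUIRED_FIELDS if field not in heading]
    if missing:
        raise ValueError(f"heading is missing required fields: {missing}")
    rows = []
    for values in reader:
        if not values:
            continue  # skip blank lines
        # Assumption: pad short lines so trailing optional fields default to "".
        values = values + [""] * (len(heading) - len(values))
        row = dict(zip(heading, values))
        if not row["URL"] or not row["Label"]:
            raise ValueError(f"annotation needs both URL and Label: {values}")
        rows.append(row)
    return rows

SAMPLE = "URL\tLabel\tComment\nwww.example.com/*\t_cse_xxxxxxxxxxx\tdemo site\n"
print(read_annotations(SAMPLE))
```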

TSV Example

Let's look at an example of a basic TSV file.

URL        Label
www.webmd.com/hw/*     _cse_Ansi-stoubiq
www.webmd.com/hw/cancer/*     _cse_exclude_Ansi-stoubiq

The example has a heading with the two required fields: URL and Label. The two annotation lines supply the values for the fields. The label in the first annotation line, _cse_Ansi-stoubiq, adds the site, www.webmd.com/hw/*, to the search engine. The other label, _cse_exclude_Ansi-stoubiq, excludes the site, www.webmd.com/hw/cancer/*, from the search engine.

You can add more fields to your TSV annotations. Here's an example that includes a Comment field and a custom field, A=Date.

URL     Label     Comment     A=Date
www.cancer.gov/cancertopics/types/liver/*     _cse_Ansi-stoubiq     government site     20060504
www.medicinenet.com/liver_cancer/*     _cse_Ansi-stoubiq     site on symptoms     20060504
www.webmd.com/hw/cancer/*     _cse_Ansi-stoubiq     great site for patients!     20060504
www.oncologychannel.com/*/treatment     _cse_Ansi-stoubiq     20060504

Even though you added new fields in the header, you are not obligated to supply values for all of them, which is why it's fine for the last line to not have a comment. But that's not the case for URL and Label, which are required fields.
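
If you generate TSV files from a script instead of a spreadsheet, a sketch along these lines keeps the columns aligned. It uses Python's csv module with a tab delimiter; the to_tsv helper is hypothetical, and writing an empty field for a missing Comment (so that the A=Date value stays in its own column) is an assumption rather than a documented requirement.

```python
import csv
import io

FIELDS = ["URL", "Label", "Comment", "A=Date"]

annotations = [
    {"URL": "www.cancer.gov/cancertopics/types/liver/*", "Label": "_cse_Ansi-stoubiq",
     "Comment": "government site", "A=Date": "20060504"},
    # No Comment here: DictWriter writes an empty field in its place.
    {"URL": "www.oncologychannel.com/*/treatment", "Label": "_cse_Ansi-stoubiq",
     "A=Date": "20060504"},
]

def to_tsv(rows, fields=FIELDS):
    """Serialize annotation dicts as tab-separated text with a heading line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(to_tsv(annotations))
```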

Using the Custom Search XML Format

If you want to take advantage of all the features available in the Custom Search API, XML is the way to go. You can create XML annotations files in three ways. The following table describes the different strategies for the XML format. It's just a matter of preference, so you should not worry too much about picking the right way to annotate your search engine. If you change your mind, you can reorganize your annotations by cutting and pasting.

To do this: Keep the annotations for each search engine separate

  • Use: An external annotations file for each search engine
  • Because: Custom Search merges all annotations into a single annotations file, but you can create and upload them separately. Each file pertains to a search engine.
  • Limitations: If there's overlap between search engines, you might end up managing the same sites in multiple places.
  • More information: XML Annotations

To do this: Pool all annotations across all your search engines in a single place

  • Use: One external annotations file shared by all search engines
  • Because: Having all annotations in a single file lets you manage annotations across all search engines. A communal annotations file enables you to list sites only once, yet have the flexibility to change inclusion, exclusion, and ranking of the same sites for various search engines. For example, one of your search engines could restrict its search to five sites, another could eliminate those sites, and yet another could promote those sites.
  • Limitations: If you have a lot of annotations, it could be hard to manage the file. You always have to verify that you are changing the annotations for the right search engine.
  • More information: XML Annotations

To do this: Host the files in your website and keep both the context and annotation data of the search engine in a single file

  • Use: Context files with inline annotations
  • Because: A single file is easier to manage than a search engine that has a context file and an external annotations file. Just create the annotations section right after the context specification.
  • Limitations: Use this format only if you are hosting the file on your website. If you have multiple search engines that are fairly similar, you might end up managing the same sites in multiple places.
  • More information: Inline Annotations

When you upload your files in the Control Panel, Custom Search merges all your annotations into a single annotations file that is shared by all your search engines. This is the annotations file you download from the Control Panel. You can distinguish the annotations by their search engine labels (the value of the name attribute in the Label element).

<Annotation about="http://www.solarenergy.org/*">
   <Label name="_cse_abcdefghijk"/>
</Annotation>

If you prefer to keep the annotations for each search engine separate, you should maintain the original annotations files and upload them to the Control Panel when you make changes. To keep things simple, stick with using the XML format. Do not alternate between using the XML format and the Sites tab in the Control Panel to include or exclude sites, because changes made to the Sites tab are appended to the communal annotations file and you'll have to copy these new annotations to your copy of the annotations file.

XML Annotations

The following is an example of XML annotations. It is roughly the XML version of the TSV example in the previous section. It includes the same elements, except for custom attributes, which are available only in the TSV format. This annotations file tells Custom Search to include everything under www.cancer.gov/cancertopics/types/liver/* but exclude the pages matched by the other three URL patterns.

<Annotations>
  <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
    <Label name="_cse_Ansi-stoubiq"/>
    <Comment>government site</Comment>
  </Annotation>
  <Annotation about="www.medicinenet.com/liver_cancer/">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>site on symptoms</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/cancer/*">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.oncologychannel.com/*/treatment">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
  </Annotation>
</Annotations>

The annotations file has four elements in the following hierarchy:

  • Annotations (root element)
    • Annotation
      • Label
      • Comment (optional)

To programmatically upload annotations using HTTP methods, you must use additional elements that tell the Custom Search API what to do with the annotations, such as whether they should be added or removed. For more information, see Programmatically Creating Custom Search Engines.

Creating External Annotations

To list sites you want your search engine to cover, do the following:

  1. Start the file with the <Annotations></Annotations> root element.
  2. Create an annotation by adding the <Annotation></Annotation> tags, and then define the about attribute with the URL pattern of the site.
    <Annotations>
      <Annotation about="www.webmd.com/hw/cancer/*">
      </Annotation>
    </Annotations>

  3. Associate the site with the search engine by using the <Label name=" "/> tag, and specify how that site should be treated by the search engine. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find two labels: one for adding sites to your custom search engine and one for excluding sites from it. If you have not changed the name of the search engine label in the context file, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
    <Annotations>
      <Annotation about="http://www.solarenergy.org/*">
        <Label name="_cse_abcdefghijk"/>
      </Annotation>
    </Annotations>

    A single site can have multiple labels associated with it, and it can be treated differently by different search engines. For example, the site http://www.solarenergy.org/ can be included in your solar energy search engine and excluded from your bike search engine. The same site will be ranked differently in the result pages of different search engines.

    If you have changed the name of the label in the context file, remember to update the Label name values in your annotations files.

  4. To add more sites, create and define another Annotation element.
  5. Save the XML file.
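
The steps above can also be scripted. The following Python sketch builds the same structure with the standard xml.etree.ElementTree module. The build_annotations helper and the _cse_exclude_abcdefghijk label are hypothetical placeholders; substitute the labels from your own context file.

```python
import xml.etree.ElementTree as ET

def build_annotations(sites):
    """sites maps a URL pattern to the list of label names to attach to it."""
    root = ET.Element("Annotations")  # step 1: root element
    for pattern, labels in sites.items():
        # Steps 2 and 4: one Annotation element per URL pattern.
        annotation = ET.SubElement(root, "Annotation", about=pattern)
        for name in labels:
            # Step 3: associate the site with a search engine label.
            ET.SubElement(annotation, "Label", name=name)
    return root

root = build_annotations({
    "http://www.solarenergy.org/*": ["_cse_abcdefghijk"],
    "www.webmd.com/hw/cancer/*": ["_cse_exclude_abcdefghijk"],
})
print(ET.tostring(root, encoding="unicode"))
# Step 5: to save the file, call ET.ElementTree(root).write(path, encoding="utf-8").
```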

Inline Annotations

An inline annotation is just like an external annotation, except that it is embedded inside the context file. In essence, you are creating a Custom Search file with two sections: the CustomSearchEngine section, which houses the context or search engine specification, and the Annotations section, which houses the annotations or sites information. You can use files of this format only if you are hosting them from your website. You will not be able to upload this file in the Control Panel.

When you combine the context and annotations in one file, you have to start with the GoogleCustomizations root element. The file has the following structure:

  • GoogleCustomizations (root element)
    • CustomSearchEngine
      • Title
      • Description
      • Context
        • BackgroundLabels
          • Label
      • LookAndFeel
    • Annotations
      • Annotation
        • Label
        • Comment (optional)

Here's an example of inline annotations.

<GoogleCustomizations>
  <CustomSearchEngine>
   <!--For brevity, other elements have been excluded....--> 
   
    <Context>
      <BackgroundLabels>
        <Label name="_cse_solar_example" mode="FILTER"/>
        <Label name="_cse_exclude_solar_example" mode="ELIMINATE"/>
      </BackgroundLabels>
    </Context>
  </CustomSearchEngine>
  <Annotations>  
    <!--Include this site in the search results--> 
    <Annotation about="http://www.solarenergy.org/*">
      <Label name="_cse_solar_example"/>
    </Annotation>
    <!--Include this site in the search results-->
    <Annotation about="http://www.solarfacts.net/*">
      <Label name="_cse_solar_example"/>
    </Annotation>
    <!--Exclude this site from the search results--> 
    <Annotation about="http://en.wikipedia.org/wiki/*">
      <Label name="_cse_exclude_solar_example"/>
    </Annotation>
   </Annotations>
</GoogleCustomizations>
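
Because the labels in the Annotations section must match the labels in the context, a mistyped label is an easy mistake to make in a combined file. The following Python sketch lists annotation labels that do not appear under BackgroundLabels. The undeclared_labels helper is hypothetical, and it ignores any labels your context might declare elsewhere (refinement labels, for example), so treat its output as a hint rather than a verdict.

```python
import xml.etree.ElementTree as ET

def undeclared_labels(xml_text):
    """Return annotation label names that are not listed in BackgroundLabels."""
    root = ET.fromstring(xml_text)
    declared = {label.get("name") for label in
                root.findall("./CustomSearchEngine/Context/BackgroundLabels/Label")}
    used = {label.get("name") for label in
            root.findall("./Annotations/Annotation/Label")}
    return sorted(used - declared)

SAMPLE = """<GoogleCustomizations>
  <CustomSearchEngine>
    <Context>
      <BackgroundLabels>
        <Label name="_cse_solar_example" mode="FILTER"/>
        <Label name="_cse_exclude_solar_example" mode="ELIMINATE"/>
      </BackgroundLabels>
    </Context>
  </CustomSearchEngine>
  <Annotations>
    <Annotation about="http://www.solarenergy.org/*">
      <Label name="_cse_solar_example"/>
    </Annotation>
    <Annotation about="http://www.solarfacts.net/*">
      <Label name="_cse_solar_exmaple"/>
    </Annotation>
  </Annotations>
</GoogleCustomizations>"""

print(undeclared_labels(SAMPLE))
```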

Improving Search Coverage

Custom Search is built on top of the Google index. This means that webpages that are in the Google index are available to your search engine; conversely, webpages that have not been crawled by Google will not show up in your search results. If you want your custom search engine to include sites that are not in the Google index, submit a Sitemap to the Indexing tab of the Control Panel or directly to Google Webmaster Tools.

A Sitemap includes a list of pages in your site, as well as information about the update frequency of the webpages and their importance relative to each other. Submitting a Sitemap helps Google discover your webpages and improve the crawling schedule. To learn more about Sitemaps, see the Webmaster Help Center and Using the Sitemap Protocol. If you are interested in building fancier Sitemaps, see http://www.sitemaps.org/protocol.php.
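
For a small site, a Sitemap can be generated with a few lines of script. The following Python sketch emits the urlset, url, loc, lastmod, changefreq, and priority elements defined by the Sitemap protocol at sitemaps.org. The build_sitemap helper and the www.example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: iterable of dicts with loc and optional lastmod, changefreq, priority."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")

PAGES = [
    {"loc": "http://www.example.com/", "lastmod": "2008-03-14",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.example.com/archive.html", "priority": 0.3},
]
print(build_sitemap(PAGES))
```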

Submitting Sitemaps is particularly helpful if your site has the following:

  • Dynamic content
  • Webpages that aren't easily discovered by Googlebot (Google's web crawler), such as pages with rich AJAX or Flash features
  • Few websites linking to it.

    Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it is hard for the crawler to discover it. If your website is new, probably not many websites are pointing to your site.

  • A large archive of content pages that does not have a strong network of cross-linking

Google can index only pages it can access. So, if you use a robots.txt file or robots meta tags, make sure they don't block crawlers from the pages you want indexed.
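
One way to spot-check a robots.txt file is with Python's standard urllib.robotparser module, as in the sketch below. The googlebot_can_fetch helper, the sample rules, and the www.example.com URLs are hypothetical. Python's parser may not match Googlebot's own rule handling in every case, and this check does not look at robots meta tags.

```python
from urllib.robotparser import RobotFileParser

def googlebot_can_fetch(robots_txt, url):
    """Return True if the given robots.txt text lets Googlebot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)

ROBOTS = """User-agent: *
Disallow: /private/
"""

print(googlebot_can_fetch(ROBOTS, "http://www.example.com/docs/index.html"))
print(googlebot_can_fetch(ROBOTS, "http://www.example.com/private/notes.html"))
```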

Improved coverage is not instantaneous, as it takes some time for the pages to be crawled and indexed. But once your webpages are in the index, they could appear in both Google search and your custom search engine.

Improving Search Freshness

If you are not creating or maintaining a search engine that searches just your website, you can skip this section. You cannot apply the strategies discussed in this section to websites that you do not own or manage.

After you submit a Sitemap, Google will start crawling some or all of the webpages, and, over time, the search results for your website will improve. But if you can't wait and you want certain webpages crawled and indexed within the next 24 hours, you can expedite the crawling of your most important webpages by going to the Indexing tab of the Control Panel and clicking the Index Now button under the Indexing section.

Note: Custom Search and Google search use different selection criteria, so submitting pages for on-demand indexing will not make them appear any faster in the Google search index.

For each search engine in your account, you can submit one Sitemap for on-demand indexing. Custom Search will crawl 10 webpages that you have marked with the highest priority values in your Sitemap. If more than 10 webpages have the highest priority values, Custom Search will crawl the highest priority pages with the most recent last modified date. If you have upgraded to Google Site Search, you have a higher limit for on-demand indexing. The limit, which starts from 50 webpages, varies according to your account level.

If you have more than 10 webpages that you want indexed immediately, you can resubmit an updated Sitemap with the next 10 webpages marked with the highest priority value. If you had just clicked the Index Now button in the Control Panel, wait 24 hours and then click again. If you have multiple Sitemaps for a custom search engine, you can submit your most important Sitemap first, wait 24 hours, and then submit the next Sitemap. You can keep submitting the rest of your Sitemaps in 24-hour cycles.
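
To predict which pages a submission is likely to cover, you can apply the selection rule described above to your own Sitemap data. The following Python sketch sorts pages by priority and then by most recent last modified date, and keeps the first 10. It only illustrates the rule as this page states it; it is not Custom Search's implementation, and the function name and www.example.com URLs are hypothetical.

```python
from datetime import date

def pick_for_on_demand_indexing(pages, limit=10):
    """Order by priority, then by most recent lastmod, and keep the first `limit`."""
    ranked = sorted(pages, key=lambda p: (p["priority"], p["lastmod"]), reverse=True)
    return [p["loc"] for p in ranked[:limit]]

# Twelve pages share the highest priority, so lastmod decides which 10 are kept.
PAGES = [
    {"loc": f"http://www.example.com/page{i}.html", "priority": 1.0,
     "lastmod": date(2008, 3, i)}
    for i in range(1, 13)
]
PAGES.append({"loc": "http://www.example.com/old.html", "priority": 0.5,
              "lastmod": date(2008, 4, 1)})

for loc in pick_for_on_demand_indexing(PAGES):
    print(loc)
```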

As time passes, new pages on your site will eventually get crawled and included in the main Google index. This frees up your on-demand indexing quota so you can submit new pages.

Hosting the Annotations Files Yourself

You can host annotations files on your own server instead of uploading them in the Control Panel. This is useful if you:

  • Update the annotations frequently.
  • Manage the annotations files without using the Control Panel.
  • Use scripts to create custom search engines.

    If you have fast-changing data, you could use scripts to convert XML output into XML annotations files, and Custom Search will just grab the updated annotation data from your site. Your script could get data from anywhere, such as a database, RSS feeds, Atom feeds, iCal feeds, and Open Directory.

To host and manage the annotations files on your website, you must tell Custom Search where to find them by creating and uploading a root annotations file that points to the hosted files.

Here's an example of a root annotation that refers to an annotations file hosted on a website:

<GoogleCustomizations>
  <Include type="Annotations" href="http://www.yoursite.com/cse_bacon_annotations.xml" />
</GoogleCustomizations>

Your root annotations file does not have to be sparse. You can have one or more full-blown annotations files that refer to other annotations files.

The following example refers to a hosted annotations file inside a full-blown annotations file.

<GoogleCustomizations>
  <Annotations file="livercancer-annotations.xml">
    <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
      <Label name="_cse_Ansi-stoubiq"/>
      <Label name="symptoms"/>
      <Comment>This labels this url as symptoms.</Comment>
    </Annotation>
  </Annotations>
  <Include type="Annotations" href="http://mysite.com/myannofile.xml" /> 
</GoogleCustomizations>

Note: Include is a child element of GoogleCustomizations, not Annotations.

You can have up to five levels of nested Include tags. Regardless of your nesting structure, you can include up to 50 annotations files.
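
If you maintain many hosted files, you could generate the root annotations file from a list of URLs. The following Python sketch writes one Include element per hosted file and refuses to exceed the 50-file limit mentioned above. The build_root_annotations helper is hypothetical, and it counts only the files it lists directly, not files that those files include in turn.

```python
import xml.etree.ElementTree as ET

MAX_INCLUDED_FILES = 50  # limit stated in this section

def build_root_annotations(hrefs):
    """Return a root annotations file that points at each hosted file."""
    if len(hrefs) > MAX_INCLUDED_FILES:
        raise ValueError(f"{len(hrefs)} files exceeds the limit of {MAX_INCLUDED_FILES}")
    root = ET.Element("GoogleCustomizations")
    for href in hrefs:
        # Include is a child of GoogleCustomizations, not of Annotations.
        ET.SubElement(root, "Include", type="Annotations", href=href)
    return ET.tostring(root, encoding="unicode")

print(build_root_annotations(["http://www.yoursite.com/cse_bacon_annotations.xml"]))
```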

Annotations Limits

The following table lists the limits for annotations files that are uploaded to Custom Search and annotations files that are hosted on your website:

Note: Follow the limits closely; if you exceed them, your search engine might not show results.

Maximum allowed: File size (context or annotations files)

  • Hosted on Google Custom Search: 30KB
  • Hosted on your site: 3MB

Maximum allowed: Number of files

  • Hosted on Google Custom Search: As many files as you need, so long as you do not exceed the global annotations limit (5,000)
  • Hosted on your site: 50

Maximum allowed: Number of annotations per file

  • Hosted on Google Custom Search: 2,000
  • Hosted on your site: 5,000, so long as the file size does not exceed 3MB and the total size of all the files does not exceed 10MB

Maximum allowed: Total number of annotations for all your search engines

  • Hosted on Google Custom Search: 5,000
  • Hosted on your site: 5,000, so long as the aggregate size of all files does not exceed 10MB

Tip: If you find your search engines outgrowing the 5,000-site limit, consider consolidating individual URLs into URL patterns.
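
A script can catch the per-file limits before you upload or publish a file. The following Python sketch checks the file size and the number of Annotation elements against the values listed above. The check_limits helper is hypothetical, treating KB and MB as multiples of 1024 is an assumption, and the sketch does not check the totals across files or search engines.

```python
import xml.etree.ElementTree as ET

KB = 1024  # assumption: the limits use binary units
LIMITS = {
    "uploaded": {"max_bytes": 30 * KB, "max_annotations": 2000},
    "hosted": {"max_bytes": 3 * KB * KB, "max_annotations": 5000},
}

def check_limits(xml_bytes, where):
    """Return a list of per-file limit violations; where is 'uploaded' or 'hosted'."""
    limits = LIMITS[where]
    problems = []
    if len(xml_bytes) > limits["max_bytes"]:
        problems.append(f"file is {len(xml_bytes)} bytes, limit is {limits['max_bytes']}")
    count = sum(1 for _ in ET.fromstring(xml_bytes).iter("Annotation"))
    if count > limits["max_annotations"]:
        problems.append(f"file has {count} annotations, limit is {limits['max_annotations']}")
    return problems

SMALL = b'<Annotations><Annotation about="www.example.com/*"><Label name="_cse_x"/></Annotation></Annotations>'
print(check_limits(SMALL, "uploaded"))
```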
