Annotations: Defining Sites to Search

This page describes how to define the coverage of your search engine using an annotations file in OPML, TSV, or Custom Search XML format.

  1. Overview
  2. Choosing the Right Format
  3. Using the OPML Format
  4. Using the TSV Format
  5. Using the Custom Search XML Format
  6. Improving Search Coverage
  7. Improving Search Freshness
  8. Annotations Limits

Overview

Adding sites individually in the Custom Search Control Panel can be tedious if you're building a large search engine, and managing a large collection of sites there is no easier. Instead, you can add and manage many sites by listing them in an annotations file and uploading it. Annotations files, particularly XML ones, also give you far greater control over the ranking of search results.

An annotations file is simply a list of annotations. Each annotation has two components: the site and its associated labels. The label tells Custom Search how to handle a site; that is, whether a site should be included, excluded, promoted, or demoted. In the context file, you define labels; in the annotations file, you tag sites with the appropriate labels.

Annotations files can be in any of the following formats:

  • OPML
  • TSV (tab-separated values)
  • Custom Search XML

When you start editing your annotations file, start out with a small number of annotations, and then test some search queries in the Preview tab of the Control Panel. It's easier to test and troubleshoot your search engine with a handful of annotations. When you get the results that you expect, incrementally add more annotations.

You can upload the annotations file to the Control Panel. For details about file limits, see the Annotations Limits section.

Choosing the Right Format

Before you start creating annotations, determine which file format best suits your needs. If your search engine increases in complexity, you can consider using multiple annotations files, even files of different formats. For example, you can upload OPML annotations files generated by other sites and XML annotations files you created. Custom Search combines all the annotations files in all your search engines into a single XML annotations file.

Use the following comparison to pick the appropriate format:

To create: A search engine with an existing OPML file (feed-based search engine)

  • Use: OPML format
  • Because: You do not need to recreate annotations if you already have OPML files with URL patterns. You can upload the existing file directly to the Control Panel.
  • Limitations: You cannot directly fine-tune the ranking of search results.
  • More information: Using the OPML Format

To create: A search engine that does not need all the advanced features

  • Use: TSV format
  • Because: You can create and manage the annotations in a more readable format. You can use a spreadsheet editor. You can take advantage of many advanced features, such as applying labels, associating scores, and adding comments. You can create your own attributes. However, they are mostly for your own use; Custom Search does not do anything with them.
  • Limitations: You cannot refer to another Custom Search file, and this is not the best option for programmatically created search engines.
  • More information: Using the TSV Format

To create: A complex and heavily customized search engine

  • Use: Custom Search XML format
  • Because: It's the most powerful format. It is appropriate for developers who want to create advanced search engines with bells and whistles. It gives you more flexibility and greater control over the ranking of your search results.
  • Limitations: It is the most complex format.
  • More information: Using the Custom Search XML Format

Using the OPML Format

OPML is a type of XML format that was originally developed for defining ordered lists of elements or outlines, but it is now also commonly used for web feeds. For details, see the OPML specification.

If you have OPML files from some feed aggregators, you can upload the OPML file without bothering with typing each site. Custom Search grabs the value of the OPML attribute htmlUrl and adds it to the list of sites to search. You can upload multiple OPML files for each of your search engines.

Here's an example of an OPML file:

<opml version="1.0">
  <head>
    <title>Bicycles</title>
    <dateCreated>Fri Mar 14 23:21:11 PDT 2008</dateCreated>
    <dateModified>Fri Mar 14 23:21:11 PDT 2008</dateModified>
  </head>
  <body>
    <outline type="rss" text="Road Bikes" xmlUrl="http://www.google.com/exampleurl.opml" 
         htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes" xmlUrl="http://www.google.com/exampleurl2.opml"
         htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>

When you upload an OPML file in the Control Panel, Custom Search automatically converts OPML to Custom Search XML. It adds search engine labels (<Label name="_cse_example"/>) and scores (score="1"). For more information about scores, see the Ranking Search Results page.

The following is an example of an OPML file that has been converted to Custom Search XML:

<GoogleCustomizations>
   <Annotations>
     <Annotation about="www.google.com/sampleurl1.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
     <Annotation about="www.google.com/sampleurl2.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
   </Annotations>
</GoogleCustomizations>
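
The conversion described above can be approximated locally. The following Python sketch is only an illustration, not how Custom Search implements it: it reads the htmlUrl attribute of each outline element and emits an annotation with score="1". The function name opml_to_annotations is hypothetical, and dropping the URL scheme is an assumption based on the converted example.

```python
import xml.etree.ElementTree as ET

OPML = """<opml version="1.0">
  <body>
    <outline type="rss" text="Road Bikes"
             xmlUrl="http://www.google.com/exampleurl1.opml"
             htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes"
             xmlUrl="http://www.google.com/exampleurl2.opml"
             htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>"""


def opml_to_annotations(opml_text, label):
    """Build a GoogleCustomizations tree from the htmlUrl of each outline."""
    root = ET.Element("GoogleCustomizations")
    annotations = ET.SubElement(root, "Annotations")
    for outline in ET.fromstring(opml_text).iter("outline"):
        html_url = outline.get("htmlUrl")
        if not html_url:
            continue  # an outline without htmlUrl contributes no site
        # Assumption: the scheme is dropped, as in the converted example.
        about = html_url.split("://", 1)[-1]
        annotation = ET.SubElement(annotations, "Annotation",
                                   about=about, score="1")
        ET.SubElement(annotation, "Label", name=label)
    return root


tree = opml_to_annotations(OPML, "_cse_example")
print(ET.tostring(tree, encoding="unicode"))
```

Replace _cse_example with the label of your own search engine before comparing the output with the file you download from the Control Panel.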

Using the TSV Format

You can create annotations using a text file with tab-separated values (TSV).

You can use a plain text editor or a spreadsheet editor to create the file. It does not matter what you name the file, so long as you save it with the file extension .tsv (for example, cse_bicycles.tsv). If you are using a plain text editor, separate each element by a single tab character. Do not try to prettify and align the lines with multiple tab characters. If you are using a spreadsheet editor, allocate a column for each of the fields.

Each line of text in your TSV file can list a site and its associated labels.

Elements of a Custom Search TSV

Your TSV files must begin with a heading that enumerates the fields that you will be using in the subsequent annotation lines. The headings are case-sensitive, so follow the capitalization in this guide. The order of the heading elements doesn't really matter, but the annotation lines that follow the heading must follow the order of the headings. When you create the headings, you are essentially creating columns of data, so you can't just plug the annotation data any which way.

A heading has the following fields:

  • URL - The URL pattern of the site.

  • Label - The search engine label or refinement label that should be applied to the site. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find at least two search engine or background labels: one for adding sites to your custom search engine and one for excluding sites from it. If you have not changed the search engine label, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
  • Comment - Optional. Notes about each annotation.
  • Score - Optional. Discussed in detail in the Ranking Search Results page.
  • Custom Field - Optional. Your own attributes. To create an attribute, just prefix it with "A=". For example, to create a date attribute, use "A=Date". Custom Search does not process these fields.

Each subsequent line corresponds to an annotation. It provides the values for the fields that were defined in the headings.

TSV Example

Let's look at an example of a basic TSV file.

URL        Label
www.webmd.com/hw/*     _cse_Ansi-stoubiq
www.webmd.com/hw/cancer/*     _cse_exclude_Ansi-stoubiq

The example has a heading with the two required fields: URL and Label. The two annotation lines supply the values for the fields. The label in the first annotation line, _cse_Ansi-stoubiq, adds the site, www.webmd.com/hw/*, to the search engine. The other label, _cse_exclude_Ansi-stoubiq, excludes the site, www.webmd.com/hw/cancer/*, from the search engine.

You can add more fields to your TSV annotations. Here's an example that includes a Comment field and a custom field, A=Date.

URL     Label     Comment     A=Date
www.cancer.gov/cancertopics/types/liver/*     _cse_Ansi-stoubiq     government site     20060504
www.medicinenet.com/liver_cancer/*     _cse_Ansi-stoubiq     site on symptoms     20060504
www.webmd.com/hw/cancer/*     _cse_Ansi-stoubiq     great site for patients!     20060504
www.oncologychannel.com/*/treatment     _cse_Ansi-stoubiq     20060504

Even though you added new fields in the header, you are not obligated to supply values for all of them, which is why it's fine for the last line to not have a comment. But that's not the case for URL and Label, which are required fields.
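
If you generate the TSV file from a script rather than a spreadsheet, a sketch like the following may help. It uses Python's csv module with a tab delimiter, writes the heading first, and rejects rows that lack the required URL or Label fields. The function name to_tsv is hypothetical, and writing a missing optional value as an empty column is an assumption about how positional fields should be kept aligned.

```python
import csv
import io

FIELDS = ["URL", "Label", "Comment", "A=Date"]

rows = [
    {"URL": "www.cancer.gov/cancertopics/types/liver/*",
     "Label": "_cse_Ansi-stoubiq",
     "Comment": "government site",
     "A=Date": "20060504"},
    # No Comment here: the optional field is simply left empty.
    {"URL": "www.oncologychannel.com/*/treatment",
     "Label": "_cse_Ansi-stoubiq",
     "A=Date": "20060504"},
]


def to_tsv(rows, fields):
    """Return TSV text with a heading line followed by one line per row."""
    for row in rows:
        if not row.get("URL") or not row.get("Label"):
            raise ValueError("URL and Label are required: %r" % (row,))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()      # heading line, in the order given by fields
    writer.writerows(rows)    # exactly one tab between adjacent fields
    return buffer.getvalue()


print(to_tsv(rows, FIELDS))
```

Save the returned text with a .tsv extension, for example cse_bicycles.tsv.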

Using the Custom Search XML Format

If you want to take advantage of all the features available in the Custom Search API, XML is the way to go. You can organize XML annotations files in more than one way. The following comparison describes the different strategies for the XML format. It's just a matter of preference, so you should not worry too much about picking the right way to annotate your search engine. If you change your mind, you can reorganize your annotations by cutting and pasting.

To do this: Keep the annotations for each search engine separate

  • Use: An external annotations file for each search engine
  • Because: Custom Search merges all annotations into a single annotations file, but you can create and upload them separately. Each file pertains to a search engine.
  • Limitations: If there's overlap between search engines, you might end up managing the same sites in multiple places.
  • More information: XML Annotations

To do this: Pool all annotations across all your search engines in a single place

  • Use: One external annotations file shared by all search engines
  • Because: Having all annotations in a single file lets you manage annotations across all search engines. A communal annotations file enables you to list sites only once, yet have the flexibility to change inclusion, exclusion, and ranking of the same sites for various search engines. For example, one of your search engines could restrict its search to five sites, another could eliminate those sites, and yet another could promote those sites.
  • Limitations: If you have a lot of annotations, it could be hard to manage the file. You always have to verify that you are changing the annotations for the right search engine.
  • More information: XML Annotations

When you upload your files in the Control Panel, Custom Search merges all your annotations into a single annotations file that is shared by all your search engines. This is the annotations file you download from the Control Panel. You can distinguish the annotations by their search engine labels (the value of the name attribute in the Label element).

<Annotation about="http://www.solarenergy.org/*">
   <Label name="_cse_abcdefghijk"/>
</Annotation>

If you prefer to keep the annotations for each search engine separate, you should maintain the original annotations files and upload them to the Control Panel when you make changes. To keep things simple, stick with using the XML format. Do not alternate between using the XML format and the Sites tab in the Control Panel to include or exclude sites, because changes made to the Sites tab are appended to the communal annotations file and you'll have to copy these new annotations to your copy of the annotations file.

XML Annotations

The following is an example of XML annotations. It is roughly the XML version of the TSV example in the previous section. It includes the same elements, except for custom attributes, which are available only in the TSV format. This annotations file tells Custom Search to include everything under www.cancer.gov/cancertopics/types/liver/* but exclude the other three URL patterns.

<Annotations>
  <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
    <Label name="_cse_Ansi-stoubiq"/>
    <Comment>government site</Comment>
  </Annotation>
  <Annotation about="www.medicinenet.com/liver_cancer/">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>site on symptoms</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/cancer/*">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.oncologychannel.com/*/treatment">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
  </Annotation>
</Annotations>

The annotations file has four elements in the following hierarchy:

  • Annotations (root element)
    • Annotation
      • Label
      • Comment (optional)

Creating External Annotations

To list sites you want your search engine to cover, do the following:

  1. Start the file with the <Annotations></Annotations> root element.
  2. Create an annotation by adding the <Annotation></Annotation> tags, and then define the about attribute with the URL pattern of the site.
    <Annotations>
      <Annotation about="www.webmd.com/hw/cancer/*">
      </Annotation>
    </Annotations>
    
  3. Associate the site with the search engine by using the <Label name=" "/> tag, and specify how that site should be treated by the search engine. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find two labels: one for adding sites to your custom search engine and one for excluding sites from it. If you have not changed the name of the search engine label in the context file, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
    <Annotations>
      <Annotation about="http://www.solarenergy.org/*">
        <Label name="_cse_abcdefghijk"/>
      </Annotation>
    </Annotations>

    A single site can have multiple labels associated with it, and it can be treated differently by different search engines. For example, the site http://www.solarenergy.org/ can be included in your solar energy search engine and excluded from your bike search engine. The same site will be ranked differently in the result pages of different search engines.

    If you have changed the name of the label in the context file, remember to update the Label name values in your annotations files.

  4. To add more sites, create and define another Annotation element.
  5. Save the XML file.
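
The steps above can also be scripted. The following Python sketch is a minimal illustration: it creates the root element, adds one Annotation per URL pattern, and attaches one Label per search engine. The exclude label for the bike engine and the www.example.org pattern are made up for this example; copy your real labels from the Control Panel.

```python
import xml.etree.ElementTree as ET

# Hypothetical labels; yours come from the Context section of the Advanced tab.
SOLAR_INCLUDE = "_cse_abcdefghijk"
BIKES_EXCLUDE = "_cse_exclude_lmnopqrstuv"


def build_annotations(sites):
    """sites maps a URL pattern to the list of labels applied to it."""
    root = ET.Element("Annotations")                      # step 1
    for pattern, labels in sites.items():
        annotation = ET.SubElement(root, "Annotation",    # steps 2 and 4
                                   about=pattern)
        for label in labels:
            ET.SubElement(annotation, "Label", name=label)  # step 3
    return ET.tostring(root, encoding="unicode")


xml_text = build_annotations({
    # One site, two labels: included in one engine, excluded from another.
    "http://www.solarenergy.org/*": [SOLAR_INCLUDE, BIKES_EXCLUDE],
    "www.example.org/panels/*": [SOLAR_INCLUDE],
})
print(xml_text)  # step 5: write this string to a .xml file and upload it
```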

Improving Search Coverage

Custom Search is built on top of the Google index. This means that webpages that are in the Google index are available to your search engine; conversely, webpages that have not been crawled by Google will not show up in your search results. If you want your custom search engine to include sites that are not in the Google index, submit a Sitemap to the Indexing tab of the Control Panel or directly to Google Search Console.

A Sitemap includes a list of pages in your site, as well as information about the update frequency of the webpages and their importance relative to each other. Submitting a Sitemap helps Google discover your webpages and improve the crawling schedule. To learn more about Sitemaps, see the Webmaster Help Center and Using the Sitemap Protocol. If you are interested in building fancier Sitemaps, see http://www.sitemaps.org/protocol.php.
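
As a rough illustration of what such a file contains, the following Python sketch builds a small Sitemap with the loc, lastmod, changefreq, and priority elements defined by the Sitemap protocol. The www.example.com URLs and the build_sitemap helper are hypothetical; check the protocol documentation for the full set of rules.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical pages; priority ranges from 0.0 to 1.0 within your own site.
pages = [
    {"loc": "http://www.example.com/", "lastmod": "2008-03-14",
     "changefreq": "daily", "priority": "1.0"},
    {"loc": "http://www.example.com/archive/2007.html", "lastmod": "2007-12-31",
     "changefreq": "yearly", "priority": "0.3"},
]


def build_sitemap(pages):
    """Return Sitemap XML text with one url entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            ET.SubElement(url, field).text = page[field]
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```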

Submitting Sitemaps is particularly helpful if your site has the following:

  • Dynamic content
  • Webpages that aren't easily discovered by Googlebot (Google's web crawler), such as pages with rich AJAX or Flash features
  • Few websites linking to it.

    Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it is hard for the crawler to discover it. If your website is new, probably not many websites are pointing to your site.

  • A large archive of content pages that does not have a strong network of cross-linking

Google can index only pages it can access. So, if you use a robots.txt file or robots meta tags on your site, make sure they don't block crawlers from the pages you want indexed.
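
One way to spot accidental blocking is to test your URLs against your robots.txt rules before you submit a Sitemap. The sketch below uses Python's standard urllib.robotparser module. The rules and the www.example.com URLs are made up for illustration, and the check covers robots.txt only, not robots meta tags.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for www.example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

urls = [
    "http://www.example.com/bikes/road.html",
    "http://www.example.com/private/drafts.html",
]


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given rules forbid the crawler to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


print(blocked_urls(ROBOTS_TXT, urls))
```

Any URL that appears in the output cannot be crawled under these rules, so it cannot show up in your search results.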

Improved coverage is not instantaneous, as it takes some time for the pages to be crawled and indexed. But once your webpages are in the index, they could appear in both Google search and your custom search engine.

Improving Search Freshness

If you are not creating or maintaining a search engine that searches just your website, you can skip this section. You cannot apply the strategies discussed in this section to websites that you do not own or manage.

After you submit a Sitemap, Google will start crawling some or all of the webpages, and, over time, the search results for your website will improve. But if you can't wait and you want certain webpages crawled and indexed within the next 24 hours, you can expedite the crawling of your most important webpages by going to the Indexing tab of the Control Panel and clicking the Index Now button under the Indexing section.

Note: Custom Search and Google search use different selection criteria, so submitting pages for on-demand indexing will not make them appear any faster in the Google search index.

For each search engine in your account, you can submit one Sitemap for on-demand indexing. Custom Search will crawl 10 webpages that you have marked with the highest priority values in your Sitemap. If more than 10 webpages have the highest priority values, Custom Search will crawl the highest priority pages with the most recent last modified date. If you have upgraded to Google Site Search, you have a higher limit for on-demand indexing. The limit, which starts from 50 webpages, varies according to your account level.
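
To predict which pages an on-demand request is likely to pick, you can apply the same ordering to your own Sitemap data. The following Python sketch reflects one reading of the rule above (highest priority first, most recent last modified date as the tie-breaker); it is not the actual selection code. The page list and the pick_for_on_demand helper are hypothetical.

```python
from datetime import date

# Hypothetical Sitemap entries: twelve top-priority pages and one minor page.
pages = [
    {"loc": "http://www.example.com/news/%d.html" % day,
     "priority": 1.0, "lastmod": date(2008, 3, day)}
    for day in range(1, 13)
]
pages.append({"loc": "http://www.example.com/about.html",
              "priority": 0.5, "lastmod": date(2008, 4, 1)})


def pick_for_on_demand(pages, limit=10):
    """Order by priority, then by last modified date, and keep the first ones."""
    ranked = sorted(pages, key=lambda p: (p["priority"], p["lastmod"]),
                    reverse=True)
    return ranked[:limit]


for page in pick_for_on_demand(pages):
    print(page["loc"], page["lastmod"])
```

With this data, the two oldest top-priority pages and the lower-priority page fall outside the limit of 10.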

If you have more than 10 webpages that you want indexed immediately, you can resubmit an updated Sitemap with the next 10 webpages marked with the highest priority value. If you had just clicked the Index Now button in the Control Panel, wait 24 hours and then click again. If you have multiple Sitemaps for a custom search engine, you can submit your most important Sitemap first, wait 24 hours, and then submit the next Sitemap. You can keep submitting the rest of your Sitemaps in 24-hour cycles.

As time passes, new pages on your site will eventually get crawled and included in the main Google index. This frees up your on-demand indexing quota so you can submit new pages.

Annotations Limits

The following table lists the limits for annotations files that are uploaded to Custom Search:

Note: Follow the limits closely; if you exceed them, your search engine might not show results.

Aspect                                                    Limit
File size (context or annotations files)                  30KB
Number of files                                           As many files as you need, so long as you do not exceed the global annotations limit (5,000)
Number of annotations per file                            2,000
Total number of annotations for all your search engines   5,000

Tip: If you find your search engines outgrowing the large 5,000-site limit, consider consolidating individual URLs into URL patterns.
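
Before uploading, you can check a set of XML annotations files against these limits. The sketch below is a rough pre-flight check under stated assumptions: it treats 30KB as 30 × 1024 bytes and counts Annotation elements in each file. The check_limits helper and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_FILE_BYTES = 30 * 1024   # assumption: "30KB" means 30 * 1024 bytes
MAX_PER_FILE = 2000
MAX_TOTAL = 5000


def check_limits(files):
    """files maps a file name to its XML text; returns a list of problems."""
    problems = []
    total = 0
    for name, text in files.items():
        size = len(text.encode("utf-8"))
        count = sum(1 for _ in ET.fromstring(text).iter("Annotation"))
        total += count
        if size > MAX_FILE_BYTES:
            problems.append("%s: %d bytes exceeds the file size limit" % (name, size))
        if count > MAX_PER_FILE:
            problems.append("%s: %d annotations exceeds the per-file limit" % (name, count))
    if total > MAX_TOTAL:
        problems.append("%d annotations in total exceeds the global limit" % total)
    return problems


small = '<Annotations><Annotation about="www.example.com/*"/></Annotations>'
print(check_limits({"bikes.xml": small}))
```

An empty list means none of the three checks failed for the files you passed in.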
