Annotations: Defining Sites to Search

This page describes how to define the coverage of your search engine using a TSV file or XML annotations file.

  1. Overview
  2. Choosing the Right Format
  3. Using the OPML Format
  4. Using the TSV Format
  5. Using the Programmable Search XML Format
  6. Improving Search Coverage
  7. Annotations Limits

Overview

Adding sites one at a time in the Programmable Search Engine Control Panel can be tedious if you're building a large search engine, and managing a large collection of sites there is just as cumbersome. Instead, you can add and manage many sites by listing them in an annotations file and uploading it. Annotations files, particularly XML ones, also give you far greater control over the ranking of search results.

An annotations file is simply a list of annotations. Each annotation has two components: the site and its associated labels. The label tells Programmable Search Engine how to handle a site; that is, whether a site should be included, excluded, promoted, or demoted. In the context file, you define labels; in the annotations file, you tag sites with the appropriate labels.

Annotations files can be in any of the following formats:

  • OPML
  • TSV
  • Programmable Search XML

When you start editing your annotations file, start out with a small number of annotations, and then test some search queries in the Preview tab of the Control Panel. It's easier to test and troubleshoot your search engine with a handful of annotations. When you get the results that you expect, incrementally add more annotations.

You can upload the annotations file to the Control Panel. For details about file limits, see the Annotations Limits section.


Choosing the Right Format

Before you start creating annotations, determine which file format best suits your needs. If your search engine increases in complexity, you can consider using multiple annotations files, even files of different formats. For example, you can upload OPML annotations files generated by other sites and XML annotations files you created. Programmable Search Engine combines all the annotations files in all your search engines into a single XML annotations file.

Use the following table to pick the appropriate format:

To create: A search engine with an existing OPML file (feed-based search engine)
Use: OPML format
Because: You do not need to recreate annotations if you already have OPML files with URL patterns. You can upload the existing file directly to the Control Panel.
Limitations: You cannot directly fine-tune the ranking of search results.
More information: Using the OPML format

To create: A search engine that does not need all the advanced features
Use: TSV format
Because:
  • You can create and manage the annotations in a more readable format.
  • You can use a spreadsheet editor.
  • You can take advantage of many advanced features, such as applying labels, associating scores, and adding comments.
  • You can create your own attributes. However, they are mostly for your own use; Programmable Search Engine does not do anything with them.
Limitations: You cannot refer to another Programmable Search Engine file, and this is not the best option for programmatically created search engines.
More information: Using the TSV format

To create: A complex and heavily customized search engine
Use: Programmable Search XML format
Because: It's the most powerful format. It is appropriate for developers who want to create advanced search engines with bells and whistles. It gives you more flexibility and greater control over the ranking of your search results.
Limitations: It is the most complex format.
More information: Using the XML format


Using the OPML Format

OPML is an XML format that was originally developed for defining ordered lists of elements or outlines, but it is now also commonly used for web feeds. For details, see the OPML specification.

If you have OPML files from feed aggregators, you can upload them directly instead of typing in each site. Programmable Search Engine takes the value of the OPML attribute htmlUrl and adds it to the list of sites to search. You can upload multiple OPML files for each of your search engines.

Here's an example of an OPML file:

<opml version="1.0">
  <head>
    <title>Bicycles</title>
    <dateCreated>Fri Mar 14 23:21:11 PDT 2008</dateCreated>
    <dateModified>Fri Mar 14 23:21:11 PDT 2008</dateModified>
  </head>
  <body>
    <outline type="rss" text="Road Bikes" xmlUrl="http://www.google.com/exampleurl.opml" 
         htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes" xmlUrl="http://www.google.com/exampleurl2.opml"
         htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>

When you upload an OPML file in the Control Panel, Programmable Search Engine automatically converts OPML to Programmable Search XML. It adds search engine labels (<Label name="_cse_example"/>) and scores (score="1"). For more information about scores, see the Ranking Search Results page.

The following is an example of an OPML file that has been converted to Programmable Search XML:

<GoogleCustomizations>
   <Annotations>
     <Annotation about="www.google.com/sampleurl1.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
     <Annotation about="www.google.com/sampleurl2.opml" score="1">
       <Label name="_cse_example"/>
     </Annotation>
   </Annotations>
</GoogleCustomizations>
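
The conversion described above can be approximated locally, for example to preview which sites an OPML file would contribute before uploading it. The following Python sketch is not Programmable Search Engine's own converter; it only mirrors the behavior this section describes (read htmlUrl, add a label and score="1"). The function names are hypothetical, and dropping the URL scheme is an assumption based on the converted example.

```python
import xml.etree.ElementTree as ET

OPML_TEXT = """<opml version="1.0">
  <body>
    <outline type="rss" text="Road Bikes"
             xmlUrl="http://www.google.com/exampleurl.opml"
             htmlUrl="http://www.google.com/sampleurl1.opml"/>
    <outline type="rss" text="Mountain Bikes"
             xmlUrl="http://www.google.com/exampleurl2.opml"
             htmlUrl="http://www.google.com/sampleurl2.opml"/>
  </body>
</opml>"""


def html_urls(opml_text):
    """Return the htmlUrl value of every <outline> element, in document order."""
    root = ET.fromstring(opml_text)
    return [o.get("htmlUrl") for o in root.iter("outline") if o.get("htmlUrl")]


def to_annotations(urls, label, score="1"):
    """Build a <GoogleCustomizations> tree with one <Annotation> per URL."""
    root = ET.Element("GoogleCustomizations")
    annotations = ET.SubElement(root, "Annotations")
    for url in urls:
        # Assumption: drop the scheme, as the converted example appears to do.
        pattern = url.split("://", 1)[-1]
        ann = ET.SubElement(annotations, "Annotation",
                            {"about": pattern, "score": score})
        ET.SubElement(ann, "Label", {"name": label})
    return root


urls = html_urls(OPML_TEXT)
tree = to_annotations(urls, "_cse_example")
print(ET.tostring(tree, encoding="unicode"))
```

Outlines without an htmlUrl attribute are skipped in this sketch; how the real converter treats them is not stated here.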


Using the TSV Format

You can create annotations using a text file with tab-separated values (TSV).

You can use a plain text editor or a spreadsheet editor to create the file. It does not matter what you name the file, so long as you save it with the file extension .tsv (for example, cse_bicycles.tsv). If you are using a plain text editor, separate each element by a single tab character. Do not try to prettify and align the lines with multiple tab characters. If you are using a spreadsheet editor, allocate a column for each of the fields.

Each line of text in your TSV file can list a site and its associated labels.

Elements of a Programmable Search Engine TSV

Your TSV file must begin with a heading that enumerates the fields used in the subsequent annotation lines. The headings are case-sensitive, so follow the capitalization in this guide. The heading fields can appear in any order, but the values in each annotation line must follow the same order as the heading. The heading effectively defines columns of data, so each value must sit in the column of its field.

A heading has the following fields:

  • URL - The URL pattern of the site.

  • Label - The search engine label or refinement label that should be applied to the site. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find at least two search engine or background labels: one for adding sites to your Programmable Search Engine and one for excluding sites from it. If you have not changed the search engine label, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
  • Comment - Optional. Notes about each annotation.
  • Score - Optional. Discussed in detail in the Ranking Search Results page.
  • Custom Field - Optional. Your own attributes. To create an attribute, just prefix it with "A=". For example, to create a date attribute, use "A=Date". Programmable Search Engine does not process these fields.

Each subsequent line corresponds to an annotation. It provides the values for the fields that were defined in the headings.
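
If you generate the file from a script rather than a spreadsheet, a TSV writer keeps the single-tab rule and the column order consistent. The following is a minimal Python sketch using the standard csv module. The label values, the example.com patterns, and the write_tsv helper are hypothetical; copy your real labels from the Control Panel.

```python
import csv
import io

# Hypothetical labels; copy the real ones from your Control Panel.
INCLUDE = "_cse_abcdefghijk"
EXCLUDE = "_cse_exclude_abcdefghijk"

# Heading fields, capitalized as in this guide. "A=Date" is a custom field.
FIELDS = ["URL", "Label", "Comment", "A=Date"]

rows = [
    {"URL": "www.example.com/docs/*", "Label": INCLUDE,
     "Comment": "main docs", "A=Date": "20240101"},
    # Optional fields may be left out; they are written as empty columns.
    {"URL": "www.example.com/docs/archive/*", "Label": EXCLUDE},
]


def write_tsv(rows, fields, out):
    """Write a heading line, then one annotation per line, single-tab separated."""
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_tsv(rows, FIELDS, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To produce an uploadable file, pass a file opened with open("cse_bicycles.tsv", "w", newline="") instead of the in-memory buffer.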


TSV Example

Let's look at an example of a basic TSV file.

URL        Label
www.webmd.com/hw/*     _cse_Ansi-stoubiq
www.webmd.com/hw/cancer/*     _cse_exclude_Ansi-stoubiq

The example has a heading with the two required fields: URL and Label. The two annotation lines supply the values for the fields. The label in the first annotation line, _cse_Ansi-stoubiq, adds the site, www.webmd.com/hw/*, to the search engine. The other label, _cse_exclude_Ansi-stoubiq, excludes the site, www.webmd.com/hw/cancer/*, from the search engine.

You can add more fields to your TSV annotations. Here's an example that includes a Comment field and a custom field, A=Date.

URL     Label     Comment     A=Date
www.cancer.gov/cancertopics/types/liver/*     _cse_Ansi-stoubiq     government site     20060504
www.medicinenet.com/liver_cancer/*     _cse_Ansi-stoubiq     site on symptoms     20060504
www.webmd.com/hw/cancer/*     _cse_Ansi-stoubiq     great site for patients!     20060504
www.oncologychannel.com/*/treatment     _cse_Ansi-stoubiq     20060504

Even though you added new fields in the header, you are not obligated to supply values for all of them, which is why it's fine for the last line to not have a comment. But that's not the case for URL and Label, which are required fields.
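
A quick local check can catch a missing URL or Label before you upload. The sketch below is a hypothetical helper, not part of Programmable Search Engine; it only enforces the two rules stated in this section (case-sensitive heading names, and URL and Label required on every annotation line).

```python
import csv
import io

REQUIRED = ("URL", "Label")


def check_tsv(text):
    """Return a list of problems found in TSV annotation text (empty if none)."""
    # QUOTE_NONE: treat quote characters as ordinary text, split on tabs only.
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    heading = next(reader, None)
    if heading is None:
        return ["file is empty"]
    # Heading names are case-sensitive, so "url" does not count as "URL".
    problems = [f"heading is missing required field {name!r}"
                for name in REQUIRED if name not in heading]
    if problems:
        return problems
    for line_no, values in enumerate(reader, start=2):
        if not values:
            continue  # skip blank lines
        row = dict(zip(heading, values))
        for name in REQUIRED:
            if not row.get(name, "").strip():
                problems.append(f"line {line_no}: {name} is empty")
    return problems


good = "URL\tLabel\tComment\nwww.example.com/*\t_cse_abcdefghijk\tdocs\n"
bad = "URL\tLabel\tComment\nwww.example.com/*\t\tno label here\n"
print(check_tsv(good))
print(check_tsv(bad))
```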


Using the Programmable Search XML Format

If you want to take advantage of all the features available in the Custom Search JSON API, XML is the way to go.

XML Annotations

The following is an example of XML annotations. It is roughly the XML version of the TSV example in the previous section. It includes the same elements, except for custom attributes, which are available only in the TSV format. This annotations file tells Programmable Search Engine to include everything under www.webmd.com/hw/* but exclude everything under www.webmd.com/hw/cancer/*.

<Annotations>
  <Annotation about="www.cancer.gov/cancertopics/types/liver/*">
    <Label name="_cse_Ansi-stoubiq"/>
    <Comment>government site</Comment>
  </Annotation>
  <Annotation about="www.medicinenet.com/liver_cancer/">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>site on symptoms</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/*">
    <Label name="_cse_Ansi-stoubiq"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/cancer/*">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.oncologychannel.com/*/treatment">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
  </Annotation>
</Annotations>

The annotations file has four elements in the following hierarchy:

  • Annotations (root element)
    • Annotation
      • Label
      • Comment (optional)
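
To see how this hierarchy reads in practice, the following Python sketch walks an annotations file and groups URL patterns by whether their label includes or excludes them. It assumes the default label names (_cse_... and _cse_exclude_...); if you renamed your labels in the context file, the prefix test would need to change. The split_by_label function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<Annotations>
  <Annotation about="www.webmd.com/hw/*">
    <Label name="_cse_Ansi-stoubiq"/>
    <Comment>great sites for patients!</Comment>
  </Annotation>
  <Annotation about="www.webmd.com/hw/cancer/*">
    <Label name="_cse_exclude_Ansi-stoubiq"/>
  </Annotation>
</Annotations>"""


def split_by_label(xml_text):
    """Group URL patterns into included and excluded lists by label prefix."""
    included, excluded = [], []
    for ann in ET.fromstring(xml_text).iter("Annotation"):
        pattern = ann.get("about")
        for label in ann.findall("Label"):
            name = label.get("name", "")
            # Check the longer prefix first: exclude labels also start with _cse_.
            if name.startswith("_cse_exclude_"):
                excluded.append(pattern)
            elif name.startswith("_cse_"):
                included.append(pattern)
    return included, excluded


included, excluded = split_by_label(XML_TEXT)
print("included:", included)
print("excluded:", excluded)
```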


Creating External Annotations

To list sites you want your search engine to cover, do the following:

  1. Start the file with the <Annotations></Annotations> root element.
  2. Create an annotation by adding the <Annotation></Annotation> tags, and then define the about attribute with the URL pattern of the site.
    <Annotations>
      <Annotation about="www.webmd.com/hw/cancer/*">
      </Annotation>
    </Annotations>
    
  3. Associate the site with the search engine by using the <Label name=" "/> tag, and specify how that site should be treated by the search engine. You can get the labels for your search engine from the Context section of the Advanced tab in the Control Panel. You'll find two labels: one for adding sites to your Programmable Search Engine and one for excluding sites from it. If you have not changed the name of the search engine label in the context file, the label for including sites is in the form of _cse_xxxxxxxxxxx, where x is a character, and the label for excluding sites is in the form of _cse_exclude_xxxxxxxxxxx. To avoid errors, copy and paste these labels instead of typing them by hand.
    <Annotations>
      <Annotation about="http://www.solarenergy.org/*">
        <Label name="_cse_abcdefghijk"/>
      </Annotation>
    </Annotations>

    A single site can have multiple labels associated with it.

    If you have changed the name of the label in the context file, remember to update the Label name values in your annotation file.

  4. To add more sites, create and define another Annotation element.
  5. Save the XML file.
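
The steps above can also be scripted. The following Python sketch builds the same structure with the standard xml.etree.ElementTree module; build_annotations is a hypothetical helper, and the labels and the example.com pattern are placeholders for your own values.

```python
import xml.etree.ElementTree as ET


def build_annotations(sites):
    """Build an <Annotations> tree from (url_pattern, [label, ...]) pairs."""
    root = ET.Element("Annotations")                       # step 1: root element
    for pattern, labels in sites:                          # step 4: one per site
        ann = ET.SubElement(root, "Annotation",
                            {"about": pattern})            # step 2: about attribute
        for label in labels:                               # a site may carry several labels
            ET.SubElement(ann, "Label", {"name": label})   # step 3: attach label
    return root


# Placeholder labels; copy the real ones from your Control Panel.
sites = [
    ("http://www.solarenergy.org/*", ["_cse_abcdefghijk"]),
    ("www.example.com/archive/*", ["_cse_exclude_abcdefghijk"]),
]

root = build_annotations(sites)
ET.indent(root)  # pretty-print; requires Python 3.9 or later
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

For step 5, ET.ElementTree(root).write("annotations.xml", encoding="utf-8", xml_declaration=True) saves the tree to a file; the file name is only an example.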


Improving Search Coverage

Programmable Search Engine is built on top of the Google index. This means that webpages that are in the Google index are available to your search engine; conversely, webpages that have not been crawled by Google will not show up in your search results. If you want your Programmable Search Engine to include sites that are not currently in the Google index, submit a Sitemap to Google Search Console.

A Sitemap includes a list of pages in your site, as well as information about the update frequency of the webpages and their importance relative to each other. Submitting a Sitemap helps Google discover your webpages and improve the crawling schedule. To learn more about Sitemaps, see the Webmaster Help Center and Using the Sitemap Protocol. If you are interested in building fancier Sitemaps, see http://www.sitemaps.org/protocol.php.
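
As an illustration of what such a file contains, the following Python sketch generates a minimal Sitemap with a location, update frequency, and relative priority for each page. The element names and namespace follow the sitemaps.org protocol; the example.com URLs and the build_sitemap helper are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a <urlset> from (loc, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        # Priority is relative to other pages on the same site (0.0 to 1.0).
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return urlset


pages = [
    ("https://www.example.com/", "daily", 1.0),
    ("https://www.example.com/archive/", "yearly", 0.3),
]

sitemap = build_sitemap(pages)
sitemap_text = ET.tostring(sitemap, encoding="unicode")
print(sitemap_text)
```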

Submitting Sitemaps is particularly helpful if your site has the following:

  • Dynamic content
  • Webpages that aren't easily discovered by Googlebot (Google's web crawler), such as pages with rich AJAX or Flash features
  • Few websites linking to it

    Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it is hard for the crawler to discover it. If your website is new, probably not many websites are pointing to your site.

  • A large archive of content pages that does not have a strong network of cross-linking

Google can index only pages it can access. So, if you use a robots.txt file or robots meta tags, make sure they don't block crawlers from the pages you want indexed.
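
One way to spot an accidental block is to test your robots.txt rules against the URLs you expect to be searchable. The sketch below uses Python's standard urllib.robotparser; it is a rough approximation, since Googlebot's own parser may interpret some rules differently, and it does not inspect robots meta tags. The rules, URLs, and the can_crawl helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example rules; in practice, read the contents of your site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


blocked = can_crawl(ROBOTS_TXT, "https://www.example.com/private/page.html")
allowed = can_crawl(ROBOTS_TXT, "https://www.example.com/public/page.html")
print("private page crawlable:", blocked)
print("public page crawlable:", allowed)
```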

Improved coverage is not instantaneous, as it takes some time for the pages to be crawled and indexed. But once your webpages are in the index, they could appear in both Google search and your Programmable Search Engine.


Annotations Limits

The following table lists the limits for annotations files that are uploaded to Programmable Search Engine:

Note: Follow the limits closely; if you exceed them, your search engine might not show results.

Aspect                                              Limit
File size (context or annotations files)            30KB
Number of annotations per file                      2,000
Maximum number of annotations per search engine     5,000

Tip: If your search engine outgrows the 5,000-annotation limit, consider consolidating individual URLs into URL patterns.
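
If you generate annotations files programmatically, you can compare them against these limits before uploading. The following Python sketch is a hypothetical helper; it takes file sizes and annotation counts you have already measured, and it interprets "30KB" as 30 × 1024 bytes, which is an assumption.

```python
# Limits from the table above. Reading "30KB" as 30 * 1024 bytes is an assumption.
MAX_FILE_BYTES = 30 * 1024
MAX_PER_FILE = 2000
MAX_PER_ENGINE = 5000


def check_limits(files):
    """Check limits; `files` maps a file name to (size_in_bytes, annotation_count)."""
    problems = []
    total = 0
    for name, (size, count) in files.items():
        total += count
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds {MAX_FILE_BYTES}")
        if count > MAX_PER_FILE:
            problems.append(f"{name}: {count} annotations exceeds {MAX_PER_FILE}")
    if total > MAX_PER_ENGINE:
        problems.append(f"engine: {total} annotations exceeds {MAX_PER_ENGINE}")
    return problems


ok = check_limits({"a.xml": (1000, 1500)})
too_big = check_limits({
    "a.tsv": (40000, 2500),
    "b.tsv": (100, 1800),
    "c.tsv": (100, 1000),
})
print(ok)
print(too_big)
```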
