
How do I create an HTML snapshot?

Generating HTML snapshots

As long as your app observes the URL mapping and HTML snapshot scheme described in the Specification section, your app is visible to the crawler. It does not matter whether

The only requirement is that your app uses AJAX. But otherwise you wouldn't be reading this.

It's impossible to document all the different ways web applications can be developed. There are too many web frameworks, libraries, and so forth to enumerate. Likewise, there are many ways to create HTML snapshots, and which way is best for your application depends entirely on how your web app is implemented. Let's discuss some common scenarios.

1. Legacy method: create static content for each dynamic state

Webmasters have used techniques such as HIJAX to serve static content for dynamic state. If you already have this content, or if your app's content does not change frequently, returning static content is still a valid option. However, for AJAX URLs to be displayed in search results rather than static HTML URLs, the main modifications still required are to adopt the #! AJAX URL structure and properly respond to _escaped_fragment_ requests.
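To make the URL mapping concrete, here is a minimal sketch of how a #! URL might be turned into its _escaped_fragment_ counterpart. The function names toUglyUrl and escapeFragment and the example.com address are hypothetical, and the set of characters escaped here (control characters and space, #, %, &, +, and anything at or above 0x7F) is an assumption to be checked against the Specification section.

```javascript
// Percent-escape the characters assumed to be special inside the fragment.
function escapeFragment(fragment) {
  let out = "";
  for (const ch of fragment) {
    const code = ch.codePointAt(0);
    const mustEscape =
      code <= 0x20 || ch === "#" || ch === "%" || ch === "&" ||
      ch === "+" || code >= 0x7f;
    // encodeURIComponent yields uppercase %XX sequences (UTF-8 for non-ASCII)
    out += mustEscape ? encodeURIComponent(ch) : ch;
  }
  return out;
}

// Map a "pretty" #! URL to the "ugly" URL that carries _escaped_fragment_.
function toUglyUrl(prettyUrl) {
  const idx = prettyUrl.indexOf("#!");
  if (idx === -1) {
    return prettyUrl; // no #! fragment, nothing to map
  }
  const base = prettyUrl.slice(0, idx);
  const fragment = prettyUrl.slice(idx + 2);
  // append to an existing query string if there is one
  const separator = base.includes("?") ? "&" : "?";
  return base + separator + "_escaped_fragment_=" + escapeFragment(fragment);
}

console.log(toUglyUrl("http://example.com/movies#!tab0&q=Octopus+spotting"));
// prints http://example.com/movies?_escaped_fragment_=tab0%26q=Octopus%2Bspotting
```

A server that already has static content for each state only needs to recognize the second form and return the matching static page.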

2. Much of your content is created by a server-side technology such as PHP or ASP.NET

Your application may use AJAX for improved interactivity while much of the content is actually produced with a server-side technology such as PHP or ASP.NET. For example, consider a fictitious movie information application.

In this application, the URL contains a hash fragment, #!tab0&q=Octopus+spotting. Behind the scenes, your application uses JavaScript to process the hash fragment and fire off an XHR, which in turn could call a PHP script. The PHP script then looks up the relevant content and produces the HTML that is added to the page without a full refresh. For example, if you use jQuery, you may parse the hash fragment and then send off an XHR as follows:

    function getResults(params, urlToGet) {
        // urlToGet will be something like getdata.php?tab=tab0&q=Octopus+spotting
        $.ajax({
            type: "POST",
            url: urlToGet,
            dataType: "json",
            success: function(msg){
              updatePage(params, msg);
            }
        });
    }
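The snippet above assumes that something has already turned the hash fragment into urlToGet. A minimal sketch of that step might look as follows. The function name buildDataUrl is hypothetical, and the fragment layout (a tab name first, then ordinary key=value pairs) is an assumption based on the example URL.

```javascript
// Turn a hash such as "#!tab0&q=Octopus+spotting" into
// "getdata.php?tab=tab0&q=Octopus+spotting".
function buildDataUrl(hash) {
  // strip the leading "#!" (or a bare "#") if present
  const state = hash.startsWith("#!") ? hash.slice(2) : hash.replace(/^#/, "");
  const parts = state.split("&");
  // assumption: the first token is the tab name, with no key
  const tab = parts.shift();
  const query = ["tab=" + encodeURIComponent(tab)].concat(parts);
  return "getdata.php?" + query.join("&");
}

console.log(buildDataUrl("#!tab0&q=Octopus+spotting"));
// prints getdata.php?tab=tab0&q=Octopus+spotting
```

In the browser, the result could then be passed to getResults as its urlToGet argument, for example with window.location.hash as the input.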

In this case, although the HTML containing your content (here, the description of the fictitious octopus movie) is created on the server rather than in the browser, it is still invisible to the crawler, because it is added dynamically, based on a hash fragment, by means of an XHR.

So how can you make this application work with the AJAX crawling scheme? Remember that your app needs to produce an HTML snapshot whenever it gets a request for an ugly URL, that is, a URL containing a query parameter with the name _escaped_fragment_. You can do this with a script, say, gethtmlsnapshot.php, that produces the snapshot using the PHP scripts that already exist (in this case, getdata.php) to fill in the content.

The role of gethtmlsnapshot.php is to mimic what the JavaScript would produce on your page. In our case, it mimics the JavaScript functions getResults and updatePage (as mentioned in the JavaScript snippet above). For example, you may want to use the DOM support in PHP to load the static portions of your page, then add the content that would normally be added by JavaScript. The following code snippet illustrates this idea:

  // load the static portion of the html page
  $remote = file_get_contents('index-movies.html');
  $doc = new DOMDocument();
  $doc->loadHTML($remote);

  // get the _escaped_fragment_ parameter (PHP has already URL-decoded it)
  $escapedfragment = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '';
  // NOTE: VALIDATE PARAMETERS (as always, to avoid security risks)
  $params = getParams($escapedfragment);

  // you can use the same php script that your JavaScript calls
  $dataToAdd = include('getdata.php');

  // now add it to the right place in the DOM
  $docToAdd = new DOMDocument();
  $docToAdd->loadHTML($dataToAdd['summary']);
  $newNode = $doc->importNode($docToAdd->documentElement, true);
  $doc->getElementById('load')->appendChild($newNode);

  // finally, save as HTML
  echo $doc->saveHTML();
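The snippet above leaves getParams and the validation step to the reader. As a rough illustration of what such validation might look like, here is a sketch in JavaScript (used for consistency with the client-side snippet earlier; the same checks translate directly to PHP). The fragment layout (a tab name followed by an optional q parameter), the allowed tab pattern, the 200-character limit, and the treatment of + as a space are all assumptions made for this example.

```javascript
// Parse an already URL-decoded fragment such as "tab0&q=Octopus+spotting"
// and reject anything that does not match the expected shape.
function getParams(fragment) {
  const [tab, ...rest] = fragment.split("&");
  // assumption: tab names look like "tab0", "tab1", ...
  if (!/^tab\d+$/.test(tab)) {
    throw new Error("unexpected tab: " + tab);
  }
  const params = { tab: tab, q: "" };
  for (const pair of rest) {
    const eq = pair.indexOf("=");
    const key = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? "" : pair.slice(eq + 1);
    // assumption: q is the only other parameter the app understands
    if (key !== "q") {
      throw new Error("unexpected parameter: " + key);
    }
    // assumption: the app uses '+' to stand for a space in the query
    params.q = value.replace(/\+/g, " ");
  }
  if (params.q.length > 200) {
    throw new Error("query too long");
  }
  return params;
}

console.log(getParams("tab0&q=Octopus+spotting"));
// prints { tab: 'tab0', q: 'Octopus spotting' }
```

Rejecting unknown keys outright, rather than ignoring them, keeps unexpected input from reaching the script that looks up the content.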

3. Much of your content is created on the client-side with JavaScript

If a lot of your content is created in JavaScript, you may want to consider using a headless browser to create an HTML snapshot. One option is HtmlUnit, a Java-only headless browser emulator. If your application follows a standard WAR layout, you will need to include HtmlUnit and all of its dependencies in your WEB-INF/lib directory in order to use it.

If your server-side technology is Java, you can incorporate HtmlUnit into your application with a servlet filter (remember to update your web.xml file to register the additional servlet filter). The servlet filter checks whether the URL contains an _escaped_fragment_ parameter. If so, it invokes HtmlUnit to create the HTML snapshot and returns it to the crawler. Otherwise, the filter delegates to the next servlet (or servlet filter) in the chain. HtmlUnit can be invoked with a few lines of code:


    public final class CrawlServlet implements Filter {

      public void doFilter(ServletRequest request, ServletResponse response,
          FilterChain chain) throws IOException {
        ...
        if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
          ...
          // rewrite the URL back to the original #! version
          // remember to unescape any %XX characters
          url_with_hash_fragment = rewriteQueryString(url_with_escaped_fragment);

          // use the headless browser to obtain an HTML snapshot
          final WebClient webClient = new WebClient();
          HtmlPage page = webClient.getPage(url_with_hash_fragment);

          // important!  Give the headless browser enough time to execute JavaScript
          // The exact time to wait may depend on your application.
          webClient.waitForBackgroundJavaScript(2000);

          // return the snapshot
          out.println(page.asXml());
        } else {
          try {
            // not an _escaped_fragment_ URL, so pass the request down the filter chain
            chain.doFilter(request, response);
          } catch (ServletException e) {
            System.err.println("Servlet exception caught: " + e);
            e.printStackTrace();
          }
        }
        ...
      }
    }
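The filter above relies on a rewriteQueryString helper that is not shown. As a rough sketch of what it might do, written in JavaScript for consistency with the earlier client-side snippet (the logic ports directly to Java), the function below turns an ugly URL back into its #! form. It assumes that _escaped_fragment_ is the last query parameter in the URL; the example.com address is hypothetical.

```javascript
// Rewrite "...?_escaped_fragment_=tab0%26q=Octopus%2Bspotting"
// back to "...#!tab0&q=Octopus+spotting".
function rewriteQueryString(uglyUrl) {
  const marker = "_escaped_fragment_=";
  const idx = uglyUrl.indexOf(marker);
  if (idx <= 0) {
    return uglyUrl; // not an _escaped_fragment_ URL
  }
  // the character just before the marker is the '?' or '&' introducing it
  const base = uglyUrl.slice(0, idx - 1);
  const escaped = uglyUrl.slice(idx + marker.length);
  // unescape the %XX sequences; throws URIError on malformed input
  return base + "#!" + decodeURIComponent(escaped);
}

console.log(rewriteQueryString(
  "http://example.com/movies?_escaped_fragment_=tab0%26q=Octopus%2Bspotting"));
// prints http://example.com/movies#!tab0&q=Octopus+spotting
```

Because decodeURIComponent throws on malformed escape sequences, a production version would likely catch that error and return an error response instead of invoking the headless browser.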

If your server-side technology is not Java, you may still want to use a headless browser to produce HTML snapshots. For example, PHP offers integration with Java (or try the PHP/Java Bridge), which you could use to invoke HtmlUnit. You can also use JSON to achieve interoperability between your server-side technology and HtmlUnit.
