Google Apps Script

Tutorial: Parsing an XML Document

Evin Levey, Google Apps Script Product Manager
May, 2010

Goal

This is an advanced tutorial that guides you through using Google Apps Script to parse an XML document and manipulate the data retrieved from the XML document. The tutorial uses the Oracle of Bacon web site, which contains data for determining how different actors are connected to each other by films in which they've appeared. You'll learn how to use the classes and methods of the Xml Services.

Time to Complete

Approximately 10 to 20 minutes, depending on your level of familiarity with JavaScript and XML.

Prerequisites

This tutorial will be easier if you are already familiar with the following:

  • XML and HTML principles
  • JavaScript objects

If you're not, don't worry. We'll explain these concepts as we go along.

Also, to use XML parsing, you need to know the structure of the XML content that your script will parse. If the XML data are in a flat file, you can examine the file to see the element names. If the XML data are accessible from a web browser, use the browser's View Source function or, in Google Chrome, the Inspect Element menu choice on the context menu.

Index

This tutorial contains the following sections:

  1. Setting Up the Spreadsheet and Script
  2. Running the Script
  3. How the Script Works
  4. The Full Kevin Bacon Script
  5. Summary

Setting Up the Spreadsheet and Script

  1. Open a new spreadsheet in Google Docs.
  2. Save the spreadsheet by giving it the name Kevin Bacon.
  3. Choose Tools > Script Gallery.
  4. On the Script Gallery dialog box, locate the the Six Degrees of Kevin Bacon Calculator script by typing in the following search term:

    Kevin Bacon

  5. Click Install. You see a message that the script was successfully installed.
  6. Click Close.

Running the Script

To run the script, type the following in any cell of the spreadsheet, where actor_name1 and actor_name2 are the names of any two actors, and then press Enter:

=kb("actor_name1","actor_name2")

For example:

=kb("Ellen DeGeneres","Kevin Bacon")

The same values can return different results. You might get this for Ellen DeGeneres and Kevin Bacon:

Ellen DeGeneres was in Doctor Dolittle (1998) with Pruitt Taylor Vince who was in Trapped (2002) with Kevin Bacon

Or you might get this:

Kevin Bacon was in Murder in the First (1995) with Stephen Tobolowsky who was in Trevor (1994) with Ellen DeGeneres

Or you might get some other result.

If you leave the second argument, actor_name2 , blank, the script defaults to the value Kevin Bacon.

If you enter a name for which the Oracle of Bacon has no entry or has more than one entry, you see an error message:

Cannot find actor: actor_name

Our script can't detect and make a choice when there are two or more entries at the Oracle of Bacon site, so you see the error message.

How the Script Works

The Oracle of Bacon stores information in the XML format. XML, or eXtensible Markup Language, describes data in a text format using elements to define each type of data. The elements are similar to HTML tags such as <i> and <b>, but instead of defining formatting, they define data. For example, an XML database of paintings might contain elements for artist name, artist birth date, artist death date, painting title, and year painted.

The Oracle of Bacon XML database contains information about actors and the films in which they appear. The Oracle of Bacon script accepts the actor names as input, retrieves from the Oracle of Bacon the names of the movies in which the actors appeared and the names of the other actors in the movies, and then determines the films and actors connecting the first actor to the second.

Let's step through some of the important code in the script.

The first function in the script obtains the names a user puts in the spreadsheet, then it retrieves data from the Oracle of Bacon. To see the code, go to the Kevin Bacon spreadsheet you saved in the previous section, and click Tools > Scripts > Script editor.

Here's the code that obtains the names as the arguments from and to :

function kb(from, to) {
  if (!to) {
    to = "Kevin Bacon";
  }

The following code block defines the data to retrieve from the Oracle of Bacon. The from and to arguments contain the names of the actors you are trying to connect. The payload statement contains search terms passed by the HTTP POST method to the CGI service running at the Oracle of Bacon web site.

  var parameters = {
     method : "post",
     payload :
    "a=" + encodeURIComponent(from) +
     "&b=" + encodeURIComponent(to) +
     "&u=1" +  // movies only. Use "3" to include TV
     "&p=" + encodeURIComponent('google-apps')
   };
  

The method argument specifies which HTTP request method is being used. In this case, we're making an HTTP POST request. The payload argument contains the body of the POST request. In this case, it consists of the four name-value pairs that we need to specify in the request to the Oracle of Bacon. Here's what they mean:

  • "a=" + encodeURIComponent(from) requests the first actor.
  • "&b=" + encodeURIComponent(to) requests the second actor.
  • "&u=1" requests movies either of the two actors appeared in. if you used "&u=3", the requests would include both films and television shoes.
  • "&p=" + encodeURIComponent('google-apps') tells the Oracle of Bacon that the request is coming from Google Apps.

In the following line, the script fetches the XML data from the Oracle of Bacon and puts the data in the variable text . The parameters argument passes the parameters defined in the previous code block to the Oracle of Bacon web site.

var text = UrlFetchApp.fetch("http://oracleofbacon.org/cgi-bin/xml",
                                parameters).getContentText();

Here's an example of the data returned by the Oracle of Bacon:

 <?xml version="1.0" standalone="no"?>
 <link>
   <actor>Arnold Schwarzenegger</actor>
   <movie>Total Recall (1990)</movie>
   <actor>Anne Lockhart (I)</actor>
   <movie>The Last Song (2010)</movie>
   <actor>Miley Cyrus</actor>
 </link>
 

The XML data describe each actor and film using <actor> and <movie> tags.

The function kb is followed by a test function, testBacon .

The last function in the script, the parse function, is where the XML data is parsed and the information to be displayed is extracted from the XML data. The script parses the data in the variable txt using the Xml method parse . The parse method takes XML content provided as a String and returns as XML document, called doc in this case.

function parse(txt) {
var doc = Xml.parse(txt, true);

The following code extracts three XmlElement objects from the XML document doc . The first contains the html section, the second contains the head section, and the third contains the body.

var html = doc.html;       
var head = doc.html.head;       
var body = doc.html.body;

Because the Oracle of Bacon data contains only <html> and <head> tags, the call to doc.html.body fails. There is no <body> tag. The call to doc.html.head returns XML-tagged data containing <movie> and <actor> tags, typically several of each. The following code uses explicit methods to extract the movie and actor names from doc and put them into arrays call movies and actors :

var movies = doc.html.head.getElements("movie");
var actors = doc.html.head.getElements("actor");

Finally, the following code looks for matches and displays the linkage between the two actors:

  var movieIndex = 0;
  var r = '';
  var firstPerson = true;
  for (var i in actors) {
    r = r + actors[i].getText();
    if (movies[movieIndex]) {
      r = r + (firstPerson ? "" : " who") + " was in " +
         movies[movieIndex].getText() + " with ";
    }
    movieIndex++;
    firstPerson = false;
  }
  
  return r;
  }

The Full Kevin Bacon Script

Here's the full script:

// This script makes use of the Oracle of Bacon,
// a very cool site at http://oracleofbacon.org.
// Takes two arguments: names of actors.  The second 
// is optional, and defaults to "Kevin Bacon" if unspecified.
//
// Example usage in a spreadsheet:
// =kb("Miley Cyrus", "Arnold Schwarzenegger")
// function kb(from, to) { if (!to) { to = "Kevin Bacon"; } var parameters = { method : "post", payload : "a=" + encodeURIComponent(from) + "&b=" + encodeURIComponent(to) + "&u=1" + // movies only. Use "3" to include TV "&p=" + encodeURIComponent('google-apps') }; var text = UrlFetchApp.fetch("http://oracleofbacon.org/cgi-bin/xml", parameters).getContentText(); return parse(text); } function testBacon() { var result = kb("Miley Cyrus", "Arnold Schwarzenegger"); Logger.log(result); } function parse(txt) { var doc = Xml.parse(txt, true); var attr = doc.spellcheck; if (attr) { return "Cannot find actor: " + attr.name; } var actors = doc.html.head.getElements("actor"); var movies = doc.html.head.getElements("movie"); if (!actors || actors.length ==0) { return "no match found"; } var movieIndex = 0; var r = ''; var firstPerson = true; for (var i in actors) { r = r + actors[i].getText(); if (movies[movieIndex]) { r = r + (firstPerson ? "" : " who") + " was in " + movies[movieIndex].getText() + " with "; } movieIndex++; firstPerson = false; } return r; }

Summary

Congratulations, you've completed this tutorial. You should now be able to write scripts that parse XML data and use the results in your scripts.

Authentication required

You need to be signed in with Google+ to do that.

Signing you in...

Google Developers needs your permission to do that.