DSPL Tutorial

DSPL stands for Dataset Publishing Language. Datasets described in DSPL can be imported into the Google Public Data Explorer, a tool that allows for rich, visual exploration of the data.

Note: To upload data to Google Public Data using the Public Data upload tool, you must have a Google Account.

This tutorial provides a step-by-step example of how to prepare a basic DSPL dataset.

A DSPL dataset is a bundle that contains an XML file and a set of CSV files. The CSV files are simple tables containing the data of the dataset. The XML file describes the metadata of the dataset, including informational metadata like descriptions of measures, as well as structural metadata like references between tables. The metadata lets non-expert users explore and visualize your data.

The only prerequisite for understanding this tutorial is a good level of understanding of XML. Some understanding of simple database concepts (e.g., tables, primary keys) may help, but it's not required. For reference, the completed XML file and complete dataset bundle associated with this tutorial are also available for review.

Overview

Before starting to create our dataset, here is a high-level overview of what a DSPL dataset contains:

General information: About the dataset
Concepts: Definitions of "things" that appear in the dataset (e.g., countries, unemployment rate, gender, etc.)
Slices: Combinations of concepts for which there are data
Tables: Data for concepts and slices. Concept tables hold enumerations and slice tables hold statistical data
Topics: Used to organize the concepts of the dataset in a meaningful hierarchy through labeling

To illustrate these rather abstract notions, consider the dataset (with dummy data) used throughout this tutorial: statistical time series for population and unemployment, aggregated by various combinations of country, US state, and gender.

This example dataset defines the following concepts:

country
gender
population
state
unemployment rate
year

Concepts that are categorical, such as state, are associated with concept tables, which enumerate all their possible values (California, Arizona, etc.). Concepts may have additional columns for properties such as the name or the country of a state.

Slices define each combination of concepts for which there is statistical data in the dataset. A slice contains dimensions and metrics. In the above picture, the dimensions are blue and the metrics are orange. In this example, the slice gender_country_slice has data for the metric population and the dimensions country, year and gender. Another slice, called country_slice, gives total yearly population numbers (metric) for countries.

In addition to dimensions and metrics, slices also reference tables, which contain the actual data.

Let's now walk step-by-step through the creation of such a dataset in DSPL.

Dataset Information

To get started, we need to create an XML file for our dataset. Here is the beginning of a DSPL description for our example dataset:

<?xml version="1.0" encoding="UTF-8"?>
<dspl targetNamespace="http://www.stats-bureau.com/mystats"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://schemas.google.com/dspl/2010"
   xmlns:time="http://www.google.com/publicdata/dataset/google/time"
   xmlns:geo="http://www.google.com/publicdata/dataset/google/geo"
   xmlns:entity="http://www.google.com/publicdata/dataset/google/entity"
   xmlns:quantity="http://www.google.com/publicdata/dataset/google/quantity">

  <import namespace="http://www.google.com/publicdata/dataset/google/time"/>
  <import namespace="http://www.google.com/publicdata/dataset/google/entity"/>
  <import namespace="http://www.google.com/publicdata/dataset/google/geo"/>
  <import namespace="http://www.google.com/publicdata/dataset/google/quantity"/>

  <info>
    <name>
      <value>My statistics</value>
    </name>
    <description>
      <value>Some very interesting statistics about countries</value>
    </description>
    <url>
      <value>http://www.stats-bureau.com/mystats/info.html</value>
    </url>
    </info>

  <provider>
    <name>
      <value>Bureau of Statistics</value>
    </name>
    <url>
      <value>http://www.stats-bureau.com</value>
    </url>
  </provider>
 ...
</dspl>

The dataset description starts with a top-level <dspl> element. The targetNamespace attribute contains a URI that uniquely identifies this dataset. The dataset's namespace is especially important when publishing the dataset, as it will be the global identifier of your dataset, and the means for others to refer to it.

Note that the targetNamespace attribute may be omitted. In this case a unique namespace is automatically generated when the dataset is imported.

Using information from other datasets

Datasets can reuse definitions and data from other datasets by importing those datasets. Each <import> element specifies the namespace of another dataset this dataset will reference.

In our example dataset, we will need some definitions from http://www.google.com/publicdata/dataset/google/quantity (a dataset created by Google that contains concepts useful to define numerical quantities), and from the time, entity, and geo datasets, which provide definitions related to time, entities, and geography, respectively.

The top <dspl> element provides a namespace prefix declaration (e.g., xmlns:time="http://...") for each of the imported datasets. Prefix declarations are needed to reference elements from other datasets in a concise way. For instance, time:year references the definition of year in the imported dataset whose namespace is associated with the prefix time.

Dataset and provider info

The <info> element contains general information about the dataset: name, description, and a URL where more information can be found.

The <provider> element contains information about the provider of the dataset: its name and a URL where more information can be found (generally the data provider's home page).

Defining Concepts

Now that we have provided some general information about the dataset, we're ready to start defining its contents. Our next goal is to add population statistics for countries over the past 50 years.

The first thing we need to do is provide some definitions for the notions of population, country, and year. In DSPL, these definitions are called concepts.

A concept is a definition of a type of data that appears in a dataset. The data values that correspond to a given concept are called instances of that concept.

Population

Let's start with defining the population concept. In a DSPL document, concepts are defined in a <concepts> element that comes right after the dataset and provider information.

Here is a population concept with just the minimal information required for any concept: id (a unique identifier), name and type.

<dspl ...>
  ...
  <concepts>
    <concept id="population">
      <info>
        <name>
          <value>Population</value>
        </name>
      </info>
      <type ref="integer"/>
    </concept>
  ...
  </concepts>

Here's how this sample works:

Every concept must provide an id that uniquely identifies the concept within the dataset. This means that no two concepts defined in the same dataset can have the same id.
Just like for the dataset and its provider, the <info> elements provide textual information about the concept, such as its name and description.
The <type> element specifies the data type for the instances of the concept (in other words, its "values"). In this example, the type of population is integer. DSPL supports the following data types:
- string
- integer
- float
- boolean
- date

Country

Let's now write the definition of the country concept:

<concept id="country">
  <info>
    <name><value>Country</value></name>
    <description>
      <value>My list of countries.</value>
    </description>
  </info>
  <type ref="string"/>
  <property id="name">
    <info>
      <name><value>Name</value></name>
      <description>
        <value>The official name of the country</value>
      </description>
    </info>
    <type ref="string" />
  </property>
  <table ref="countries_table" />
</concept>

The country concept definition starts like the previous one, with an id, info, and a type.

Concept values

Categorical concepts like countries have an enumeration of all possible instances. In other words, you can list all possible countries that can be referenced. But in order to do this, each country needs a unique identifier. This example uses ISO country codes to identify countries; these codes are of type string.

In this example, you don't need to use the ISO code; you could just as well use the country name. Names, however, differ by language, can change over time, and are not always consistently used across datasets. For countries, and for categorical concepts in general, it's a good practice to pick, short, stable, commonly used, and language-independent identifiers (if they exist).

Concept properties

In addition to its id, the country concept has a <property> element that specifies the name of the country. In other words, the country name ("Ireland") is a property of the country with the id IE. Properties are how DSPL provides additional structured information about the instances of a concept.

Just like the concept itself, properties have an id, info, and type.

Concept data

Finally, the country concept has a <table> element. This element references a table that enumerates the list of all countries.

Using tables makes sense for some concepts, but not for others. For instance, it doesn't make sense to enumerate all the possible values for the concept population. However, if you do reference a table for a concept, that table must contain all instances of the concept—for example, it must list every country, not just a sample few.

The dataset defines the countries_table table as follows:

...
<tables>
  <table id="countries_table">
    <column id="country" type="string"/>
    <column id="name" type="string"/>
    <data>
      <file format="csv" encoding="utf-8">countries.csv</file>
    </data>
  </table>
  ...
</tables>

The countries table specifies the columns of the table and their types, and references a CSV file that contains the data. This CSV can either be bundled and uploaded with the dataset XML or remotely accessed via HTTP, HTTPS, or FTP. In the latter cases, you would replace countries.csv with a URL, for instance http://www.myserver.com/mydata/countries.csv.

Wherever it's stored, the CSV file looks like:

country, name
AD,      Andorra
AF,      Afghanistan
AI,      Anguilla
AL,      Albania
US,      United States

The first row of the table lists the column ids, as specified in the DSPL table definition. Each of the following rows corresponds to one instance of the country concept. If the concept has a table, then the table must contain all the instances of the concept- in this case, it must list all of the countries.

The columns are mapped to the country concept and its properties based on their id. The first column's id, country, matches the concept id. This means that this column contains the unique country identifier defined by the country concept. The next column corresponds to the name property of the country concept. The values in this column match the values of the name property.

There are a few requirements for the CSV data for the concept table:

The column headings in the first line of the data file must exactly match the concept id and the property id of the concept with which the data is associated (though order may vary).
Each row must have exactly the same number of elements as the number of properties on the concept (even if the value is empty).
Each value for the concept's id field (here, the country code) must be unique and non-empty (an empty field is one with zero or only whitespace characters).
Values for properties that reference other concepts must either be empty or be a valid value of the referenced concept.
Values that contain commas, double quotes, or newline characters must be fully enclosed in double quotes.
Any literal double quote characters inside a value must be immediately preceded by another double quote.

Year

The last concept we need for our country population data is a concept to represent years. Instead of defining a new concept, we will use the year concept from one of the datasets we imported: "http://www.google.com/publicdata/dataset/google/time". To do this, we need to reference it as time:year, where time represents the dataset that is referenced, and year identifies the concept.

Canonical concepts

time:year is part of a small set of canonical concepts defined by Google. Canonical concepts provide basic definitions for time, geography, numerical quantities, units, etc.

In fact, the country concept defined above exists as a canonical concept. We only created it here for illustration purposes. Whenever possible, you should use canonical concepts in your datasets, either directly or by extending them (more on extension below). Canonical concepts make your data comparable to other datasets, and enable features for your datasets in the Public Data Explorer. For instance, animating data over time or showing geographical data on a map rely on using the time and geo canonical concepts, respectively.

First Slice

Now that we have concepts for population, country, and year, it's time to put them together!

For that, we need to create a slice that combines them. In DSPL, a slice is a combination of concepts for which data exists.

Why not just create a table with the right columns? Because slices capture the dataset's information in terms of its concepts. This will become clearer as we create more pieces of our dataset.

Slices appear in the DSPL file under a <slices> element, which must appear right after the concepts section.

<slices>
  <slice id="countries_slice">
    <dimension concept="country"/>
    <dimension concept="time:year"/>
    <metric concept="population"/>
    <table ref="countries_slice_table"/>
  </slice>
</slices>

Just like concepts, each slice has an id (countries_slice) that uniquely identifies the slice within the dataset.

A slice contains two kinds of concept references: Dimensions and metrics. The values of metrics vary with the values of dimensions. Here, the value of population (the metric) varies by the dimensions country and year.

Just like concepts, slices include a reference to a table that contains the data of the slice. The referenced table must have one column for each dimension and metric of the slice. Just as for concepts, the slice's dimensions and metrics are mapped to the table columns with the same ids.

Slice table

The table for our population slice appears in the tables section of the DSPL file:

<tables>
  ...
  <table id="countries_slice_table">
    <column id="country" type="string"/>
    <column id="year" type="date" format="yyyy"/>
    <column id="population" type="integer"/>
    <data>
      <file format="csv" encoding="utf-8">country_slice.csv</file>
    </data>
  </table>
  ...
</tables>

Note that the year column comes with a format attribute that specifies how years are formatted. Supported date formats are those defined by the Joda DateTime format.

The countries_slice table specifies the columns of the table and their types, and points to a CSV file that contains the data. The CSV file looks like this:

country, year,  population
AF,      1960,  9616353
AF,      1961,  9799379
AF,      1962,  9989846
AF,      1963,  10188299
...

Each row of the data table contains a unique combination of the dimensions country and year, along with the corresponding value of the population metric (for example, the population - metric - of Afghanistan in 1960 - dimensions).

Note that the values in the country column match the value/identifier of the country concept, which is the ISO 3166 2-letter code of the country.

The CSV data for a slice must satisfy the following constraints:

Each value of a dimension field (such as country and year) must be non-empty. Values for metric fields (such as population) can be empty. An empty value is represented by no characters.
Each value of a dimension field that references a concept must be present in that concept's data. For example, the value AF must be present in the country concept data table.
Each unique combination of dimension values, e.g. AF, 2000, may occur only once.
Data should be sorted by the non-time dimension columns (in any order), and then, optionally, by any of the other columns. So, for instance, in a table with the columns [date, dimension1, dimension2, metric1, metric2], you can sort by dimension1, then dimension2, then date, but not by date and then the dimensions.

Summary

At this point, we have enough in our DSPL to describe the country population data. To recap, what we had to do was:

Create the DSPL header and description of the dataset and its provider
Create one concept for population and another one for country, with a csv file enumerating all countries and their names.
Create a slice with our population numbers for countries over time, referencing the already-defined year concept in the imported time-dataset from Google.

In the remainder of this tutorial, we will make our dataset richer by adding more dimensions in more slices, as well as more metrics grouped by topic.

Adding a Dimension: US States

Let's now enrich our dataset by adding population data for states in the US. We first need to define a concept for states. This looks very much like the country concept that we defined before.

<concept id="state" extends="geo:location">
  <info>
    <name>
      <value>state</value>
    </name>
    <description>
      <value>US states, identified by their two-letter code.</value>
    </description>
  </info>
  <property concept="country" isParent="true" />
  <table ref="states_table"/>
</concept>

Concept extensions and property references

The state concept introduces several new features of DSPL.

First, state extends another concept, geo:location (defined in the external geo dataset we imported at the beginning of our dataset). Semantically, this means that state is a kind of geo:location. A consequence is that it inherits all the attributes and properties of geo:location. In particular, location defines properties for latitude and longitude; by extending the former concept, these properties are applied to state as well. Moreover, since location inherits from entity:entity, state also gets all of the latter's properties, including name, description, and info_url.

Note: The country concept defined previously should have, technically, also extended from geo:location. This point was omitted before for simplicity; we've included the location to country inheritance, however, in the final XML file.

Note: You can use the extends construct in your own datasets to reuse information defined by other datasets. Using extends requires that all the instances of your concept are valid instances of the concept you are extending. Extensions let you add additional properties and attributes, and restrict the set of instances to a subset of the instances of the extended concept.

In addition to inheritance, the state property also introduces the idea of concept references. In particular, the state concept has a property called country, which references the country concept we created above. This is done by using a concept attribute. Note that this property does not provide an id, only a concept reference. This is equivalent to creating an id with the same value as the id of the referenced concept (i.e., country in this example). The hierarchical relationship between state and county is captured by having an attribute isParent="true" on the reference. In general, dimensions with hierarchical relationships, such as geographies, should be represented this way, with the child concept having a property that references the parent concept using the isParent attribute.

The table definition for states looks like this:

<tables>
  ...
  <table id="states_table">
    <column id="state" type="string"/>
    <column id="name" type="string"/>
    <column id="country" type="string">
      <value>US</value>
    </column>
    <column id="latitude" type="float"/>
    <column id="longitude" type="float"/>
    <data>
      <file format="csv" encoding="utf-8">states.csv</file>
    </data>
  </table>
  ...
</tables>

The country column has a constant value for all states. Specifying it in the DSPL avoids repeating that value for every state in the data. Also note that we've included columns for name, latitude, and longitude since state has inherited these properties from geo:location. On the other hand, some inherited properties (e.g., description) do not have columns; this is okay- if a property is omitted from a concept definition table, then its value is assumed to be undefined for every instance of the concept.

The CSV file looks like this:

state, name,        latitude,  longitude
AL,    Alabama,     32.318231, -86.902298
AK,    Alaska,      63.588753, -154.493062
AR,    Arkansas,    35.20105,  -91.831833
AZ,    Arizona,     34.048928, -111.093731
CA,    California,  36.778261, -119.417932
CO,    Colorado,    39.550051, -105.782067
CT,    Connecticut, 41.603221, -73.087749
...

Since we already have concepts for population and year, we can re-use them to define a new slice for states population.

<slices>
  <slice id="states_slice">
    <dimension concept="state"/>
    <dimension concept="time:year"/>
    <metric concept="population"/>
    <table ref="states_slice_table"/>
  </slice>
</slices>

The data table definition looks like this:

<tables>
  ...
  <table id="states_slice_table">
    <column id="state" type="string"/>
    <column id="year" type="date" format="yyyy"/>
    <column id="population" type="integer"/>
    <file format="csv" encoding="utf-8">state_slice.csv</file>
  </table>
  ...
</tables>

And the CSV file looks like this:

state, year, population
AL,    1960, 9616353
AL,    1961, 9799379
AL,    1962, 9989846
AL,    1963, 10188299

Wait, why did we create a new slice, instead of adding another dimension to the previous one?

A slice with dimensions for both state and country would not be correct, because some rows would be for country data and some rows would be for state data. The table would have "holes" for some dimensions, which is not allowed (remember that missing values are allowed only for metrics and not dimensions).

Dimensions act as a "primary key" for the slice. This means that every data row must have values for all the dimensions and no two data rows can have the exact same values for all dimensions.

Adding a Metric: Unemployment Rate

Let's now add another metric to our dataset:

<concept id="unemployment_rate" extends="quantity:rate">
  <info>
    <name>
      <value>Unemployment rate</value>
    </name>
    <description>
      <value>The percent of the labor force that is unemployed.</value>
    </description>
    <url>
      <value>http://www.bls.gov/cps/cps_htgm.htm</value>
    </url>
  </info>
  <type ref="float/>
  <attribute id="is_percentage">
    <type ref="boolean"/>
    <value>true</value>
  </attribute>
</concept>

The info section of this metric has a name, description and a URL (linking to the US Bureau of Labor Statistics).

This concept also extends the quantity:rate canonical concept. The quantity dataset defines core concepts for representing numerical quantities. In your dataset, you should create your numerical concepts by extending the appropriate quantity concept. Thus, the population concept defined above should have, technically, been extended from quantity:amount.

Concept attributes

This concept also introduces the construct of an attribute. In this example, an attribute is used to say that unemployment_rate is a percentage. The is_percentage attribute is inherited from the quantity:rate concept that this concept extends. This information is used by the Public Data Explorer to show percent signs when visualizing the data.

Attributes provide a general mechanism to attach key/value pairs to a concept (by contrast with properties, which associate additional values with instances of a concept). Just like concepts and properties, attributes have an id, info, and a type. Like properties, they can reference other concepts.

Attributes are not just for predefined general things, such as numerical properties. You can define your own attributes for your concepts.

Adding Unemployment Rate Data for US States

We are now ready to add unemployment rate data for US states. Because unemployment rate is a metric and we already have population data for states, we can just add it to the slice we already created for the state and year dimensions:

<slices>
  ...
  <slice id="states_slice">
    <dimension concept="state"/>
    <dimension concept="time:year"/>
    <metric concept="population"/>
    <metric concept="unemployment_rate"/>
    <table ref="states_slice_table"/>
  </slice>
  ...
</slices>

... and add another column to the table definition:

<tables>
  ...
  <table id="states_slice_table">
    <column id="state" type="string"/>
    <column id="year" type="date" format="yyyy"/>
    <column id="population" type="integer"/>
    <column id="unemployment_rate" type="float"/>
    <data>
      <file format="csv" encoding="utf-8">state_slice.csv</file>
    </data>
  </table>
  ...
</tables>

... and to the CSV file:

state, year, population, unemployment_rate
AL,    1960, 9616353,    5.1
AL,    1961, 9799379,    5.2
AL,    1962, 9989846,    4.8
AL,    1963, 10188299,   6.9

We said previously that for every slice, the dimensions form a primary key for the slice. In addition, each dataset can only contain one slice for a given combination of dimensions. All the metrics that are available for these dimensions must belong to that same slice.

More Dimensions: Population Breakdown by Gender

Let's enrich our dataset with a breakdown of population by gender for countries. By now, you're starting to know the drill... We first need to add a concept for gender:

<concept id="gender" extends="entity:entity">
  <info>
    <name>
      <value>Gender</value>
    </name>
    <description>
      <value>Gender, Male or Female</value>
    </description>
    <pluralName>
      <value>Genders</value>
    </pluralName>
    <totalName>
      <value>Both genders</value>
    </totalName>
  </info>
  <type ref="string"/>
  <table ref="genders_table"/>
</concept>

The gender concept info section has a pluralName, which provides the text to be used to refer to multiple instances of the gender concept. The info section also includes a totalName, which provides the text to be used to refer to all instances of the gender concept as a whole. Both of these are used by the Public Data Explorer to display information related to the gender concept. In general, you should provide them for concepts that can be used as dimensions.

Note that the gender concept is also extending from entity:entity. This is a good practice for concepts that are used as dimensions, as it allows you to add custom names, URLs, and colors for the various concept instances.

The gender concept refers to the genders_table table, which contains the possible values for gender and their display names (omitted here).

To add population by gender to our dataset, we need to create a new slice (remember: each available combination of dimensions corresponds to a slice in the dataset).

<slice id="countries_gender_slice">
  <dimension concept="country"/>
  <dimension concept="gender"/>
  <dimension concept="time:year"/>
  <metric concept="population"/>
  <table ref="countries_gender_slice_table"/>
</slice>

The table definition for the slice looks like this:

<table id="countries_gender_slice_table">
  <column id="country" type="string"/>
  <column id="gender" type="string"/>
  <column id="year" type="date" format="yyyy"/>
  <column id="population" type="integer"/>
  <data>
    <file format="csv" encoding="utf-8">gender_country_slice.csv</file>
  </data>
</table>

The CSV file for the table looks like this:

country, gender, year, population
AF,      M,      1960, 4808176
AF,      F,      1960, 4808177
AF,      M,      1961, 4899689
AF,      F,      1961, 4899690...

Compared to the previous countries, population, and unemployment slice, this one has an additional dimension; each value of the population metric corresponds not only to a particular country and year, but also to a particular gender.

Note that we have created a "sparse" dataset. Not all metrics are available for all dimensions: population is available for countries and US states, on a yearly basis, while unemployment rate is only available for countries. The breakdown by gender is available for population by country only; it is not available for the unemployment rate metric, and not for the state dimension. Sparsity can also exist at the data level, with certain metrics not having values for certain dimension values, but that's not represented in DSPL.

Topics

The last feature of DSPL we will use in our dataset is topics. Topics are used to classify concepts hierarchically, and are used by applications to help users navigate to your data.

In the DSPL file, topics appear right before concepts. Here is a sample topic hierarchy:

<dspl ... >
  ...
  <topics>
    <topic id="geography">
      <info>
        <name>
          <value>Geography</value>
        </name>
      </info>
    </topic>
    <topic id="social_indicators">
      <info>
        <name>
          <value>Social indicators</value>
        </name>
      </info>
    </topic>
    <topic id="population_indicators">
      <info>
        <name>
          <value>Population indicators</value>
        </name>
      </info>
    </topic>
    <topic id="poverty_and_income">
      <info>
        <name>
          <value>Poverty & income</value>
        </name>
      </info>
    </topic>
    <topic id="health">
      <info>
        <name>
          <value>Health</value>
        </name>
      </info>
    </topic>
  </topics>

You can nest topics as deeply as necessary.

To use topics, you just need to reference them from the concept definition, as follows:

<concept id="population">
  <info>
    <name>
      <value>Population</value>
    </name>
    <description>
      <value>Size of the resident population.</value>
    </description>
  </info>
  <topic ref="population_indicators"/>
  <type ref="integer"/>
</concept>

A concept may reference more than one topic.

Submitting Your Dataset

Now that you've created your dataset, the next step is to zip it and upload the zip file to the Google Public Data Explorer tool. If you encounter any problems, check the FAQ, which includes a discussion of the most common uploading issues.

For reference, you can also download the complete XML file and complete dataset bundle associated with this tutorial.

Where to Go From Here

Congratulations on creating your first DSPL dataset! Now that you understand the basics, we recommend reading through the Developer Guide, which, among other things, documents "advanced" DSPL features like multi-language support and mappable concepts.

You may also want to look at some more example datasets.