DSPL stands for Dataset Publishing Language. Datasets described in DSPL can be imported into the Google Public Data Explorer, a tool that allows for rich, visual exploration of the data.
Note: To upload data to Google Public Data using the Public Data upload tool, you must have a Google Account.
This tutorial provides a step-by-step example of how to prepare a basic DSPL dataset.
A DSPL dataset is a bundle that contains an XML file and a set of CSV files. The CSV files are simple tables containing the data of the dataset. The XML file describes the metadata of the dataset, including informational metadata like descriptions of measures, as well as structural metadata like references between tables. The metadata lets non-expert users explore and visualize your data.
The only prerequisite for understanding this tutorial is a good level of understanding of XML. Some understanding of simple database concepts (e.g., tables, primary keys) may help, but it's not required. For reference, the completed XML file and complete dataset bundle associated with this tutorial are also available for review.
Overview
Before starting to create our dataset, here is a high-level overview of what a DSPL dataset contains:
- General information: About the dataset
- Concepts: Definitions of "things" that appear in the dataset (e.g., countries, unemployment rate, gender, etc.)
- Slices: Combinations of concepts for which there are data
- Tables: Data for concepts and slices. Concept tables hold enumerations and slice tables hold statistical data
- Topics: Used to organize the concepts of the dataset in a meaningful hierarchy through labeling
To illustrate these rather abstract notions, consider the dataset (with dummy data) used throughout this tutorial: statistical time series for population and unemployment, aggregated by various combinations of country, US state, and gender.
This example dataset defines the following concepts:
- country
- gender
- population
- state
- unemployment rate
- year
Concepts that are categorical, such as state, are associated with concept tables, which enumerate all their possible values (California, Arizona, etc.). Concepts may have additional columns for properties such as the name or the country of a state.
Slices define each combination of concepts for which there is
statistical data in the dataset. A slice contains dimensions and
metrics. In the above picture, the dimensions are blue and the
metrics are orange. In this example, the slice
gender_country_slice
has data for the metric
population
and the dimensions country
,
year
and gender
. Another slice, called
country_slice
, gives total yearly population numbers (metric) for
countries.
In addition to dimensions and metrics, slices also reference tables, which contain the actual data.
Let's now walk step-by-step through the creation of such a dataset in DSPL.
Dataset Information
To get started, we need to create an XML file for our dataset. Here is the beginning of a DSPL description for our example dataset:
<?xml version="1.0" encoding="UTF-8"?> <dspl targetNamespace="http://www.stats-bureau.com/mystats" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.google.com/dspl/2010" xmlns:time="http://www.google.com/publicdata/dataset/google/time" xmlns:geo="http://www.google.com/publicdata/dataset/google/geo" xmlns:entity="http://www.google.com/publicdata/dataset/google/entity" xmlns:quantity="http://www.google.com/publicdata/dataset/google/quantity"> <import namespace="http://www.google.com/publicdata/dataset/google/time"/> <import namespace="http://www.google.com/publicdata/dataset/google/entity"/> <import namespace="http://www.google.com/publicdata/dataset/google/geo"/> <import namespace="http://www.google.com/publicdata/dataset/google/quantity"/> <info> <name> <value>My statistics</value> </name> <description> <value>Some very interesting statistics about countries</value> </description> <url> <value>http://www.stats-bureau.com/mystats/info.html</value> </url> </info> <provider> <name> <value>Bureau of Statistics</value> </name> <url> <value>http://www.stats-bureau.com</value> </url> </provider> ... </dspl>
The dataset description starts with a top-level <dspl>
element. The targetNamespace
attribute contains a URI that
uniquely identifies this dataset. The dataset's namespace is especially
important when publishing the dataset, as it will be the global identifier of
your dataset, and the means for others to refer to it.
Note that the targetNamespace
attribute may be omitted. In
this case a unique namespace is automatically generated when the dataset is
imported.
Using information from other datasets
Datasets can reuse definitions and data from other datasets by importing
those datasets. Each <import>
element specifies the
namespace of another dataset this dataset will reference.
In our example dataset, we will need some definitions from http://www.google.com/publicdata/dataset/google/quantity (a dataset created by Google that contains concepts useful to define numerical quantities), and from the time, entity, and geo datasets, which provide definitions related to time, entities, and geography, respectively.
The top <dspl>
element provides a namespace prefix
declaration (e.g., xmlns:time="http://..."
) for each
of the imported datasets. Prefix declarations are needed to reference
elements from other datasets in a concise way. For instance,
time:year
references the definition of year
in the
imported dataset whose namespace is associated with the prefix
time
.
Dataset and provider info
The <info>
element contains general information about
the dataset: name, description, and a URL where more information can be
found.
The <provider>
element contains information about the
provider of the dataset: its name and a URL where more information can be
found (generally the data provider's home page).
Defining Concepts
Now that we have provided some general information about the dataset, we're ready to start defining its contents. Our next goal is to add population statistics for countries over the past 50 years.
The first thing we need to do is provide some definitions for the notions of population, country, and year. In DSPL, these definitions are called concepts.
A concept is a definition of a type of data that appears in a dataset. The data values that correspond to a given concept are called instances of that concept.
Population
Let's start with defining the population concept. In a
DSPL document, concepts are defined in a <concepts>
element that comes right after the dataset and provider information.
Here is a population concept with just the minimal information required
for any concept: id
(a unique identifier), name
and
type
.
<dspl ...> ... <concepts> <concept id="population"> <info> <name> <value>Population</value> </name> </info> <type ref="integer"/> </concept> ... </concepts>
Here's how this sample works:
- Every concept must provide an
id
that uniquely identifies the concept within the dataset. This means that no two concepts defined in the same dataset can have the same id. - Just like for the dataset and its provider, the
<info>
elements provide textual information about the concept, such as its name and description. - The
<type>
element specifies the data type for the instances of the concept (in other words, its "values"). In this example, the type ofpopulation
isinteger
. DSPL supports the following data types:string
integer
float
boolean
date
Country
Let's now write the definition of the country concept:
<concept id="country"> <info> <name><value>Country</value></name> <description> <value>My list of countries.</value> </description> </info> <type ref="string"/> <property id="name"> <info> <name><value>Name</value></name> <description> <value>The official name of the country</value> </description> </info> <type ref="string" /> </property> <table ref="countries_table" /> </concept>
The country concept definition starts like the previous one,
with an id
, info
, and a type
.
Concept values
Categorical concepts like countries have an enumeration of all possible
instances. In other words, you can list all possible countries that can be
referenced. But in order to do this, each country needs a unique identifier.
This example uses
ISO country codes to identify countries; these codes are
of type string
.
In this example, you don't need to use the ISO code; you could just as well use the country name. Names, however, differ by language, can change over time, and are not always consistently used across datasets. For countries, and for categorical concepts in general, it's a good practice to pick, short, stable, commonly used, and language-independent identifiers (if they exist).
Concept properties
In addition to its id
, the country concept has a
<property>
element that specifies the name of the country.
In other words, the country name ("Ireland") is a property
of the country with the id
IE. Properties are how DSPL provides
additional structured information about the instances of a concept.
Just like the concept itself, properties have an id
,
info
, and type
.
Concept data
Finally, the country concept has a <table>
element.
This element references a table that enumerates the list of all
countries.
Using tables makes sense for some concepts, but not for others. For instance, it doesn't make sense to enumerate all the possible values for the concept population. However, if you do reference a table for a concept, that table must contain all instances of the concept—for example, it must list every country, not just a sample few.
The dataset defines the countries_table
table as follows:
... <tables> <table id="countries_table"> <column id="country" type="string"/> <column id="name" type="string"/> <data> <file format="csv" encoding="utf-8">countries.csv</file> </data> </table> ... </tables>
The countries table specifies the columns of the table and their types,
and references a CSV file that contains the data. This CSV can either be
bundled and uploaded with the dataset XML or remotely accessed via HTTP, HTTPS,
or FTP. In the latter cases, you would replace countries.csv
with
a URL, for instance http://www.myserver.com/mydata/countries.csv
.
Wherever it's stored, the CSV file looks like:
country, name AD, Andorra AF, Afghanistan AI, Anguilla AL, Albania US, United States
The first row of the table lists the column ids, as specified in the DSPL
table
definition. Each of the following rows corresponds to one
instance of the country concept. If the concept has a table, then
the table must contain all the instances of the concept- in this
case, it must list all of the countries.
The columns are mapped to the country concept and its properties based on
their id. The first column's id, country
, matches the concept
id. This means that this column contains the unique country identifier
defined by the country concept. The next column corresponds to
the name
property of the country concept. The values
in this column match the values of the name
property.
There are a few requirements for the CSV data for the concept table:
- The column headings in the first line of the data file must
exactly match the concept
id
and the propertyid
of the concept with which the data is associated (though order may vary). - Each row must have exactly the same number of elements as the number of properties on the concept (even if the value is empty).
- Each value for the concept's
id
field (here, the country code) must be unique and non-empty (an empty field is one with zero or only whitespace characters). - Values for properties that reference other concepts must either be empty or be a valid value of the referenced concept.
- Values that contain commas, double quotes, or newline characters must be fully enclosed in double quotes.
- Any literal double quote characters inside a value must be immediately preceded by another double quote.
Year
The last concept we need for our country population data is a concept to
represent years. Instead of defining a new concept, we will use the
year concept from one of the datasets we imported:
"http://www.google.com/publicdata/dataset/google/time". To do this,
we need to reference it as time:year
, where time
represents the dataset that is referenced, and year
identifies
the concept.
Canonical concepts
time:year
is part of a small set of canonical concepts
defined by Google. Canonical concepts provide basic definitions for time,
geography, numerical quantities, units, etc.
In fact, the country concept defined above exists as a
canonical concept. We only created it here for illustration purposes.
Whenever possible, you should use canonical concepts in your datasets, either
directly or by extending them (more on extension below). Canonical concepts
make your data comparable to other datasets, and enable features for your
datasets in the Public Data Explorer. For instance, animating data over time
or showing geographical data on a map rely on using the time
and
geo
canonical concepts, respectively.
First Slice
Now that we have concepts for population, country, and year, it's time to put them together!
For that, we need to create a slice that combines them. In DSPL, a slice is a combination of concepts for which data exists.
Why not just create a table with the right columns? Because slices capture the dataset's information in terms of its concepts. This will become clearer as we create more pieces of our dataset.
Slices appear in the DSPL file under a <slices>
element, which must appear right after the concepts
section.
<slices> <slice id="countries_slice"> <dimension concept="country"/> <dimension concept="time:year"/> <metric concept="population"/> <table ref="countries_slice_table"/> </slice> </slices>
Just like concepts, each slice has an id
(countries_slice
) that uniquely identifies the slice within the
dataset.
A slice contains two kinds of concept references: Dimensions and
metrics. The values of metrics vary with the values of
dimensions. Here, the value of population
(the metric) varies by
the dimensions country
and year
.
Just like concepts, slices include a reference to a table that contains the data of the slice. The referenced table must have one column for each dimension and metric of the slice. Just as for concepts, the slice's dimensions and metrics are mapped to the table columns with the same ids.
Slice table
The table for our population slice appears in the tables
section of the DSPL file:
<tables> ... <table id="countries_slice_table"> <column id="country" type="string"/> <column id="year" type="date" format="yyyy"/> <column id="population" type="integer"/> <data> <file format="csv" encoding="utf-8">country_slice.csv</file> </data> </table> ... </tables>
Note that the year
column comes with a format
attribute that specifies how years are formatted. Supported date formats are
those defined by the Joda DateTime format.
The countries_slice
table specifies the columns of the table and
their types, and points to a CSV file that contains the data. The CSV file
looks like this:
country, year, population AF, 1960, 9616353 AF, 1961, 9799379 AF, 1962, 9989846 AF, 1963, 10188299 ...
Each row of the data table contains a unique combination of the dimensions
country
and year
, along with the corresponding value
of the population
metric (for example, the population -
metric - of Afghanistan in 1960 - dimensions).
Note that the values in the country
column match the
value/identifier of the country
concept, which is the ISO 3166
2-letter code of the country.
The CSV data for a slice must satisfy the following constraints:
- Each value of a dimension field (such as
country
andyear
) must be non-empty. Values for metric fields (such aspopulation
) can be empty. An empty value is represented by no characters. - Each value of a dimension field that references a concept must be
present in that concept's data. For example, the value
AF
must be present in thecountry
concept data table. - Each unique combination of dimension values, e.g.
AF, 2000
, may occur only once. - Data should be sorted by the non-time dimension columns (in any order),
and then, optionally, by any of the other columns. So, for instance,
in a table with the columns
[date, dimension1, dimension2, metric1, metric2]
, you can sort bydimension1
, thendimension2
, thendate
, but not bydate
and then the dimensions.
Summary
At this point, we have enough in our DSPL to describe the country population data. To recap, what we had to do was:
- Create the DSPL header and description of the dataset and its provider
- Create one concept for population and another one for country, with a csv file enumerating all countries and their names.
- Create a slice with our population numbers for countries over time, referencing the already-defined year concept in the imported time-dataset from Google.
In the remainder of this tutorial, we will make our dataset richer by adding more dimensions in more slices, as well as more metrics grouped by topic.
Adding a Dimension: US States
Let's now enrich our dataset by adding population data for states in the US. We first need to define a concept for states. This looks very much like the country concept that we defined before.
<concept id="state" extends="geo:location"> <info> <name> <value>state</value> </name> <description> <value>US states, identified by their two-letter code.</value> </description> </info> <property concept="country" isParent="true" /> <table ref="states_table"/> </concept>
Concept extensions and property references
The state concept introduces several new features of DSPL.
First, state extends another concept,
geo:location
(defined in the external geo dataset we
imported at the beginning of our dataset). Semantically, this means that
state
is a kind of geo:location
. A consequence is
that it inherits all the attributes and properties of
geo:location
. In particular, location defines properties for
latitude
and longitude
; by extending the former
concept, these properties are applied to state as well. Moreover, since
location inherits from entity:entity
, state also gets
all of the latter's properties, including name
,
description
, and info_url
.
Note: The country concept defined previously
should have, technically, also extended from geo:location
.
This point was omitted before for simplicity; we've included the
location to country inheritance, however, in the
final XML file.
Note: You can use the extends
construct in your own datasets to reuse information defined by other datasets.
Using extends
requires that all the instances of your concept are
valid instances of the concept you are extending. Extensions let you add
additional properties and attributes, and restrict the set of instances to a
subset of the instances of the extended concept.
In addition to inheritance, the state property also introduces the
idea of concept references.
In particular, the state concept has a property called country
,
which references the country concept we created above. This is done by
using a concept
attribute. Note that this property does not
provide an id, only a concept reference. This is equivalent to creating an id
with the same value as the id of the referenced concept (i.e.,
country
in this example). The hierarchical relationship between
state and county is captured by having an attribute
isParent="true"
on the reference. In general,
dimensions with hierarchical relationships, such as geographies, should be
represented this way, with the child concept having a property that
references the parent concept using the isParent
attribute.
The table definition for states looks like this:
<tables> ... <table id="states_table"> <column id="state" type="string"/> <column id="name" type="string"/> <column id="country" type="string"> <value>US</value> </column> <column id="latitude" type="float"/> <column id="longitude" type="float"/> <data> <file format="csv" encoding="utf-8">states.csv</file> </data> </table> ... </tables>
The country column has a constant value for all states. Specifying it in
the DSPL avoids repeating that value for every state in the data. Also note
that we've included columns for name
, latitude
, and
longitude
since state has inherited these properties from
geo:location
. On the other hand, some inherited properties
(e.g., description
) do not have columns; this is okay-
if a property is omitted from a concept definition table, then its value is
assumed to be undefined for every instance of the concept.
The CSV file looks like this:
state, name, latitude, longitude AL, Alabama, 32.318231, -86.902298 AK, Alaska, 63.588753, -154.493062 AR, Arkansas, 35.20105, -91.831833 AZ, Arizona, 34.048928, -111.093731 CA, California, 36.778261, -119.417932 CO, Colorado, 39.550051, -105.782067 CT, Connecticut, 41.603221, -73.087749 ...
Since we already have concepts for population and year, we can re-use them to define a new slice for states population.
<slices> <slice id="states_slice"> <dimension concept="state"/> <dimension concept="time:year"/> <metric concept="population"/> <table ref="states_slice_table"/> </slice> </slices>
The data table definition looks like this:
<tables> ... <table id="states_slice_table"> <column id="state" type="string"/> <column id="year" type="date" format="yyyy"/> <column id="population" type="integer"/> <file format="csv" encoding="utf-8">state_slice.csv</file> </table> ... </tables>
And the CSV file looks like this:
state, year, population AL, 1960, 9616353 AL, 1961, 9799379 AL, 1962, 9989846 AL, 1963, 10188299
Wait, why did we create a new slice, instead of adding another dimension to the previous one?
A slice with dimensions for both state and country would not be correct, because some rows would be for country data and some rows would be for state data. The table would have "holes" for some dimensions, which is not allowed (remember that missing values are allowed only for metrics and not dimensions).
Dimensions act as a "primary key" for the slice. This means that every data row must have values for all the dimensions and no two data rows can have the exact same values for all dimensions.
Adding a Metric: Unemployment Rate
Let's now add another metric to our dataset:
<concept id="unemployment_rate" extends="quantity:rate"> <info> <name> <value>Unemployment rate</value> </name> <description> <value>The percent of the labor force that is unemployed.</value> </description> <url> <value>http://www.bls.gov/cps/cps_htgm.htm</value> </url> </info> <type ref="float/> <attribute id="is_percentage"> <type ref="boolean"/> <value>true</value> </attribute> </concept>
The info
section of this metric has a name, description and a
URL (linking to the US Bureau of Labor Statistics).
This concept also extends the quantity:rate
canonical concept.
The quantity
dataset defines core concepts for representing numerical quantities. In
your dataset, you should create your numerical concepts by extending the
appropriate quantity concept. Thus, the population
concept
defined above should have, technically, been extended from
quantity:amount
.
Concept attributes
This concept also introduces the construct of an attribute. In
this example, an attribute is used to say that unemployment_rate
is a percentage. The is_percentage
attribute is inherited from
the quantity:rate
concept that this concept extends. This
information is used by the Public Data Explorer to show percent signs when
visualizing the data.
Attributes provide a general mechanism to attach key/value pairs to a
concept (by contrast with properties, which associate additional values with
instances of a concept). Just like concepts and properties,
attributes have an id
, info
, and a
type
. Like properties, they can reference other concepts.
Attributes are not just for predefined general things, such as numerical properties. You can define your own attributes for your concepts.
Adding Unemployment Rate Data for US States
We are now ready to add unemployment rate data for US states. Because unemployment rate is a metric and we already have population data for states, we can just add it to the slice we already created for the state and year dimensions:
<slices> ... <slice id="states_slice"> <dimension concept="state"/> <dimension concept="time:year"/> <metric concept="population"/> <metric concept="unemployment_rate"/> <table ref="states_slice_table"/> </slice> ... </slices>
... and add another column to the table definition:
<tables> ... <table id="states_slice_table"> <column id="state" type="string"/> <column id="year" type="date" format="yyyy"/> <column id="population" type="integer"/> <column id="unemployment_rate" type="float"/> <data> <file format="csv" encoding="utf-8">state_slice.csv</file> </data> </table> ... </tables>
... and to the CSV file:
state, year, population, unemployment_rate AL, 1960, 9616353, 5.1 AL, 1961, 9799379, 5.2 AL, 1962, 9989846, 4.8 AL, 1963, 10188299, 6.9
We said previously that for every slice, the dimensions form a primary key for the slice. In addition, each dataset can only contain one slice for a given combination of dimensions. All the metrics that are available for these dimensions must belong to that same slice.
More Dimensions: Population Breakdown by Gender
Let's enrich our dataset with a breakdown of population by gender for countries. By now, you're starting to know the drill... We first need to add a concept for gender:
<concept id="gender" extends="entity:entity"> <info> <name> <value>Gender</value> </name> <description> <value>Gender, Male or Female</value> </description> <pluralName> <value>Genders</value> </pluralName> <totalName> <value>Both genders</value> </totalName> </info> <type ref="string"/> <table ref="genders_table"/> </concept>
The gender concept info
section has a
pluralName
, which provides the text to be used to refer to
multiple instances of the gender concept. The info
section also
includes a totalName
, which provides the text to be used to
refer to all instances of the gender concept as a whole. Both of these are
used by the Public Data Explorer to display information related to the gender
concept. In general, you should provide them for concepts that can be used as
dimensions.
Note that the gender concept is also extending from
entity:entity
. This is a good practice for concepts
that are used as dimensions, as it allows you to add custom names,
URLs, and colors for the various concept instances.
The gender concept refers to the genders_table
table, which
contains the possible values for gender and their display names
(omitted here).
To add population by gender to our dataset, we need to create a new slice (remember: each available combination of dimensions corresponds to a slice in the dataset).
<slice id="countries_gender_slice"> <dimension concept="country"/> <dimension concept="gender"/> <dimension concept="time:year"/> <metric concept="population"/> <table ref="countries_gender_slice_table"/> </slice>
The table definition for the slice looks like this:
<table id="countries_gender_slice_table"> <column id="country" type="string"/> <column id="gender" type="string"/> <column id="year" type="date" format="yyyy"/> <column id="population" type="integer"/> <data> <file format="csv" encoding="utf-8">gender_country_slice.csv</file> </data> </table>
The CSV file for the table looks like this:
country, gender, year, population AF, M, 1960, 4808176 AF, F, 1960, 4808177 AF, M, 1961, 4899689 AF, F, 1961, 4899690...
Compared to the previous countries, population, and unemployment slice, this one has an additional dimension; each value of the population metric corresponds not only to a particular country and year, but also to a particular gender.
Note that we have created a "sparse" dataset. Not all metrics are available for all dimensions: population is available for countries and US states, on a yearly basis, while unemployment rate is only available for countries. The breakdown by gender is available for population by country only; it is not available for the unemployment rate metric, and not for the state dimension. Sparsity can also exist at the data level, with certain metrics not having values for certain dimension values, but that's not represented in DSPL.
Topics
The last feature of DSPL we will use in our dataset is topics. Topics are used to classify concepts hierarchically, and are used by applications to help users navigate to your data.
In the DSPL file, topics appear right before concepts. Here is a sample topic hierarchy:
<dspl ... > ... <topics> <topic id="geography"> <info> <name> <value>Geography</value> </name> </info> </topic> <topic id="social_indicators"> <info> <name> <value>Social indicators</value> </name> </info> </topic> <topic id="population_indicators"> <info> <name> <value>Population indicators</value> </name> </info> </topic> <topic id="poverty_and_income"> <info> <name> <value>Poverty & income</value> </name> </info> </topic> <topic id="health"> <info> <name> <value>Health</value> </name> </info> </topic> </topics>
You can nest topics as deeply as necessary.
To use topics, you just need to reference them from the concept definition, as follows:
<concept id="population"> <info> <name> <value>Population</value> </name> <description> <value>Size of the resident population.</value> </description> </info> <topic ref="population_indicators"/> <type ref="integer"/> </concept>
A concept may reference more than one topic.
Submitting Your Dataset
Now that you've created your dataset, the next step is to zip it and upload the zip file to the Google Public Data Explorer tool. If you encounter any problems, check the FAQ, which includes a discussion of the most common uploading issues.
For reference, you can also download the complete XML file and complete dataset bundle associated with this tutorial.
Where to Go From Here
Congratulations on creating your first DSPL dataset! Now that you understand the basics, we recommend reading through the Developer Guide, which, among other things, documents "advanced" DSPL features like multi-language support and mappable concepts.
You may also want to look at some more example datasets.