DSPL: Dataset Publishing Language

DSPL FAQ

This document covers the most frequent issues experienced by data owners when creating DSPL datasets and uploading these into the Public Data Explorer.

General Questions

What is DSPL?

DSPL stands for Dataset Publishing Language. It is a representation format for both the metadata (information about the dataset, such as its name and provider, as well as the concepts it contains and displays) and actual data of datasets. The metadata is specified in XML, whereas the data are provided in CSV format.

What are the key advantages of using DSPL?

DSPL is designed from the ground up for rich data visualizations like those in the Public Data Explorer. Creating these requires detailed metadata around slices, dimensions, and metrics, entities that are not as well supported in other dataset formats.

DSPL also supports dataset imports, concept hierarchies (e.g., "country" is the child of "continent"), geocoded data, and a number of other unique features that enhance the data exploration experience.

Is DSPL a replacement for other formats used for data exchange and/or analysis?

Generally not. As noted in the previous answer, DSPL is designed for interactive visualization and exploration. It is not intended as a generic, do-all data interchange or analysis format.

Ultimately, we view DSPL as being complementary to other formats. Users should be able to create DSPL datasets from other sources for the purpose of creating rich, interactive data visualizations.

What can I do with a DSPL dataset?

You can import it into the Public Data Explorer, publish it, and allow others to explore the data via rich, interactive visualizations. Published datasets can also be included in the Public Data Directory so that interested users can find them.

Currently, this is the only application using DSPL. However, we encourage people to use it for other applications, and we expect that adoption will grow over time.

What types of datasets are most appropriate for DSPL?

The DSPL format supports arbitrary collections of tables, and is thus appropriate for a wide variety of dataset types. Only a subset of DSPL datasets, however, will produce interesting visualizations in the Public Data Explorer. The Public Data Explorer, in particular, works best for data that are:

  • Quantitative: Each data point has one or more numeric metrics associated with it (e.g., "population", "number of flu cases", "revenue").
  • Categorical: Data can be organized into a finite number of text-describable categories (e.g., "countries", "genders", "age groups").
  • Time series: For each category, the data metrics vary as a function of time, and adjacent points are at least one day apart (the Public Data Explorer cannot visualize time increments smaller than a day).
  • Aggregated: For each time / category / metric combination, there is a single data point, not a list of events or facts.
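To illustrate the "aggregated" requirement, here is a minimal Python sketch that turns a list of individual events into one data point per category and year. The `events` records and the `aggregate_by_year` helper are hypothetical and serve only as an illustration; they are not part of DSPL or the Public Data Explorer.

```python
from collections import Counter

# Hypothetical raw event records: one row per reported case (date, country).
events = [
    ("2010-01-03", "US"),
    ("2010-02-14", "US"),
    ("2010-03-09", "FR"),
    ("2011-06-21", "US"),
]

def aggregate_by_year(rows):
    """Count events per (country, year) so each combination has a single value."""
    counts = Counter((country, date[:4]) for date, country in rows)
    # Emit rows ordered by category first, then by time.
    return [(country, year, n) for (country, year), n in sorted(counts.items())]

aggregated = aggregate_by_year(events)
for country, year, cases in aggregated:
    print(f"{country},{year},{cases}")
```

Each output row now carries exactly one metric value for its time / category combination, which is the shape described in the list above.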

I've created a DSPL dataset, and I'd like for it to appear in the Google Public Data Directory so that others can find it. Whom do I contact?

Please fill out this form, and provide a link to your dataset.

I'm having trouble with DSPL. Where do I go for help?

Please post your issue on the DSPL discussion forum.

DSPL Dataset Files

How should I encode my XML and CSV files?

All XML and CSV files need to be UTF-8 encoded. Note that ASCII (sometimes referred to as "plain text") is a subset of UTF-8, so datasets in that format should work as well.
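As a rough way to check encoding before uploading, the following Python sketch tests whether a file's bytes decode cleanly as UTF-8 and, if the original encoding is known, converts them. The helper names are illustrative assumptions, and the sketch only verifies decodability; it does not guess an unknown source encoding.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (plain ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert bytes from a known source encoding (e.g. "latin-1") to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

# Example: a Latin-1 encoded value is not valid UTF-8 until it is converted.
latin1_bytes = "São Tomé".encode("latin-1")
print(is_utf8(latin1_bytes))
print(is_utf8(reencode_to_utf8(latin1_bytes, "latin-1")))
```

To check a file on disk, read it in binary mode and pass the result to `is_utf8`.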

What software should I use to create and edit my dataset files?

A plain text editor, with syntax highlighting for readability purposes, is the recommended choice for editing your XML files; see this article for some platform-specific recommendations. We advise against using fully-featured, general-purpose word processors as these tend to insert additional formatting tags into your XML, which can cause import errors.

A spreadsheet is usually the easiest way to create and edit your data files. Just be sure to save them in the correct format (CSV / comma-separated values).

I have data in Excel, SPSS, SAS, or some other system. Can I import these directly into the Public Data Explorer?

No, not at this time. You need to first export your data to CSV format, add the appropriate XML metadata, and then upload a DSPL-compliant dataset into the Public Data Explorer.

Does it matter what I name my files?

Your dataset XML file should have a name that ends in .xml. The associated CSV data files can have any names, provided that they match the names given in the <file> tags in your XML metadata. The zip file used to package and import the dataset into the Public Data Explorer can also have any name.
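One way to catch naming mismatches before packaging is to list the file names referenced in the XML and compare them with the CSV files you actually have. The Python sketch below does this with the standard library. `SAMPLE_XML` is a simplified fragment written for illustration rather than a complete DSPL file, and the helper names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment; a real DSPL file contains much more.
SAMPLE_XML = """<dspl>
  <tables>
    <table id="countries_table">
      <column id="country" type="string"/>
      <data><file format="csv" encoding="utf-8">countries.csv</file></data>
    </table>
  </tables>
</dspl>"""

def referenced_csv_files(xml_text: str) -> list[str]:
    """Collect the text of every <file> element, ignoring any XML namespace."""
    root = ET.fromstring(xml_text)
    names = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop a "{namespace}" prefix
        if local_name == "file" and elem.text:
            names.append(elem.text.strip())
    return names

def missing_files(xml_text: str, available: set[str]) -> list[str]:
    """Return referenced names that are not among the available file names."""
    return [name for name in referenced_csv_files(xml_text) if name not in available]

# The comparison is exact, so a difference in capitalization counts as a mismatch.
print(missing_files(SAMPLE_XML, {"Countries.csv"}))
```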

Should my CSV files be sorted?

Yes. You should sort the content of your CSV files by the non-time dimensions (in any order or direction) and then, optionally, by any of the other columns (e.g., time).

So, for example, if you have a CSV with the columns date, dimension1, dimension2, metric1, and metric2, then you should sort by dimension1 and dimension2 (in any order). If you'd like to also sort by the date/time column, then this should be the last thing you sort by.

Sorting in this way keeps the observations for each time series grouped together, which greatly improves the efficiency of the DSPL import process.
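The sort order described above can be sketched in Python as follows, using the column names from the example (date, dimension1, dimension2, metric1, metric2). The `sort_slice_csv` helper is hypothetical. It compares values as strings, so the optional time sort is only chronological for formats such as yyyy or yyyy-MM-dd.

```python
import csv
import io

def sort_slice_csv(csv_text, dimension_columns, time_column=None):
    """Sort rows by the non-time dimensions first, then optionally by time."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keys = list(dimension_columns) + ([time_column] if time_column else [])
    # Values are compared as strings; the time column is always the last sort key.
    rows = sorted(reader, key=lambda row: tuple(row[k] for k in keys))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

raw = (
    "date,dimension1,dimension2,metric1,metric2\n"
    "2011,US,F,3,30\n"
    "2010,FR,M,1,10\n"
    "2010,US,F,2,20\n"
    "2011,FR,M,4,40\n"
)
print(sort_slice_csv(raw, ["dimension1", "dimension2"], "date"))
```

After sorting, all rows for a given dimension1 / dimension2 combination sit next to each other, with the dates in ascending order within each group.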

XML Model and Syntax

How do I decide what should be a metric and what should be a dimension?

A dimension is an entity that is used to segment or filter your data. A metric, on the other hand, describes the observed value or values associated with each data point.

Generally, dimensions are categorical whereas metrics are non-categorical, time-varying, numeric values. Some prototypical examples of each are as follows:

  • Dimensions: Country, state, county, region, year, month, sex, age category, industry segment
  • Metrics: Population, GDP, unemployment rate, literacy, revenue, cost, price

What is the difference between a property and an attribute?

Properties are attached to each instance of a concept. For example, a continent property will have different values for different countries. Attributes, on the other hand, are associated with the concept as a whole. For example, an isParent attribute is true for all continents.

Does the order of tags matter?

Yes. Add your tags in the order in which they appear in the Developer Guide. For example, <topic> should appear before <type> in the definition of a concept.

Does capitalization matter?

Yes, your XML tag and attribute names need to be capitalized in the same manner as they appear in the Developer Guide. For example, using isparent instead of isParent in a property tag will cause an import error.

Can a concept have two parents?

No. Each concept can have only one isParent reference.

Can a concept refer to itself?

Yes. See the US Retail Sales dataset for an example of a self-referencing concept hierarchy.

Data Formatting

How do I format dates?

Dates can be written in any format that is describable with the Joda DateTime standard. The Joda formatting code should be stored in a format attribute within the corresponding table column element.

The Joda formatting codes for some popular date formats are listed below:

Date Example    Joda Format
2010            yyyy
May 2010        MMM yyyy
05/21/2010      MM/dd/yyyy
21/05/2010      dd/MM/yyyy
2010-05-21      yyyy-MM-dd

In particular, note that the Joda code for month characters is M, not m (which represents minutes).
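To spot-check date values against a format before importing, you can approximate the Joda codes listed above with Python's `strptime` directives. The mapping below is an assumption of this sketch and covers only the tokens shown in this section (yyyy, MMM, MM, dd). Python's parser is not identical to Joda's, and month names depend on the locale, so treat the result as a rough pre-check only.

```python
from datetime import datetime

# Partial, illustrative mapping from Joda tokens to strptime directives.
# Longer tokens come first so that "MMM" is not consumed as "MM" + "M".
JODA_TO_STRPTIME = [("yyyy", "%Y"), ("MMM", "%b"), ("MM", "%m"), ("dd", "%d")]

def joda_to_strptime(pattern):
    """Translate a simple Joda pattern into an approximate strptime pattern."""
    for joda_token, directive in JODA_TO_STRPTIME:
        pattern = pattern.replace(joda_token, directive)
    return pattern

def matches_format(value, joda_pattern):
    """Return True if the value parses under the translated pattern."""
    try:
        datetime.strptime(value, joda_to_strptime(joda_pattern))
        return True
    except ValueError:
        return False

print(matches_format("21/05/2010", "dd/MM/yyyy"))
print(matches_format("05/21/2010", "dd/MM/yyyy"))
```

The second call fails because 21 is not a valid month, which is the kind of value/format mismatch worth catching early.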

Can I use time units smaller than one day?

The Joda DateTime format, and hence DSPL as well, supports time values down to the order of milliseconds. The Public Data Explorer, however, cannot (yet) visualize any time granularities smaller than a day.

Using Canonical Concepts

What are "canonical concepts" and how are they useful?

The term "canonical concepts" refers to a set of Google-created concepts that are intended as basic "building blocks" in other datasets. The concepts are defined across six DSPL datasets, which group them into categories such as "time", "geo", etc. To get access to these concepts, just import the appropriate parent dataset(s) at the beginning of your DSPL XML file.

Canonical concepts are useful because they help save time (e.g., by not having to manually enter latitude and longitude values for every country in the world) and also signal how your data are to be visualized. For instance, the Public Data Explorer uses the time:... concepts to format the line chart x-axis, uses the name property of the entity:entity concept to produce strings for the dimension picker UI, uses the latitude and longitude properties of geo:location to display data in the map visualization, and so forth.

Are all the canonical concepts understood by the Public Data Explorer?

While most of the provided canonical concepts are understood by the Public Data Explorer, there are a few that are not (yet) visualizable. These are listed below, along with some suggested workarounds:

Concept           Workaround
quantity:index    Use quantity:ratio or quantity:magnitude instead.
time:quarter      Use time:month as described in the DSPL Cookbook.
time:week         Use time:day as described in the DSPL Cookbook.

Stay tuned for better support of these concepts in the future.

How do I use a canonical concept in my dataset?

See the documentation for the specific concept you'd like to use, and also check out the DSPL Cookbook, which has detailed, step-by-step directions for the most common ones.

Importing and Visualizing Datasets

Why can't I import my dataset successfully?

The Public Data Explorer's upload interface will scan your DSPL dataset and block its import if any errors are detected. The importer is very sensitive to spelling, capitalization, and tag order / placement in your XML file, as well as the layout and sorting of data in your CSV files, so it may take a few passes to get these things right and import your dataset successfully.

The first step in resolving these issues is to look at the error message(s) given in the UI and take the appropriate corrective action. Since these messages are not always the easiest to understand (something we're actively working on improving), we've compiled a table that explains the most common ones:

Error: duplicate key: ...
Explanation: The definition table for your concept has a repeated ID value (i.e., value in the column with the same name as the concept). These values are used to uniquely identify individual instances of the concept, so duplicates aren't allowed.

Error: Exception in parsing data rows from source caused by The combination of properties, [...], appears in more than one distinct group of rows in the data.
Explanation: Your CSV is not sorted properly. See the discussion above for instructions on how to do this.

Error: Exception in parsing data rows from source caused by Invalid format: "..." is malformed at "..."
Explanation: The formatting of this value (typically a date) in your CSV is not consistent with the format given in your XML file. Change the format or the value so that they match.

Error: Exception in parsing data rows from source caused by Number of elements in line (...) did not match number of specified properties (...) for line: [...]
Explanation: A row in your CSV has either too many or too few values. Fix the formatting of this row.

Error: Exception in parsing data rows from source caused by For input string: "..."
Explanation: A value in your CSV (typically an integer or float) has non-numeric characters in it (e.g., a dollar symbol, a percentage sign, etc.) that prevent it from being properly parsed. Remove these extra characters.

Error: Exception in parsing data rows from source caused by Data value '...' for property '...' of Slice '...' is not a key value of the referenced Concept '...'.
Explanation: One of your slices contains an unrecognized dimension value (i.e., one that isn't in the list of all possible values for the corresponding concept). Go back to the dimension concept definition table and add the value, if necessary.

Error: Header '...' in data is a constant property in table
Explanation: The column header in the CSV does not match the column ID defined in the XML table definition. Change one or the other so that they match.

Error: XML parsing error ... Invalid content was found starting with element '...'. One of '{...}', '{...}', ... is expected.
Explanation: The referenced XML element is not in the right place. Check to make sure the order is correct, and also that the element has the correct parent (e.g., info for name).

Error: XML parsing error ... Attribute '...' is not allowed to appear in element '...'.
Explanation: The spelling, case, or location of this XML tag attribute is incorrect. Check the documentation for the appropriate usage.

Error: XML parsing error. ... Element '...' cannot have character [children], because the type's content type is element-only.
Explanation: There's some stray text in your XML file (potentially caused by a tag that's missing a < or >). Fix the text and try again.

If you have trouble understanding a message not in the above list, please post a message in the DSPL forum, and we'll try to help.
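Several of the CSV-related errors above can be caught locally before uploading. The Python sketch below shows four simple pre-flight checks: repeated IDs, rows with the wrong number of values, non-numeric metric values, and dimension groups that are split across non-adjacent rows. All function names are hypothetical, the checks only approximate what the importer does, and the line numbers assume that no quoted field spans multiple lines.

```python
import csv
import io

def duplicate_keys(csv_text, key_column):
    """Return ID values that occur more than once in a concept definition table."""
    seen, dupes = set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column]
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes

def rows_with_wrong_length(csv_text):
    """Return line numbers (header is line 1) whose value count differs from the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header_len = len(rows[0])
    return [n for n, row in enumerate(rows[1:], start=2) if len(row) != header_len]

def non_numeric_values(csv_text, metric_column):
    """Return metric values that Python cannot parse as a number (e.g. "$100")."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            float(row[metric_column])
        except (TypeError, ValueError):
            bad.append(row[metric_column])
    return bad

def is_grouped(csv_text, dimension_columns):
    """Return True if each combination of dimension values forms one contiguous run of rows."""
    seen, previous = set(), None
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row[c] for c in dimension_columns)
        if key != previous:
            if key in seen:
                return False  # this group already appeared earlier in the file
            seen.add(key)
            previous = key
    return True

print(duplicate_keys("country,name\nUS,United States\nFR,France\nUS,USA\n", "country"))
print(non_numeric_values("year,revenue\n2010,$100\n2011,250.5\n", "revenue"))
```

A clean result from these checks does not guarantee a successful import, but a failure usually points at the row or value that needs fixing.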

My dataset imports successfully, but I can't get any visualizations to show up in the Public Data Explorer. What's going on?

This problem occurs when your dataset is valid DSPL, but is not in the subset of DSPL that is visualizable in the Public Data Explorer. There are many possible causes for this; the most common are:

  • Defining a dimension concept without a table: Without this information, the Public Data Explorer does not know what choices to display in the UI.
  • Creating a dataset with only metrics: The Public Data Explorer requires at least one categorical (i.e., non-time) dimension defined somewhere in the dataset to properly structure the visualization UI.
  • Not including a time dimension in your slices: The Public Data Explorer can only visualize time series. Non-time slices will be ignored by the product.
  • Using a time dimension other than the canonical time:... ones: The Public Data Explorer uses the canonical time concepts for laying out and animating the various visualizations in the product; it does not understand other time concepts, e.g. ones created inside your own dataset.
  • Using time values that are too big or too small: The Public Data Explorer does not yet visualize datasets with time granularities smaller than one day. On the other end of the spectrum, the tool has trouble with very large year values (e.g., in the tens of thousands). We hope to make these granularities more flexible in the future.

How do I integrate my visualized dataset into my web site?

See this article in the Public Data Explorer Help Center. As explained there, you can get a "full embed" (i.e., one including the exploration controls) by manually adjusting the embed URL.
