DSPL Gen

DSPL Gen is a command-line utility that helps convert a CSV data table into a simple DSPL dataset. The tool parses the input file, creates all of the necessary data tables, and lays out a DSPL XML file. The result is a nearly-complete dataset "template" that requires only minor refinement before uploading and visualizing.

The fundamental assumption behind DSPL Gen is that each column in the input CSV file corresponds to a single DSPL concept and each row is a single observation for a combination of dimensions. The tool works by iterating through the columns, establishing the appropriate metadata for each column / concept (e.g., its ID, type, whether it is a dimension or a metric, etc.) and then generating all of the necessary concept definition and slice tables.

Running DSPL Gen

Syntax

Note: These directions assume that you have already followed the installation instructions given on the DSPL Tools page.

To run DSPL Gen, go to the terminal / prompt on your system and type:

python dsplgen.py -o [output path] [path to CSV file] 

where the bracketed terms are replaced as follows:

  • [output path]: Path to the directory where you'd like your output files to be placed; this directory must already exist in your file system.
  • [path to CSV file]: Path to your input CSV file. See the section below for more details on how to format this file.

The -o [output path] component is optional; if omitted, the output files will be placed in the current directory.

The tool then runs, iterating through your CSV file, extracting concept definitions and generating slice tables. Unless any errors are found, an XML file and one or more CSV files are written to the output directory.

Finally, open up the XML file in the text editor of your choice, and make any desired changes. At a minimum, you should fill in the names, descriptions, and other values that have been marked by the tool with the ** INSERT ... ** notation.

Simple Example

Suppose that we have a CSV that gives the total number of cases for a disease, broken out by month, country, and age group. For simplicity, let's just assume there are 2 values of each dimension:

date,country,age_group,disease_cases
01/2000,US,old,102030
01/2000,MX,old,192939
01/2000,US,young,23131
01/2000,MX,young,12311
02/2000,US,old,143135
02/2000,MX,old,203131
02/2000,US,young,22990
02/2000,MX,young,12591

Running the tool on the above input produces 4 files:

  • dataset.xml: The DSPL XML template
  • country_table.csv: Values of the country concept
  • age_group_table.csv: Values of the age_group concept
  • slice_0_table.csv: Table for dataset slice (identical to input table, except for sorting)

In this particular case, the tool is able to correctly guess all of the column types and formats. So, all that's left to do is go into the XML file and edit the various names, descriptions, and URLs in the dataset metadata.

Annotating Your Input CSV

Basic Options

In many cases, like the example above, DSPL Gen can guess the basic metadata for each column, e.g. the type, format, and whether it is a dimension or a metric. In other cases, however, the tool may not be able to guess these, may guess incorrectly, or may guess correctly but still, nonetheless, not follow the desired behavior.

To handle these cases, the tool supports the addition of DSPL generation options / "hints" in the CSV header row. The syntax for these is as follows:

column1[option1=value1;option2=value2;...],column2[option1=value1;option2=value2;...],column3[...],...

In particular, note that the options list is placed inside brackets to the right of the column name, and the successive option settings are separated by semi-colons. Each option consists of an option name, an equals sign, and an option value.

DSPL Gen currently supports the following basic options:

Option Description Default
type The DSPL data type for this column; must be one of date, float, integer, or string. Guessed from data
format The format of this date column; as in DSPL, this format should be encoded using the Joda DateTime standard. Guessed from data
slice_role What the role of this column is in any slices; must be either dimension or metric. dimension if column is of type date or string, otherwise metric
concept The canonical concept that this column represents, e.g. geo:country; if given, then the concept is imported and no definition is produced in the XML file. None
extends The canonical concept that this column extends, e.g. entity:entity. None
parent

The parent concept for this column. This parent must be a dimension concept defined inside the dataset (i.e., not imported); if given, this information will be used to define a concept hierarchy in the dataset.

Note that each child instance can only have one parent. If multiple parent values are detected then the tool will produce an error.

None

Value Roll-ups

In addition to the basic, "hint" options described above, DSPL Gen also supports slice roll-ups: given a single table input, the tool can automatically create additional tables with limited subsets of the dimensions. So, for example, if your input table corresponds to a slice with three non-time dimensions (e.g., dimension1, dimension2, and dimensions3) the tool can create some additional slices as well: just dimension1, just dimension2, just dimension1 and dimension2, etc.

Adding these additional slices can make your dataset easier to explore since it reduces the number of dimension filters required to produce visualizations. On the other hand, this process may introduce "new" data not found in the original file, so it may also introduce errors if not used correctly; carefully check the rolled-up numbers before publishing your dataset.

The roll up process is triggered by including some extra options in the input CSV header:

Option Description Default
aggregation The aggregation method to be used when combining multiple values of this column; must be one of AVG, COUNT, MAX, MIN, or SUM; only meaningful for metrics. SUM
rollup If value is true, the concept for this column is "rolled up" (i.e., aggregated away); only meaningful for dimension concepts. False
total_val The value in this column that indicates rows to be kept when this column is rolled-up; only meaningful for dimension concepts. None

DSPL Gen then computes the roll-ups using the following algorithm:

  • Using the values of the rollup tags, compute all possible column combinations
  • For each combination:
    • If any rolled-up columns have a total_val parameter set, then filter the rows by this value; otherwise, select all rows
    • Drop the rolled-up columns
    • Group the rows by the values of the remaining dimension columns
    • Aggregate each metric using the method specified in its aggregation parameter
    • Create a slice and associated table for the result

Note that if there is only a single value in each grouping and the aggregating function is set to SUM (the default value), then the metric values in the source and generated tables will be the same, i.e. no new values are generated. Thus, using the total_val parameter and ensuring that there is only one "total" row for each combination of the other dimensions allows you to store pre-computed aggregates directly in the source.

Warning: The number of generated slices is equal to 2rolled up columns. Applying this option to a large number of columns may result in very long run times and/or large datasets.

Advanced Example

Suppose that we have the same input data as in the previous example, but now including aggregates across countries:

date,country,age_group,disease_cases
01/2000,US,old,102030
01/2000,MX,old,192939
01/2000,total,old,152012
01/2000,US,young,23131
01/2000,MX,young,12311
01/2000,total,young,241011
02/2000,US,old,143135
02/2000,MX,old,203131
02/2000,total,old,211310
02/2000,US,young,22990
02/2000,MX,young,12591
02/2000,total,young,240131

The total for each age group may not match the sum of the country values because, for instance, of differences in methodology or the existence of other countries that aren't enumerated in the list.

We now want to make a DSPL dataset that has the following additional features:

  1. "Roll up" the countries so that we get the total disease cases by age group alone
  2. Use the canonical geo:country concept for the country column, so that we can automatically get the full names, locations, etc. for these
  3. Make the disease_cases concept extend quantity:amount, so that we can (manually) add unit information into the XML file

To do this, we modify the header row of the input CSV as follows:

date,country[concept=geo:country;rollup=true;total_val=total],age_group,disease_cases[extends=quantity:amount]

Running the tool again, we now get an additional slice table, one that eliminates the distinction by country:

date,age_group,disease_cases
01/2000,old,152012
02/2000,old,211310
01/2000,young,241011
02/2000,young,240131

Note that if we hadn't included the total values and the total_val tag in the source table, then we would have gotten the calculated sum for each age group.

In addition, the definition of disease_cases in the XML file now includes the extension from quantity:amount, as desired. The definition of country is dropped from the XML file because we're using an imported concept for this instead.

After filling in the various names, descriptions, and unit information, we can zip the dataset bundle and upload it to the Public Data Explorer for visualization.