DSPL Gen is a command-line utility that helps convert a CSV data table into a simple DSPL dataset. The tool parses the input file, creates all of the necessary data tables, and lays out a DSPL XML file. The result is a nearly-complete dataset "template" that requires only minor refinement before uploading and visualizing.
The fundamental assumption behind DSPL Gen is that each column in the input CSV file corresponds to a single DSPL concept and each row is a single observation for a combination of dimensions. The tool works by iterating through the columns, establishing the appropriate metadata for each column / concept (e.g., its ID, type, whether it is a dimension or a metric, etc.) and then generating all of the necessary concept definition and slice tables.
Running DSPL Gen
Note: These directions assume that you have already followed the installation instructions given on the DSPL Tools page.
To run DSPL Gen, go to the terminal / prompt on your system and type:
python dsplgen.py -o [output path] [path to CSV file]
where the bracketed terms are replaced as follows:
[output path]: Path to the directory where you'd like your output files to be placed; this directory must already exist in your file system.
[path to CSV file]: Path to your input CSV file. See the section below for more details on how to format this file.
-o [output path] component is optional; if omitted, the
output files will be placed in the current directory.
The tool then runs, iterating through your CSV file, extracting concept definitions and generating slice tables. Unless any errors are found, an XML file and one or more CSV files are written to the output directory.
Finally, open up the XML file in the text editor of your choice, and make
any desired changes. At a minimum, you should fill in the names,
descriptions, and other values that have been marked by the tool with the
** INSERT ... ** notation.
Suppose that we have a CSV that gives the total number of cases for a disease, broken out by month, country, and age group. For simplicity, let's just assume there are 2 values of each dimension:
date,country,age_group,disease_cases 01/2000,US,old,102030 01/2000,MX,old,192939 01/2000,US,young,23131 01/2000,MX,young,12311 02/2000,US,old,143135 02/2000,MX,old,203131 02/2000,US,young,22990 02/2000,MX,young,12591
Running the tool on the above input produces 4 files:
dataset.xml: The DSPL XML template
country_table.csv: Values of the
age_group_table.csv: Values of the
slice_0_table.csv: Table for dataset slice (identical to input table, except for sorting)
In this particular case, the tool is able to correctly guess all of the column types and formats. So, all that's left to do is go into the XML file and edit the various names, descriptions, and URLs in the dataset metadata.
Annotating Your Input CSV
In many cases, like the example above, DSPL Gen can guess the basic metadata for each column, e.g. the type, format, and whether it is a dimension or a metric. In other cases, however, the tool may not be able to guess these, may guess incorrectly, or may guess correctly but still, nonetheless, not follow the desired behavior.
To handle these cases, the tool supports the addition of DSPL generation options / "hints" in the CSV header row. The syntax for these is as follows:
In particular, note that the options list is placed inside brackets to the right of the column name, and the successive option settings are separated by semi-colons. Each option consists of an option name, an equals sign, and an option value.
DSPL Gen currently supports the following basic options:
||The DSPL data type for this column; must be one of
||Guessed from data|
||The format of this date column; as in DSPL, this format should be encoded using the Joda DateTime standard.||Guessed from data|
||What the role of this column is in any slices; must be either
||The canonical concept that this column represents, e.g.
||The canonical concept that this column extends, e.g.
The parent concept for this column. This parent must be a dimension concept defined inside the dataset (i.e., not imported); if given, this information will be used to define a concept hierarchy in the dataset.
Note that each child instance can only have one parent. If multiple parent values are detected then the tool will produce an error.
In addition to the basic, "hint" options described above,
DSPL Gen also supports slice roll-ups: given a single table
input, the tool can automatically create additional tables with limited
subsets of the dimensions. So, for example, if your input table corresponds to
a slice with three non-time dimensions (e.g.,
dimensions3) the tool can
create some additional slices as well: just
Adding these additional slices can make your dataset easier to explore since it reduces the number of dimension filters required to produce visualizations. On the other hand, this process may introduce "new" data not found in the original file, so it may also introduce errors if not used correctly; carefully check the rolled-up numbers before publishing your dataset.
The roll up process is triggered by including some extra options in the input CSV header:
||The aggregation method to be used when combining multiple values
of this column; must be one of
||If value is
||The value in this column that indicates rows to be kept when this column is rolled-up; only meaningful for dimension concepts.||
DSPL Gen then computes the roll-ups using the following algorithm:
- Using the values of the
rolluptags, compute all possible column combinations
- For each combination:
- If any rolled-up columns have a
total_valparameter set, then filter the rows by this value; otherwise, select all rows
- If any rolled-up columns have a
- Drop the rolled-up columns
- Group the rows by the values of the remaining dimension columns
- Aggregate each metric using the method specified in its
- Create a slice and associated table for the result
Note that if there is only a single value in each grouping and the
aggregating function is set to
SUM (the default value), then the
metric values in the source and generated tables will be the same, i.e. no new values
are generated. Thus, using the
total_val parameter and ensuring
that there is only one "total" row for each combination of the other
dimensions allows you to store pre-computed aggregates directly in the source.
Warning: The number of generated slices is equal to 2rolled up columns. Applying this option to a large number of columns may result in very long run times and/or large datasets.
Suppose that we have the same input data as in the previous example, but now including aggregates across countries:
date,country,age_group,disease_cases 01/2000,US,old,102030 01/2000,MX,old,192939 01/2000,total,old,152012 01/2000,US,young,23131 01/2000,MX,young,12311 01/2000,total,young,241011 02/2000,US,old,143135 02/2000,MX,old,203131 02/2000,total,old,211310 02/2000,US,young,22990 02/2000,MX,young,12591 02/2000,total,young,240131
The total for each age group may not match the sum of the country values because, for instance, of differences in methodology or the existence of other countries that aren't enumerated in the list.
We now want to make a DSPL dataset that has the following additional features:
- "Roll up" the countries so that we get the total disease cases by age group alone
- Use the canonical
geo:countryconcept for the
countrycolumn, so that we can automatically get the full names, locations, etc. for these
- Make the
quantity:amount, so that we can (manually) add unit information into the XML file
To do this, we modify the header row of the input CSV as follows:
Running the tool again, we now get an additional slice table, one that eliminates the distinction by country:
date,age_group,disease_cases 01/2000,old,152012 02/2000,old,211310 01/2000,young,241011 02/2000,young,240131
Note that if we hadn't included the total values and the
total_val tag in the source table, then we would have gotten the
calculated sum for each age group.
In addition, the definition of
disease_cases in the XML file
now includes the extension from
quantity:amount, as desired. The
country is dropped from the XML file because
we're using an imported concept for this instead.
After filling in the various names, descriptions, and unit information, we can zip the dataset bundle and upload it to the Public Data Explorer for visualization.