DSPL Check

DSPL Check is a utility that validates a DSPL dataset against a number of criteria, including adherence to the official DSPL schema, consistency of internal references, and CSV file structure. The utility can catch many problems that would otherwise cause DSPL import errors, helping you detect and fix them quickly before beginning the import process.

Note that the utility doesn't (yet) check your DSPL dataset for every possible problem. However, it will catch the most common issues, so if your dataset is successfully validated by the tool, there is a strong chance it will be importable and visualizable in the Public Data Explorer. See the Checking Details section below for more information.

Running DSPL Check

Basics

Note: These directions assume that you have already followed the installation instructions given on the DSPL Tools page.

To run DSPL Check, go to the terminal / prompt on your system and type:

python dsplcheck.py [path to dataset XML or zip file]

where the bracketed term is replaced with the relative path to either a dataset XML file or a zipped DSPL bundle.

If the dataset is valid, the tool prints a "validation successful" message. Otherwise, it outputs one or more error messages describing why the validation failed. In the latter case, fix your dataset as directed, then run the tool again.
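
For example, assuming your zipped dataset bundle is named mydataset.zip (a hypothetical file name) and is in the current directory, you would type:

python dsplcheck.py mydataset.zip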

Checking Level

By default, DSPL Check will examine the entire dataset, including the CSVs referenced from the main DSPL XML file. This process works well on small- to medium-sized datasets, but may get bogged down or run out of memory on very large datasets (i.e., hundreds of megabytes or larger).

To address these cases, the tool has a checking level option that lets you set the scope of the checking and improve performance as needed. To use it, insert --checking_level=[...] before the dataset path, where the bracketed term is replaced by one of the following values (an example follows this list):

  • schema_only: Validate the dataset XML file against the official DSPL schema, then stop.
  • schema_and_model: Do schema and basic model validation, but ignore CSV content after the header line.
  • full: Do schema, model, and data validation (default).
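
For example, to run only the schema check on a very large (hypothetical) bundle named mydataset.zip, you would type:

python dsplcheck.py --checking_level=schema_only mydataset.zip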

Checking Details

DSPL Check performs the following sequence of validations:

  • XML schema validation: Verifies that your dataset metadata file is valid XML and conforms to the official DSPL schema.
  • CSV existence: Checks that all of the CSV files referenced from your dataset exist and are loadable.
  • Concept checks: Various checks of each concept in your dataset, including:
    • Dataset has at least one concept*
    • All topic references are valid
    • Table reference exists if concept is used as a non-time dimension*
    • Table reference is valid if present
    • Referenced table has a column corresponding to concept ID
  • Slice checks: Various checks of each slice in your dataset, including:
    • Dataset has at least one slice*
    • At least one slice references a non-time dimension*
    • Slice has at least one metric and one dimension
    • Exactly one dimension references time canonical concept*
    • Each slice has a unique combination of dimensions
    • All references to local concepts are valid
    • Table reference exists
    • Table reference is valid
    • Referenced table has a column for each dimension and metric in the slice
    • Column types in the referenced table match the types of the concepts used in the slice
  • Table checks: Various checks of each table in your dataset, including:
    • Dataset has at least one table*
    • CSV file has same number of columns as table
    • CSV header strings match column IDs
    • All date columns have a format attribute
    • Date formats align (roughly) with the associated time concepts, e.g., the format for a time:year column includes at least one y character*
  • CSV data checks: Various checks of the CSV data files referenced by your dataset XML file (an example follows this list), including:
    • Each CSV row has the same number of columns as its header
    • Concept definition CSV has no more than one row for each concept ID
    • Slice CSV has no more than one row for each combination of dimensions
    • Dimension values referenced in slice CSV are valid
    • Slice CSV is properly sorted
    • Integer and float CSV values are properly formatted

Criteria marked by a * are necessary for visualization in the Public Data Explorer, but are technically not required by the DSPL format.
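
To illustrate the CSV data checks, consider a hypothetical slice CSV with dimensions named country and year and a metric named population. To pass the checks listed above, the header must match the slice's column IDs, each country/year combination may appear at most once, the rows must be sorted by the dimension columns, every country code must be defined in the country concept's table, and the numeric values must be plain integers or floats (no thousands separators):

country,year,population
AD,2008,84484
AD,2009,84462
AF,2008,27294031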

On the other hand, the tool does not (yet) look at the following:

  • Dataset imports
  • Attribute and property references
  • Concept extensions