Datasets

The web contains specialized repositories for datasets in many scientific domains: life sciences, earth sciences, material sciences, and more. Similarly, many governments maintain repositories of civic and government data. However, much of that structured data is not readily available to search engines, which must extract the data from HTML pages in order to provide search services to users. When webmasters provide structured markup, they enable search engines to “understand” this metadata, which in turn improves data discovery, leading scientists to the information they need for their work.

For example, consider this dataset that describes historical snow levels in the Northern Hemisphere. This page contains basic information about the data, like spatial coverage and units. Other pages on the site contain additional metadata: who produces the dataset, how to download it, and the license for using the data. With structured data markup, these pages can be more easily discovered by other scientists searching for climate data in that subject area.

Help Google improve dataset discovery efforts by filling out our interest form. This is not a technically required action, but if you provide your contact information, we can reach out to you to learn more about your field of study.

What qualifies as a dataset?

For purposes of inclusion, we take a broad view of what qualifies as a dataset:

  • A table or a CSV file with some data
  • A file in a proprietary format that contains data
  • A collection of files that together constitute some meaningful dataset
  • A structured object with data in some other format that you might want to load into a special tool for processing
  • Images capturing the data
  • Anything that looks like a dataset to you

Mark up your dataset descriptions

This section highlights the properties commonly seen in descriptions of datasets and provides examples of how to use them correctly in markup. Here the focus is on describing information about a dataset (its metadata) and representing its contents. For example, dataset metadata states what the dataset is about, which variables it measures, who created it, and so on. It does not, for example, contain specific values for the variables.

When describing dataset metadata, start with Basic dataset properties and add other properties that apply to your specific dataset as described in the sections that follow. For a complete list of all the dataset properties and their definitions, see Dataset definition in schema.org.

Basic dataset properties

This table shows only the required basic properties.

Properties
name Text

A descriptive name of a dataset (e.g., “Snow depth in Northern Hemisphere”)

description Text

A short summary describing a dataset.

url URL

Location of a page describing the dataset.

sameAs URL

Other URLs that can be used to access the dataset page.

version Text, Number

The version number for this dataset.

keywords Text

Keywords summarizing the dataset.

variableMeasured Text, PropertyValue

What does the dataset measure? (e.g., temperature, pressure)

creator.name Person, Organization

The name of the dataset creator (person or organization).
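
As a minimal sketch of how the basic properties above might fit together, the following Python snippet builds a Dataset description and serializes it as JSON-LD. The description, URLs, version, keywords, and creator name are illustrative placeholders, not values from a real dataset record.

```python
import json

# All values are illustrative placeholders, not a real dataset record.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": "Historical snow depth measurements for the Northern Hemisphere.",
    "url": "https://example.com/datasets/snow-depth",          # hypothetical landing page
    "sameAs": "https://example.org/mirror/snow-depth",         # hypothetical alternate URL
    "version": "1.0",
    "keywords": ["snow depth", "climate"],
    "variableMeasured": "snow depth",
    "creator": {"@type": "Organization", "name": "Example Climate Lab"},
}

# Serialize to the JSON-LD text that would be placed in the page.
json_ld = json.dumps(dataset, indent=2)
print(json_ld)
```

The creator is written here as a nested Organization object so that creator.name resolves to its name; a Person object would work the same way.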

Data catalog properties

Datasets are often published in repositories that contain many other datasets, and the same dataset can be included in more than one such repository. You can refer to a data catalog that this dataset belongs to by referencing it directly:

Properties
includedInDataCatalog DataCatalog

The catalog to which this dataset belongs.
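
A sketch of referencing catalogs from a dataset description, again in Python and serialized as JSON-LD. The catalog names and URLs are hypothetical. Because the same dataset can appear in more than one repository, the property is given here as a list of DataCatalog objects.

```python
import json

# Hypothetical catalogs; replace with the repositories that actually list the dataset.
catalogs = [
    {"@type": "DataCatalog", "name": "Example Climate Catalog", "url": "https://example.com/catalog"},
    {"@type": "DataCatalog", "name": "Example Open Data Portal", "url": "https://example.org/portal"},
]

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "includedInDataCatalog": catalogs,
}

print(json.dumps(dataset, indent=2))
```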

Download information properties

Because the url property often points to a landing page that describes the dataset rather than to the data itself, use the distribution property, whose value is a DataDownload, to describe where to get the data and in what format. This property can have several values: for instance, a CSV version may be available at one URL and an Excel version at another.

Properties
distribution DataDownload

Description of the location for download of the dataset and the file format for download.

distribution.fileFormat Text, recommended

The file format of this distribution.

distribution.contentUrl URL, required

The link for the download.
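
The CSV and Excel case described above could be sketched as follows. The download URLs and the format labels are placeholders chosen for illustration. The check before serialization reflects that contentUrl is marked as required in the table.

```python
import json

# Hypothetical download locations, one DataDownload per available format.
distribution = [
    {"@type": "DataDownload", "fileFormat": "CSV",
     "contentUrl": "https://example.com/data/snow-depth.csv"},
    {"@type": "DataDownload", "fileFormat": "XLSX",
     "contentUrl": "https://example.com/data/snow-depth.xlsx"},
]

# contentUrl is required for each distribution, so fail early if one is missing.
missing = [d for d in distribution if not d.get("contentUrl")]
if missing:
    raise ValueError(f"distribution entries without contentUrl: {missing}")

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "distribution": distribution,
}

print(json.dumps(dataset, indent=2))
```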

Temporal coverage

The data in the dataset covers a specific time interval. Schema.org uses the ISO 8601 standard to describe time intervals and time points. You can describe dates differently depending on the dataset interval.

Single date

  "temporalCoverage" : "2016-01-01"

Time period

  "temporalCoverage" : "2015-11-01/2016-04-01"
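
If dates are held as Python date objects, both forms above can be produced with a small helper. This is a sketch only; the function name temporal_coverage is hypothetical, and it handles calendar dates but not times or open-ended intervals.

```python
from datetime import date

def temporal_coverage(start, end=None):
    """Return an ISO 8601 date, or a start/end interval when end is given."""
    if end is None:
        return start.isoformat()
    if end < start:
        raise ValueError("end date precedes start date")
    # ISO 8601 separates the two ends of an interval with a slash.
    return f"{start.isoformat()}/{end.isoformat()}"

print(temporal_coverage(date(2016, 1, 1)))                      # 2016-01-01
print(temporal_coverage(date(2015, 11, 1), date(2016, 4, 1)))   # 2015-11-01/2016-04-01
```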

Spatial coverage

You can provide a single point that describes the spatial aspect of the dataset (e.g., a single point where all the measurements were collected), or, for example, the coordinates of a bounding box for an area.

Points

  "spatialCoverage": {
    "@type": "Place",
    "geo": {
      "@type": "GeoCoordinates",
      "latitude": 39.3280,
      "longitude": 120.1633
    }
  }

Coordinates

Use GeoShape to describe areas of different shapes. For example, to specify a bounding box:

  "spatialCoverage": {
    "@type": "Place",
    "geo": {
      "@type": "GeoShape",
      "box": "39.3280 120.1633 40.445 123.7878"
    }
  }

Named locations

  "spatialCoverage": "Tahoe City, CA"
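
The point and bounding-box forms can be generated from numeric coordinates with helpers like the ones sketched below. The function names are hypothetical, and the box string is assumed to list two corner points as latitude and longitude pairs separated by spaces, following the ordering of the example above.

```python
import json

def _check(lat, lon):
    # Basic range validation for decimal-degree coordinates.
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")

def point_coverage(lat, lon):
    """Place with a single GeoCoordinates point."""
    _check(lat, lon)
    return {"@type": "Place",
            "geo": {"@type": "GeoCoordinates", "latitude": lat, "longitude": lon}}

def box_coverage(lat1, lon1, lat2, lon2):
    """Place with a GeoShape bounding box given by two corner points."""
    _check(lat1, lon1)
    _check(lat2, lon2)
    return {"@type": "Place",
            "geo": {"@type": "GeoShape", "box": f"{lat1} {lon1} {lat2} {lon2}"}}

print(json.dumps({"spatialCoverage": point_coverage(39.328, 120.1633)}, indent=2))
print(json.dumps({"spatialCoverage": box_coverage(39.328, 120.1633, 40.445, 123.7878)}, indent=2))
```

A named location needs no helper, since the property value is then just a string.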

Citations and publications

A dataset can be referred to in a publication, so adding a citation to the publication is particularly helpful for discovery:

Properties
citation Text, CreativeWork

A citation for a publication that describes the dataset (e.g., “J. Smith, ‘How I created an awesome dataset’, Journal of Data Science, 1966”)
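
Since citation accepts either Text or a CreativeWork, the two forms might look like the sketch below. The structured form breaks the example citation into name, author, and datePublished, which is one possible decomposition rather than a required one.

```python
import json

# Form 1: the citation as plain text.
citation_text = "J. Smith, 'How I created an awesome dataset', Journal of Data Science, 1966"

# Form 2: the same citation as a structured CreativeWork (illustrative decomposition).
citation_work = {
    "@type": "CreativeWork",
    "name": "How I created an awesome dataset",
    "author": {"@type": "Person", "name": "J. Smith"},
    "datePublished": "1966",
}

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "citation": citation_work,  # or citation_text
}

print(json.dumps(dataset, indent=2))
```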

Provenance and license information

Finally, you can describe additional information about the publication of the dataset, such as its license and when it was published. For the license, for example, specify the URL of the license:

Properties
license URL, Text

A license under which the dataset is distributed.
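
A short sketch of license markup given as a URL. The Creative Commons CC0 URL is used only as an example of a license link, and the datePublished value is a placeholder.

```python
import json
from urllib.parse import urlparse

# Example license link; substitute the license that actually applies to the data.
license_url = "https://creativecommons.org/publicdomain/zero/1.0/"

# Light sanity check that the value is an absolute http(s) URL.
parts = urlparse(license_url)
if parts.scheme not in ("http", "https") or not parts.netloc:
    raise ValueError(f"license is not an absolute URL: {license_url}")

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "license": license_url,
    "datePublished": "2016-04-01",  # placeholder publication date
}

print(json.dumps(dataset, indent=2))
```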

Site-wide structure: sitemaps and sameAs

Using sitemap files and sameAs markup helps document how dataset descriptions are published throughout your site. Using a sitemap is particularly important in cases where URLs cannot be reached without searching.

Use a sitemap file to help search engine crawlers find your URLs.

If you have a dataset repository, you likely have at least two types of pages: the canonical ("landing") pages for each dataset and pages that list multiple datasets (e.g. search results, or some subset of datasets). We recommend that you add structured data about a dataset to the canonical pages. If you also add structured data to pages with multiple datasets, use the sameAs property to link to the canonical page for your dataset. Google does not need every mention of the same dataset to be explicitly marked up, but if you do so for other reasons, we strongly encourage the use of sameAs.

Use a sameAs property to link to the canonical page if you add structured data to multiple copies of the dataset, such as listings in search results pages.
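
As an illustration of the sitemap recommendation, the sketch below writes a minimal sitemap that lists the canonical landing page of each dataset. The function name build_sitemap and the URLs are hypothetical; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return sitemap XML with one <url><loc> entry per landing page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical canonical ("landing") pages for two datasets.
landing_pages = [
    "https://example.com/datasets/snow-depth",
    "https://example.com/datasets/storm-events",
]

sitemap_xml = build_sitemap(landing_pages)
print(sitemap_xml)
```

Listing only the canonical pages keeps the sitemap aligned with where the structured data lives; listing pages can then point back to them with sameAs.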

Complete Example: Datasets markup in JSON-LD

The following example is based on https://catalog.data.gov/dataset/ncdc-storm-events-database, a real-world dataset description.

The structure is very close to that used in the W3C DCAT specification. We expect to add a DCAT example in a future revision of these guidelines.

Source and Provenance Guidelines

It is common for open datasets to be republished, aggregated, or based on other datasets. This is an initial outline of our approach to representing situations in which a dataset is a copy of, or otherwise based upon, another dataset.

  • Use the sameAs property to indicate the most canonical URL(s) for the original in cases when the dataset or description is a simple republication of materials published elsewhere.
  • Use the isBasedOn property in cases where the republished dataset (including its metadata) has been changed significantly.
  • When a dataset derives from or aggregates several originals, use the isBasedOn property.
  • Use the identifier property to attach any relevant Digital Object Identifiers (DOIs).
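
The guidelines in this list can be expressed as a small decision helper. This is a sketch under stated assumptions: the function name provenance_markup, the three relation labels, and the example URLs and DOI are all hypothetical and are not part of schema.org.

```python
import json

def provenance_markup(relation, original_urls, doi_url=None):
    """Choose sameAs or isBasedOn from how the dataset relates to its originals."""
    if relation == "republished":
        # Simple republication: point at the canonical original(s).
        markup = {"sameAs": original_urls}
    elif relation in ("modified", "aggregated"):
        # Significantly changed, or derived from several originals.
        markup = {"isBasedOn": original_urls}
    else:
        raise ValueError(f"unknown relation: {relation!r}")
    if doi_url is not None:
        markup["identifier"] = doi_url
    return markup

# Hypothetical originals and DOI, for illustration only.
copy_markup = provenance_markup("republished", ["https://example.org/original/snow-depth"])
derived_markup = provenance_markup(
    "aggregated",
    ["https://example.org/original/snow-depth", "https://example.net/stations/snow"],
    doi_url="https://doi.org/10.1234/example-doi",
)

print(json.dumps(copy_markup, indent=2))
print(json.dumps(derived_markup, indent=2))
```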

We hope to improve our recommendations based on feedback, in particular around the description of provenance, versioning, and the dates associated with time series publication. Please join in community discussions at schema.org.

More information

For more information about our approach to dataset discovery, see Facilitating the discovery of public datasets.
