Datasets are easier to find when you provide supporting information such as their name, description, creator and distribution formats as structured data. Google's approach to dataset discovery makes use of schema.org and other metadata standards that can be added to pages that describe datasets. The purpose of this markup is to improve discovery of datasets from fields such as life sciences, social sciences, machine learning, civic and government data, and more. You can find datasets by using the Dataset Search tool.

Here are some examples of what can qualify as a dataset:
- A table or a CSV file with some data
- An organized collection of tables
- A file in a proprietary format that contains data
- A collection of files that together constitute some meaningful dataset
- A structured object with data in some other format that you might want to load into a special tool for processing
- Images capturing data
- Files relating to machine learning, such as trained parameters or neural network structure definitions
- Anything that looks like a dataset to you
Our approach to dataset discovery
We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format. We are also exploring experimental support for structured data based on W3C CSVW, and expect to evolve and adapt our approach as best practices for dataset description emerge. For more information about our approach to dataset discovery, see Making it easier to discover datasets.
Examples
Here's an example for datasets using JSON-LD syntax (preferred) in the Structured Data Testing Tool. The same vocabulary can also be used in RDFa 1.1, Microdata, or W3C DCAT vocabulary. The following example is based on a real-world dataset description.
Here's an example of a dataset in JSON-LD:
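As a minimal sketch, the Python below builds a small Dataset description and wraps it in a JSON-LD script element. Every value (name, description, URLs) is a hypothetical placeholder, and the helper name to_script_tag is an assumption made for illustration rather than part of any Google tool.

```python
import json

# Hypothetical dataset description; all values are illustrative placeholders.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": "Hypothetical daily snow depth measurements from example stations.",
    "url": "https://example.com/datasets/snow-depth",
    "keywords": ["snow depth", "climate"],
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}


def to_script_tag(data: dict) -> str:
    """Serialize a JSON-LD object into a script element for an HTML page."""
    # Escape "</" so a string value cannot close the script element early;
    # "<\/" is a valid JSON escape for the same characters.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = to_script_tag(dataset)
print(snippet)
```

The printed element can be pasted into the landing page for the dataset and checked with a validation tool.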
Here's an example of a dataset in RDFa:
Guidelines
Sites should follow the structured data guidelines. In addition, we recommend the sitemap and source and provenance best practices listed below.
Sitemap best practices
Use a sitemap file to help Google find your URLs. Using sitemap files and sameAs markup helps document how dataset descriptions are published throughout your site.
If you have a dataset repository, you likely have at least two types of pages: the canonical ("landing") pages for each dataset and pages that list multiple datasets (for example, search results, or some subset of datasets). We recommend that you add structured data about a dataset to the canonical pages. Use the sameAs property to link to the canonical page if you add structured data to multiple copies of the dataset, such as listings in search results pages.
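The listing-page case can be sketched as follows. This is an illustration under stated assumptions: the function name listing_entry and the example.com URLs are hypothetical, and the sketch only shows a copy of a dataset description pointing back to its canonical page through sameAs.

```python
import json


def listing_entry(name: str, description: str, canonical_url: str) -> dict:
    """Describe a dataset on a listing page and point at its canonical page."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # sameAs links this copy of the description to the landing page.
        "sameAs": canonical_url,
    }


# Hypothetical search-results page that repeats one dataset description.
entry = listing_entry(
    "Snow depth in Northern Hemisphere",
    "Hypothetical daily snow depth measurements from example stations.",
    "https://example.com/datasets/snow-depth",
)
print(json.dumps(entry, indent=2))
```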
Source and provenance best practices
It is common for open datasets to be republished, aggregated, and to be based on other datasets. This is an initial outline of our approach to representing situations in which a dataset is a copy of, or otherwise based upon, another dataset.
- Use the sameAs property to indicate the most canonical URLs for the original in cases when the dataset or description is a simple republication of materials published elsewhere. The value of sameAs needs to unambiguously indicate the dataset's identity. In other words, two different datasets should not use the same URL as their sameAs value.
- Use the isBasedOn property in cases where the republished dataset (including its metadata) has been changed significantly.
- When a dataset derives from or aggregates several originals, use the isBasedOn property.
- Use the identifier property to attach any relevant Digital Object identifiers (DOIs) or Compact Identifiers. If the dataset has more than one identifier, repeat the identifier property. If using JSON-LD, this is represented using JSON list syntax.
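One way to read the rules above is as a small decision procedure. The sketch below is an interpretation, not an official algorithm: the function provenance_markup, its parameters, and the example URLs are all hypothetical.

```python
from typing import Dict, List, Union


def provenance_markup(
    original_urls: List[str],
    changed_significantly: bool,
    identifiers: List[str],
) -> Dict[str, Union[str, List[str]]]:
    """Pick sameAs or isBasedOn, and attach identifiers, per the list above."""
    markup: Dict[str, Union[str, List[str]]] = {}
    if len(original_urls) > 1:
        # Derived from or aggregating several originals.
        markup["isBasedOn"] = list(original_urls)
    elif original_urls and changed_significantly:
        # A single original, republished with significant changes.
        markup["isBasedOn"] = original_urls[0]
    elif original_urls:
        # A simple republication of a single original.
        markup["sameAs"] = original_urls[0]
    if len(identifiers) == 1:
        markup["identifier"] = identifiers[0]
    elif identifiers:
        # Several identifiers are written with JSON list syntax.
        markup["identifier"] = list(identifiers)
    return markup


republished = provenance_markup(
    ["https://example.org/original"], False, ["https://doi.org/10.1111/111"]
)
aggregated = provenance_markup(
    ["https://example.org/a", "https://example.org/b"],
    False,
    ["https://doi.org/10.1111/111", "https://identifiers.org/arxiv:0111.1111v1"],
)
print(republished)
print(aggregated)
```

The returned dictionary is meant to be merged into a larger Dataset description before serialization.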
We hope to improve our recommendations based on feedback, in particular around the description of provenance, versioning, and the dates associated with time series publication. Please join in community discussions.
Textual property recommendations
We recommend limiting all textual properties to 5000 characters or less. Google Dataset Search only uses the first 5000 characters of any textual property. Names and titles are typically a few words or a short sentence.
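A simple pre-publication check for this limit might look like the sketch below. It is an assumption-laden illustration: the helper overlong_text_properties is hypothetical, it inspects only top-level string values, and it counts Unicode code points with Python's len, which may differ from how Dataset Search counts characters.

```python
MAX_TEXT_LENGTH = 5000  # limit stated in the recommendation above


def overlong_text_properties(markup: dict, limit: int = MAX_TEXT_LENGTH) -> dict:
    """Return {property: length} for top-level string values longer than limit."""
    return {
        key: len(value)
        for key, value in markup.items()
        if isinstance(value, str) and len(value) > limit
    }


# Hypothetical markup with a description that is one character too long.
markup = {
    "name": "Snow depth in Northern Hemisphere",
    "description": "a" * 5001,
    "keywords": ["snow depth"],
}
report = overlong_text_properties(markup)
print(report)
```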
Known Errors and Warnings
You may experience errors or warnings in Google's Structured Data Testing Tool and other validation systems. Specifically, validation systems may suggest that organizations should have contact information including a contactType; useful values include customer service, emergency, journalist, newsroom, and public engagement.
You can also ignore errors for csvw:Table being an unexpected value for the mainEntity property.
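If you choose to address the contact information suggestion, a sketch of an organization with a contact point could look like this. The organization name and email address are hypothetical, and attaching the contact point through the schema.org contactPoint property is an assumption about how a validator's suggestion would be satisfied.

```python
import json

# Hypothetical organization; the name and email address are placeholders.
organization = {
    "@type": "Organization",
    "name": "Fictitious Research Consortium",
    "contactPoint": {
        "@type": "ContactPoint",
        # One of the contactType values listed above.
        "contactType": "customer service",
        "email": "data-help@example.com",
    },
}

print(json.dumps({"creator": organization}, indent=2))
```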
Structured data type definitions
You must include the required properties for your content to be eligible for display as a rich result. You can also include the recommended properties to add more information about your content, which could provide a better user experience.
You can use the Structured Data Testing Tool to validate your markup.
The focus is on describing information about a dataset (its metadata) and representing its contents. For example, dataset metadata states what the dataset is about, which variables it measures, who created it, and so on. It does not, for example, contain specific values for the variables.
Dataset
The full definition of Dataset is available at schema.org/Dataset.
You can describe additional information about the publication of the dataset, such as the license, when it was published, its DOI, or a sameAs pointing to a canonical version of the dataset in a different repository. Add identifier, license, and sameAs for datasets that provide provenance and license information.
Required properties | |
---|---|
description | Text. A short summary describing a dataset. |
name | Text. A descriptive name of a dataset. For example, "Snow depth in Northern Hemisphere". |
Recommended properties | |
---|---|
alternateName | Text. Alternative names that have been used to refer to this dataset, such as aliases or abbreviations. Example (in JSON-LD format): "name": "The Quick, Draw! Dataset", "alternateName": ["Quick Draw Dataset", "quickdraw-dataset"] |
creator | Person or Organization. The creator or author of this dataset. To uniquely identify individuals, use ORCID ID as the value of the sameAs property of the Person type. Example (in JSON-LD format): "creator": [ { "@type": "Person", "sameAs": "http://orcid.org/0000-0000-0000-0000", "givenName": "Jane", "familyName": "Foo", "name": "Jane Foo" }, { "@type": "Person", "sameAs": "http://orcid.org/0000-0000-0000-0001", "givenName": "Jo", "familyName": "Bar", "name": "Jo Bar" }, { "@type": "Organization", "sameAs": "http://ror.org/xxxxxxxxx", "name": "Fictitious Research Consortium" } ] |
citation | Text or CreativeWork. Identifies academic articles that the data provider recommends citing in addition to the dataset itself. Provide the citation for the dataset itself with other properties of the Dataset, not with citation. Examples (in JSON-LD format): "citation": "https://doi.org/10.1111/111"; "citation": "https://identifiers.org/pubmed:11111111"; "citation": "https://identifiers.org/arxiv:0111.1111v1"; "citation": "Doe J (2014) Influence of X ... https://doi.org/10.1111/111" |
hasPart or isPartOf | URL or Dataset. If the dataset is a collection of smaller datasets, use the hasPart property to list them. If the dataset is part of a larger dataset, use isPartOf. Examples (in JSON-LD format): "hasPart" : [ { "@type": "Dataset", "name": "Sub dataset 01", "description": "Informative description of the first subdataset...", "license" : "https://creativecommons.org/publicdomain/zero/1.0/" }, { "@type": "Dataset", "name": "Sub dataset 02", "description": "Informative description of the second subdataset...", "license" : "https://creativecommons.org/publicdomain/zero/1.0/" } ]; "isPartOf" : "https://example.com/aggregate_dataset" |
identifier | URL, Text, or PropertyValue. An identifier, such as a DOI or a Compact Identifier. If the dataset has more than one identifier, repeat the identifier property. If using JSON-LD, this is represented using JSON list syntax. |
keywords | Text. Keywords summarizing the dataset. |
license | URL or CreativeWork. A license under which the dataset is distributed. For example: "license" : "https://creativecommons.org/publicdomain/zero/1.0/"; "license" : { "@type": "CreativeWork", "name": "Custom license", "url": "https://example.com/custom_license" } |
sameAs | URL. URL of a reference Web page that unambiguously indicates the dataset's identity, usually in a different repository. |
spatialCoverage | Text or Place. You can provide a single point, a shape, or a named location that describes the spatial aspect of the dataset. Only include this property if the dataset has a spatial dimension. For example, a single point where all the measurements were collected, or the coordinates of a bounding box for an area. Points: "spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoCoordinates", "latitude": 39.3280, "longitude": 120.1633 } }. Shapes: use GeoShape to describe areas of different shapes, for example, to specify a bounding box: "spatialCoverage": { "@type": "Place", "geo": { "@type": "GeoShape", "box": "39.3280 120.1633 40.445 123.7878" } }. Points inside a shape property such as box are written as space-separated pairs of latitude and longitude, in that order. Named locations: "spatialCoverage": "Tahoe City, CA" |
temporalCoverage | Text. The data in the dataset covers a specific time interval. Only include this property if the dataset has a temporal dimension. Schema.org uses the ISO 8601 standard to describe time intervals and time points. You can describe dates differently depending upon the dataset interval. Indicate open-ended intervals with two decimal points (..). Single date: "temporalCoverage" : "2008"; time period: "temporalCoverage" : "1950-01-01/2013-12-18"; open-ended time period: "temporalCoverage" : "2013-12-19/.." |
variableMeasured | Text or PropertyValue. The variable that this dataset measures. For example, temperature or pressure. |
version | Text or Number. The version number for the dataset. |
url | URL. Location of a page describing the dataset. |
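Several of the recommended properties follow fixed string formats, which makes them easy to generate. The sketch below shows one possible approach; the helper names temporal_coverage and bounding_box, and all values, are hypothetical. It assumes the interval and box formats described in the table above.

```python
import json
from datetime import date
from typing import Optional


def temporal_coverage(start: date, end: Optional[date] = None) -> str:
    """Format an ISO 8601 interval; a missing end date yields an open interval."""
    end_text = end.isoformat() if end else ".."
    return f"{start.isoformat()}/{end_text}"


def bounding_box(lat1: float, lon1: float, lat2: float, lon2: float) -> dict:
    """Build a Place whose GeoShape box holds two latitude/longitude pairs."""
    return {
        "@type": "Place",
        "geo": {"@type": "GeoShape", "box": f"{lat1} {lon1} {lat2} {lon2}"},
    }


# Hypothetical dataset using the helpers; the ORCID value is a placeholder.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": "Hypothetical daily snow depth measurements.",
    "creator": [
        {
            "@type": "Person",
            "sameAs": "http://orcid.org/0000-0000-0000-0000",
            "name": "Jane Foo",
        }
    ],
    "temporalCoverage": temporal_coverage(date(1950, 1, 1), date(2013, 12, 18)),
    "spatialCoverage": bounding_box(39.328, 120.1633, 40.445, 123.7878),
}
print(json.dumps(dataset, indent=2))
```

Calling temporal_coverage with only a start date produces the open-ended form that ends in two decimal points.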
DataCatalog
The full definition of DataCatalog is available at schema.org/DataCatalog.
Datasets are often published in repositories that contain many other datasets. The same dataset can be included in more than one such repository. You can refer to a data catalog that this dataset belongs to by referencing it directly.
Recommended properties | |
---|---|
includedInDataCatalog | DataCatalog. The catalog to which the dataset belongs. |
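A direct reference to a catalog can be sketched as a nested DataCatalog object. The catalog name and URLs below are hypothetical placeholders.

```python
import json

# Hypothetical dataset that names the catalog it belongs to.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": "Hypothetical daily snow depth measurements.",
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Hypothetical Open Data Portal",
        "url": "https://example.com/catalog",
    },
}
print(json.dumps(dataset, indent=2))
```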
DataDownload
The full definition of DataDownload is available at schema.org/DataDownload. In addition to Dataset properties, add the following properties for datasets that provide download options.
Because the url property often points to the landing page describing the dataset, the distribution property describes where to get the data itself and in what format. This property can have several values: for instance, a CSV version has one URL and an Excel version is available at another.
Required properties | |
---|---|
distribution.contentUrl | URL. The link for the download. |
Recommended properties | |
---|---|
distribution | DataDownload. The description of the location for download of the dataset and the file format for download. |
distribution.encodingFormat | Text or URL. The file format of the distribution. |
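The CSV and Excel case described above can be sketched as a distribution list with one DataDownload per format. The helper data_download and the download URLs are hypothetical, and expressing encodingFormat as a media type is an assumption; the table above only says it is the file format.

```python
import json


def data_download(content_url: str, encoding_format: str) -> dict:
    """Describe one downloadable copy of the data."""
    return {
        "@type": "DataDownload",
        "encodingFormat": encoding_format,
        "contentUrl": content_url,
    }


# Hypothetical dataset offered in two formats at two different URLs.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Snow depth in Northern Hemisphere",
    "description": "Hypothetical daily snow depth measurements.",
    "url": "https://example.com/datasets/snow-depth",
    "distribution": [
        data_download("https://example.com/downloads/snow-depth.csv", "text/csv"),
        data_download(
            "https://example.com/downloads/snow-depth.xls",
            "application/vnd.ms-excel",
        ),
    ],
}
print(json.dumps(dataset, indent=2))
```

Here url still points at the landing page, while each contentUrl points at a file.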
Tabular datasets
A tabular dataset is one organized primarily in terms of a grid of rows and columns. For pages that embed tabular datasets, you can also create more explicit markup, building on the basic approach described above. At this time we understand a variation of CSVW ("CSV on the Web", see W3C), provided in parallel to user-oriented tabular content on the HTML page.
Here is an example showing a small table encoded in CSVW JSON-LD format. There are some known errors in the Structured Data Testing Tool.
Help and tools
- Google's Structured Data Markup Helper has support for Dataset markup.
- Use the Rich result status report in Search Console to see how your dataset performs in Google Search results. You can automatically pull these results with the Search Console API.
- The Google Webmaster Central Help Forum for Structured Data provides a community forum where you can ask (and answer) questions about structured data (including Datasets) and review our Frequently Asked Questions about Datasets.