Google BigQuery

Preparing Data for BigQuery

Depending on your data's structure, you might need to prepare the data before loading it into BigQuery. For example, you might need to export your data into a different format, or transform the data. This topic describes the data types and formats that BigQuery expects.

Contents

Data formats

BigQuery supports two data formats:

Choose CSV or JSON based upon the following factors:

  • Your data's schema.

    CSV and JSON both support flat data. JSON also supports data with nested/repeated fields. Nested/repeated data can be useful for expressing hierarchical data, and reduces duplication when denormalizing the data.

  • Embedded newlines.

    If your data contains embedded newlines, BigQuery can load the data much faster in JSON format.

  • External limitations.

    For example, your data might come from a document store database that natively stores data in JSON format. Or, your data might come from a source that only exports in CSV format.

When loading data into BigQuery, specify the data format using the configuration.load.sourceFormat property.

Data format limits

CSV and JSON have the following row and cell size limits:

Data format Max limit
CSV 64 KB (row and cell size)
JSON 2 MB (row size)

CSV and JSON have the following file size limits

File Type Compressed Uncompressed
CSV 1 GB
  • With new-lines in strings: 4 GB
  • Without new-lines in strings: 1 TB
JSON 1 GB 1 TB

CSV format

You can specify additional properties in the configuration.load object to change how BigQuery parses CSV data.

CSV data type BigQuery property
Header rows skipLeadingRows
Newline characters allowQuotedNewlines
Custom field delimiters fieldDelimiter
Trailing optional columns allowJaggedRows

JSON format

One JSON object, including any nested/repeated fields, must appear on each line.

The following example shows sample nested/repeated data:

{"kind": "person", "fullName": "John Doe", "age": 22, "gender": "Male", "citiesLived": [{ "place": "Seattle", "numberOfYears": 5}, {"place": "Stockholm", "numberOfYears": 6}]}
{"kind": "person", "fullName": "Jane Austen", "age": 24, "gender": "Female", "citiesLived": [{"place": "Los Angeles", "numberOfYears": 2}, {"place": "Tokyo", "numberOfYears": 2}]}

Back to top

Data types

Your data can include strings, integers, floats, booleans, nested/repeated records and timestamps. For detailed information about each data type, see schema.fields.type.

Back to top

Data encoding

BigQuery supports UTF-8 encoding for both nested/repeated and flat data, and supports ISO-8859-1 encoding for flat data.

If you plan to load ISO-8859-1 encoded flat data, specify the configuration.load.encoding property.

Back to top

Data Compression

BigQuery can load uncompressed files significantly faster than compressed files due to parallel load operations, but because uncompressed files are larger in size, using them can lead to bandwidth limitations and higher Google Cloud Storage costs. For example, uncompressed files that live on third-party services can consume considerable bandwidth and time if uploaded to Google Cloud Storage for loading. It's important to weigh these tradeoffs depending on your use case.

In general, if bandwidth is limited, gzip compress files before uploading them to Google Cloud Storage. If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.

Back to top

Denormalizing data

Many developers are accustomed to working with relational databases and normalized data schemas. Normalization eliminates duplicate data from being stored, and provides an important benefit of consistency when regular updates are being made to the data.

In BigQuery, you typically want to denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure.

Some type of normalization is possible with the nested/repeated functionality.

Let's take a simple example -- recording the cities that a list of people lived in during their lives. Here's how you'd typically design that in a relational database (assuming each person's name is unique):

Each person exists once in the people table, and has one or more rows in cities_lived, representing each city they lived in during their life. This model could be used in BigQuery, but would require a JOIN, which can potentially impact performance.

Denormalizing this to a flat schema would look like:

Each person could potentially have many rows in the table -- one row for each city they've lived in during their life. The age and gender information for the person is duplicated in each of these rows, meaning multiple values for each exist. Assuming that this table represents a snapshot in time, these values will each be the same, though will occupy some space in the table.

Now, let's convert to a nested/repeated schema, which can be loaded into BigQuery using a JSON data format.

Each person exists once in this table, with one value for age and one value for gender. We've nested one more more sets of city+years_lived pairs in a repeated field called cities_lived. This avoids duplication of data, keeps a natural format for understanding the data, and retains the high performance characteristics of the flat schema.

Back to top

Authentication required

You need to be signed in with Google+ to do that.

Signing you in...

Google Developers needs your permission to do that.