Google BigQuery

Dealing with Data

This document describes how to deal with BigQuery data, such as setting parameters and handling nested and repeated fields.

Contents

Parameters

REST API methods accept three types of parameters: path parameters, query parameters, and body parameters. The following method signature demonstrates all three parameter types:

PUT https://www.googleapis.com/bigquery/v2/projects/{projectId}/datasets/{datasetId}/tables/{tableId}?userIp="192.0.2.211"
{
   "friendlyName": string,
   "description": string
}
  • projectId , datasetId, and tableId are all path parameters
  • userIp is a query parameter
  • friendlyName and description are both body parameters

The API documentation lists all the query parameters defined specifically by BigQuery. Query parameters that apply to all operations are shown below.

Setting Parameters

Different client libraries expose different techniques for setting these different types of parameters. For example, when using the Python client, you set path and query parameters the same way, but use a different method to set body parameters:

updateResponse = tableCollection.update(projectId='1234',           # Path param
                                        datasetId='5678',           # Path param
                                        tableId='9012',             # Path param
                                        userIp='192.0.2.211'        # Query param
                                        body={'friendlyName':'Donut Count',                 # Body params
                                              'description':'Worldwide donut usage count'}) #

Back to top

Query parameters that apply to all Google BigQuery API operations are shown in the table below.

Notes (on API keys and auth tokens):

  1. The key parameter is required with every request, unless you provide an OAuth 2.0 token with the request.
  2. You must send an authorization token with every request that is marked (AUTHENTICATED). OAuth 2.0 is the preferred authorization protocol.
  3. You can provide an OAuth 2.0 token with any request in one of two ways:
    • Using the access_token query parameter like this: ?access_token=oauth2-token
    • Using the HTTP Authorization header like this: Authorization: Bearer oauth2-token

All parameters are optional except where noted.

Parameter Meaning Notes
access_token OAuth 2.0 token for the current user.
callback Callback function.
  • Name of the JavaScript callback function that handles the response.
  • Used in JavaScript JSON-P requests.
fields Selector specifying a subset of fields to include in the response.
  • For more information, see the partial response section in the Performance Tips document.
  • Use for better performance.
key API key. (REQUIRED*)
  • *Required unless you provide an OAuth 2.0 token.
  • Your API key identifies your project and provides you with API access, quota, and reports.
  • Obtain your project's API key from the Google Developers Console.
prettyPrint

Returns response with indentations and line breaks.

  • Returns the response in a human-readable format if true.
  • Default value: true.
  • When this is false, it can reduce the response payload size, which might lead to better performance in some environments.
quotaUser Alternative to userIp.
  • Lets you enforce per-user quotas from a server-side application even in cases when the user's IP address is unknown. This can occur, for example, with applications that run cron jobs on App Engine on a user's behalf.
  • You can choose any arbitrary string that uniquely identifies a user, but it is limited to 40 characters.
  • Overrides userIp if both are provided.
  • Learn more about capping usage.
userIp IP address of the end user for whom the API call is being made.
  • Lets you enforce per-user quotas when calling the API from a server-side application.
  • Learn more about capping usage.

Paging Through list Results

All collection.list methods return paginated results under certain circumstances. The number of results per page is controlled by the maxResults property.

Method Pagination criteria Default maxResults value Maximum maxResults value
Tabledata.list Returns paginated results if the response size is more than 10 MB of serialized JSON or more than maxResults rows. 100,000 100,000
All other collection.list methods Returns paginated results if the response is more than maxResults rows. 50 1,000

If you set maxResults to a value greater than the maximum value listed above, the results are paginated based on the maximum value.

A page is a subset of the total number of rows. If your results are more than one page of data, the result data will have a nextPageToken property. To retrieve the next page of results, make another list call and include the token value as a URL parameter named pageToken.

The bigquery.tabledata.list method, which is used to page through table data, uses a row offset value or a page token. See Browsing Through Table Data for information.

Requesting Arbitrary Pages, and Avoiding Redundant List Calls

When you page backwards or jump to arbitrary pages using cached nextPageToken values, it is possible that the data in your pages might have changed since it was last viewed but there is no clear indication that the data might have changed. To mitigate this, you can use the Etag property.

Every collection.list method (except for Tabledata) returns an Etag property in the result. This property is a hash of the page results that can be used to verify whether the page has changed since last request. When you make a request to BigQuery with an Etag value, BigQuery compares the Etag value to the ETag value returned by the API and responds based on whether the ETag values match. You can use ETags to help avoid redundant list calls in the following ways:

  • If you only want to return list values if the values have changed:

    If you want only want to return a page of list values if the values have changed, you can make a list call with a previously-stored ETag using the HTTP "if-none-match" header. If the ETag you provide doesn't match the ETag on the server, BigQuery returns a page of new list values. If the ETags do match, BigQuery returns a HTTP 304 "Not Modified" result and no values. An example of this might be a webpage where users might periodically fill in information that is stored in BigQuery. You can avoid making redundant list calls to BigQuery if there are no changes to your data by using the if-none-match header with ETags.

  • If you only want to return list values if the values have not changed:

    If you only want to return a page of list values if the list values have not changed, you can use the HTTP "if-match" header. BigQuery matches the ETag values and returns the page of results if the results have not changed or returns a 412 "Precondition Failed" result if the page has changed.

Note: Although ETags are a great way to avoid making redundant list calls, you can apply the same methods to identifying if any objects have changed. For example, you can perform a Get request for a specific table and use ETags to determine if the table has changed before returning the full response.

Back to top

Nested and Repeated Data

BigQuery supports importing and exporting nested and repeated data in the form of JSON files. Once imported you can query your repeated and nested data using the FLATTEN and WITHIN SQL functions.

You can try running these queries yourself by downloading the following sample files and uploading it to your BigQuery account:

FLATTEN

When you query nested data, BigQuery automatically flattens the table data for you. For example, let's take a look at personsDataSchema.json:

   Last modified                 Schema                 Total Rows   Total Bytes   Expiration
 ----------------- ----------------------------------- ------------ ------------- ------------
  27 Sep 10:01:06   |- kind: string                     4            794
                    |- fullName: string (required)
                    |- age: integer
                    |- gender: string
                    +- phoneNumber: record
                    |  |- areaCode: integer
                    |  |- number: integer
                    +- children: record (repeated)
                    |  |- name: string
                    |  |- gender: string
                    |  |- age: integer
                    +- citiesLived: record (repeated)
                    |  |- place: string
                    |  +- yearsLived: integer (repeated)

Notice that there are several repeated and nested fields. If you run the following query:

SELECT
  fullName AS name,
  age,
  gender,
  citiesLived.place,
  citiesLived.yearsLived
FROM [dataset.tableId]

Notice that BigQuery returns your data with a flattened output:

+---------------+-----+--------+-------------------+------------------------+
|     name      | age | gender | citiesLived_place | citiesLived_yearsLived |
+---------------+-----+--------+-------------------+------------------------+
| John Doe      |  22 | Male   | Seattle           |                   1995 |
| John Doe      |  22 | Male   | Stockholm         |                   2005 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   1989 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   1993 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   1998 |
| Mike Jones    |  35 | Male   | Los Angeles       |                   2002 |
| Mike Jones    |  35 | Male   | Washington DC     |                   1990 |
| Mike Jones    |  35 | Male   | Washington DC     |                   1993 |
| Mike Jones    |  35 | Male   | Washington DC     |                   1998 |
| Mike Jones    |  35 | Male   | Washington DC     |                   2008 |
| Mike Jones    |  35 | Male   | Portland          |                   1993 |
| Mike Jones    |  35 | Male   | Portland          |                   1998 |
| Mike Jones    |  35 | Male   | Portland          |                   2003 |
| Mike Jones    |  35 | Male   | Portland          |                   2005 |
| Mike Jones    |  35 | Male   | Austin            |                   1973 |
| Mike Jones    |  35 | Male   | Austin            |                   1998 |
| Mike Jones    |  35 | Male   | Austin            |                   2001 |
| Mike Jones    |  35 | Male   | Austin            |                   2005 |
| Anna Karenina |  45 | Female | Stockholm         |                   1992 |
| Anna Karenina |  45 | Female | Stockholm         |                   1998 |
| Anna Karenina |  45 | Female | Stockholm         |                   2000 |
| Anna Karenina |  45 | Female | Stockholm         |                   2010 |
| Anna Karenina |  45 | Female | Russia            |                   1998 |
| Anna Karenina |  45 | Female | Russia            |                   2001 |
| Anna Karenina |  45 | Female | Russia            |                   2005 |
| Anna Karenina |  45 | Female | Austin            |                   1995 |
| Anna Karenina |  45 | Female | Austin            |                   1999 |
+---------------+-----+--------+-------------------+------------------------+

In this example, citiesLived.place is now citiesLived_place and citiesLived.yearsLived is now citiesLived_yearsLived.

Although BigQuery can automatically flatten nested fields, you may need to explicitly call FLATTEN when dealing with more than one repeated field. For example, if you try to run the following query:

SELECT fullName, age
FROM [dataset.tableId]
WHERE
  (citiesLived.yearsLived > 1995 ) AND
  (children.age > 3)

BigQuery returns an error similar to the following:

Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived

To query across more than one repeated field, you need to flatten one of the fields:

SELECT
  fullName,
  age,
  gender,
  citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
  (citiesLived.yearsLived > 1995) AND
  (children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place

Which returns:

+------------+-----+--------+-------------------+
|  fullName  | age | gender | citiesLived_place |
+------------+-----+--------+-------------------+
| John Doe   |  22 | Male   | Stockholm         |
| Mike Jones |  35 | Male   | Los Angeles       |
| Mike Jones |  35 | Male   | Washington DC     |
| Mike Jones |  35 | Male   | Portland          |
| Mike Jones |  35 | Male   | Austin            |
+------------+-----+--------+-------------------+

WITHIN Clause

The WITHIN keyword specifically works with aggregate functions to aggregate across children and repeated fields within records and nested fields. When you specify the WITHIN keyword, you need to specify the scope over which you want to aggregate:

  • WITHIN RECORD: Aggregates data in the repeated values within the record.
  • WITHIN node_name: Aggregates data in the repeated values within the specified node, where a node is a parent node of the field in the aggregation function.

Suppose that you want to find the number of children each person in our previous example has. To do so, you can count the number of children.name each record has:

SELECT
  fullName,
  COUNT(children.name) WITHIN RECORD AS numberOfChildren
FROM [dataset.tableId];

You get the following result:

+---------------+------------------+
|   fullName    | numberOfChildren |
+---------------+------------------+
| John Doe      |                2 |
| Jane Austen   |                2 |
| Mike Jones    |                3 |
| Anna Karenina |                0 |
+---------------+------------------+

To compare, try listing all of the children's names:

SELECT fullName, children.name
FROM [dataset.tableId]
+---------------+---------------+
|   fullName    | children_name |
+---------------+---------------+
| John Doe      | Jane          |
| John Doe      | John          |
| Jane Austen   | Josh          |
| Jane Austen   | Jim           |
| Mike Jones    | Earl          |
| Mike Jones    | Sam           |
| Mike Jones    | Kit           |
| Anna Karenina | None          |
+---------------+---------------+

This matches with our WITHIN RECORD query results; John Doe does have two children named Jane and John, Jane Austen has two children named Josh and Kim, Mike Jones has three children named Earl, Sam, and Kit, and Anna Karenina doesn't have any children.

Now, suppose that you want to find the number of times a person has lived in different places. You can use the WITHIN clause to aggregate across one particular node:

SELECT
  fullName,
  COUNT(citiesLived.place) WITHIN RECORD AS numberOfPlacesLived,
  citiesLived.place,
  COUNT(citiesLived.yearsLived) WITHIN citiesLived AS numberOfTimesInEachCity,
FROM [dataset.tableId];
+---------------+---------------------+-------------------+-------------------------+
|   fullName    | numberOfPlacesLived | citiesLived_place | numberOfTimesInEachCity |
+---------------+---------------------+-------------------+-------------------------+
| John Doe      |                   2 | Seattle           |                       1 |
| John Doe      |                   2 | Stockholm         |                       1 |
| Mike Jones    |                   4 | Los Angeles       |                       4 |
| Mike Jones    |                   4 | Washington DC     |                       4 |
| Mike Jones    |                   4 | Portland          |                       4 |
| Mike Jones    |                   4 | Austin            |                       4 |
| Anna Karenina |                   3 | Stockholm         |                       4 |
| Anna Karenina |                   3 | Russia            |                       3 |
| Anna Karenina |                   3 | Austin            |                       2 |
+---------------+---------------------+-------------------+-------------------------+

This query does the following:

  • Performs a WITHIN RECORD on citiesLived.place and counts the number of places each person has lived in
  • Performs a WITHIN on citiesLived.yearsLived and counts the number of times each person has lived in each city (counting just across citiesLived).

Back to top

Authentication required

You need to be signed in with Google+ to do that.

Signing you in...

Google Developers needs your permission to do that.