Google BigQuery

Streaming Data Into BigQuery

Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job. There are several important trade-offs to consider before choosing an approach.

Access control

Streaming data into BigQuery requires the following access levels.

Product     Access
BigQuery    WRITE access for the dataset that contains the destination table. For more information, see access control.

Quota policy

The following limits apply for streaming data into BigQuery; a client-side batching sketch that stays within them follows the list.

  • Maximum row size: 20 KB
  • Maximum data size of all rows, per insert: 1 MB
  • Maximum rows per second: 10,000 rows per second, per table. Exceeding this amount will cause quota_exceeded errors. For additional support up to 100,000 rows per second, per table, please contact a sales representative.
  • Maximum bytes per second: 10 MB per second, per table. Exceeding this amount will cause quota_exceeded errors.
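To stay within the per-insert limits above, a client can batch rows before calling insertAll(). The following is a minimal sketch using the Google APIs Client Library for Python (matching the samples later on this page); the stream_rows helper, its size estimates, and the bigquery service object are assumptions of the sketch, not part of the API.

import json

# Rough client-side batching: keep each insertAll request under the
# 1 MB per-insert limit and skip rows over the 20 KB row limit. The
# byte counts here are estimates based on the JSON encoding of each row.
MAX_INSERT_BYTES = 1000000
MAX_ROW_BYTES = 20 * 1024

def stream_rows(bigquery, project_id, dataset_id, table_id, rows):
    batch = []
    batch_bytes = 0
    for row in rows:
        entry = {"json": row}
        entry_bytes = len(json.dumps(entry))
        if entry_bytes > MAX_ROW_BYTES:
            continue  # row would be rejected; log or handle as needed
        if batch and batch_bytes + entry_bytes > MAX_INSERT_BYTES:
            bigquery.tabledata().insertAll(
                projectId=project_id, datasetId=dataset_id,
                tableId=table_id, body={"rows": batch}).execute()
            batch = []
            batch_bytes = 0
        batch.append(entry)
        batch_bytes += entry_bytes
    if batch:
        bigquery.tabledata().insertAll(
            projectId=project_id, datasetId=dataset_id,
            tableId=table_id, body={"rows": batch}).execute()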

Data availability

The first time a streaming insert occurs, the streamed data is inaccessible for a warm-up period of up to two minutes. After the warm-up period, all streamed data added during and after the warm-up period is immediately queryable. After several hours of inactivity, the warm-up period will occur again during the next insert.

Data durability

After data is streamed into BigQuery, it is temporarily stored in a data buffer; under normal conditions, data remains in this buffer from several minutes up to an hour. The data is then replicated to additional regions for higher durability.

Data consistency

To help ensure data consistency, you can supply an insertId for each inserted row. BigQuery remembers this ID for a period of several minutes. If you try to stream the same set of rows within that time period with the insertId property set, BigQuery uses the insertId property to de-duplicate the rows.

This behavior is important to remember when troubleshooting streaming inserts: if a request fails with a network error or a failure HTTP response code, there's no way to tell whether the insert succeeded, so retrying with the same insertId values lets BigQuery discard any duplicates.
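As a sketch of this pattern with the Google APIs Client Library for Python (matching the sample later on this page), each row carries an insertId derived from the record itself; the IDs and column values here are hypothetical.

# Derive each insertId from the record itself (for example, a client-side
# event ID) so that a retry re-sends the same ID instead of minting a new
# one. These rows and IDs are illustrative only.
rows = [
    {"insertId": "event-1001", "json": {"column_name": 7.7}},
    {"insertId": "event-1002", "json": {"column_name": 8.8}},
]
response = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    tableId=TABLE_ID,
    body={"rows": rows}).execute()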

Example use cases

High volume event logging

If you have an app that collects a large amount of data in real time, streaming inserts can be a good choice. Generally, these types of apps have the following criteria:

  • Not transactional. Small amounts of data loss or duplication, as well as short periods of time when data is unavailable, are tolerable.
  • Aggregate analysis. Queries generally are performed for trend analysis, as opposed to single or narrow record selection.

One example of high volume event logging is event tracking. Suppose you have a mobile app that tracks events. Your app, or your mobile servers, could independently record user interactions or system errors and stream them into BigQuery. You could analyze this data to determine overall trends, such as areas of high interaction or problems, and monitor error conditions in real time.

Real-time dashboards and queries

In certain situations, streaming data into BigQuery enables real-time analysis over transactional data. Since streaming data comes with a possibility of duplicated rows and data loss, ensure that you have a primary, transactional data store outside of BigQuery.

You can take a few precautions to ensure that you'll be able to perform analysis over transactional data, and also have an up-to-the-second view of your data:

  1. Create two tables with an identical schema. The first table is for the reconciled data, and the second table is for the real-time, unreconciled data.
  2. On the client side, maintain a transactional data store for records.
  3. Send fire-and-forget insertAll() requests for these records. The insertAll() request should specify the real-time, unreconciled table as the destination table.
  4. At some interval, append the reconciled data from the transactional data store and truncate the unreconciled data table.
  5. For real-time dashboards and queries, you can select data from both tables, as shown in the sketch after this list. The unreconciled data table might include duplicates or dropped records.
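As a sketch, the following query job reads from both tables by using the Google APIs Client Library for Python. The dataset and table names are hypothetical, and the query relies on legacy BigQuery SQL, where listing two tables in the FROM clause unions their rows.

# Query the reconciled and unreconciled tables together. In legacy
# BigQuery SQL, "FROM [a], [b]" behaves as a UNION ALL of the two tables.
# The dataset and table names here are hypothetical.
query = ("SELECT column_name, COUNT(*) AS record_count "
         "FROM [mydataset.reconciled], [mydataset.unreconciled] "
         "GROUP BY column_name")
response = bigquery.jobs().query(
    projectId=PROJECT_ID,
    body={"query": query}).execute()
for row in response.get("rows", []):
    print(row["f"][0]["v"], row["f"][1]["v"])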

Troubleshooting streaming inserts

Failure HTTP response codes

If a request fails, for example with a network error or a failure HTTP response code, there's no way to tell whether the streaming insert succeeded. If you simply re-send the request, you might end up with duplicated rows in your table. To help protect your table against duplication, set the insertId property when sending your request. BigQuery uses the insertId property for de-duplication.

If you receive a permission error, an invalid table name error or an exceeded quota error, no rows are inserted and the entire request fails.
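A minimal retry sketch under those rules: because each row carries an insertId (see Data consistency above), re-sending the same request after an ambiguous failure is safe, and BigQuery de-duplicates any rows that already landed. The retry count and backoff are arbitrary choices for illustration.

import time
from googleapiclient.errors import HttpError

# Re-send the same request (same rows, same insertIds) on transient
# failures; BigQuery de-duplicates rows that were already inserted.
for attempt in range(5):
    try:
        response = bigquery.tabledata().insertAll(
            projectId=PROJECT_ID,
            datasetId=DATASET_ID,
            tableId=TABLE_ID,
            body={"rows": rows}).execute()
        break
    except HttpError as err:
        # Client errors such as permission or invalid-table-name failures
        # fail the whole request and are not worth re-sending as-is.
        if err.resp.status < 500:
            raise
        time.sleep(2 ** attempt)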

Success HTTP response codes

Even if you receive a success HTTP response code, you'll need to check the insertErrors property of the response to determine if the row insertions were successful.

If BigQuery encountered errors processing individual rows of the request, such as a schema mismatch, none of the rows are inserted and an insertErrors entry is returned for each row. If the errors property of the insertErrors entry is empty, there was no problem with the row. Otherwise, the errors property contains detailed information about why the row caused an error.
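A short sketch of that check, written against the response object returned by the Python sample below:

# A 200 response can still carry per-row failures; inspect insertErrors.
# Entries with an empty errors list indicate rows that had no problem.
for entry in response.get("insertErrors", []):
    if entry.get("errors"):
        print("Row %d was rejected: %s" % (entry["index"], entry["errors"]))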

Streaming insert examples

Java

This sample uses the Google APIs Client Library for Java.

import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
import com.google.api.services.bigquery.model.TableRow;
import java.util.ArrayList;
import java.util.List;

// Build the row to insert.
TableRow row = new TableRow();
row.set("column_name", 7.7);
// Set an insertId for de-duplication; `timestamp` is a String assumed
// to be defined elsewhere.
TableDataInsertAllRequest.Rows rows = new TableDataInsertAllRequest.Rows();
rows.setInsertId(timestamp);
rows.setJson(row);
List<TableDataInsertAllRequest.Rows> rowList =
    new ArrayList<TableDataInsertAllRequest.Rows>();
rowList.add(rows);
TableDataInsertAllRequest content =
    new TableDataInsertAllRequest().setRows(rowList);
// Stream the row into the destination table.
TableDataInsertAllResponse response =
    bigquery.tabledata().insertAll(
        projectId, datasetId, tableId, content).execute();

Python

This sample uses the Google APIs Client Library for Python.

body = {"rows":[
    {"json": {"column_name":7.7,}}
    ]}
response = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    tableId=TABLE_ID,
    body=body).execute()
