Google BigQuery

bq Command-Line Tool

bq is a Python-based tool that accesses BigQuery from the command line.

Installation

Install the Google Cloud SDK.

General Usage

Flag Positioning

bq supports two kinds of flags, common flags and command-specific flags, and they must be used in the order shown here:

bq --common_flags <bq_command> --command-specific_flags <command_arguments>
  • Common flags are flags that apply to all commands, such as --project_id or --apilog. Call bq --helpshort for a list of often-used common flags.
  • Command-specific flags are any flags that apply to a specific command: for example, -p, -j, -d for the ls command (to list projects, jobs, and datasets respectively).

Listing common flags after the command will usually fail.

Example: bq --format=json ls -d

Help and Debugging

  • For a full list of commands, run bq help
  • For help on a specific command, run bq help <command>
  • To see the actual requests sent and received, which is useful for debugging your applications, add the flag --apilog=<file> to save a log of operations to that file. Using - instead of a file path prints the log to the console. Because bq works by making standard REST-based API calls, this log can be instructive to read; it is also useful to attach when reporting issues.

Setting Default Values for Common Flags

The first time you run bq, you will be prompted for a default project ID. This is stored in your .bigqueryrc file.

You can update this value, or add defaults for other common flags, by editing your .bigqueryrc file. Add new default values, one per line, in the form flagname=flagvalue. Only common flags can be given defaults in this file; adding a command-specific flag causes bq to raise an error. The file is read every time you run bq in single-use mode, so changes take effect immediately. However, when you run bq in interactive mode (bq shell), you must restart the shell before changes are picked up.

$ cat ~/.bigqueryrc
project_id = 1092187650
dataset_id=mydataset
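Because the file is just flagname=flagvalue lines, it is easy to inspect or generate from a script. The following is an illustrative sketch of parsing it, not part of bq itself:

```python
# Parse .bigqueryrc-style defaults: one "flagname = flagvalue" per line.
# Illustrative only; bq handles this file internally.

def parse_bigqueryrc(text):
    defaults = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        defaults[name.strip()] = value.strip()
    return defaults

sample = """project_id = 1092187650
dataset_id=mydataset
"""
print(parse_bigqueryrc(sample))
# → {'project_id': '1092187650', 'dataset_id': 'mydataset'}
```

Note that whitespace around `=` is tolerated here, matching the two styles shown in the sample file above.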

Running Asynchronous Operations

By default, the bq tool waits for long-running commands to finish, updating a status counter on the command line. If a command is going to take a very long time, you can instead ask for the job ID when starting the job and regain control immediately, then later ask bq to attach to that running job. This frees up your command line for other work. Note that this is useful only for potentially long-running jobs such as queries or loads.

To start an asynchronous job and ask for the ID, call your bq command with the --nosync flag, which will return the job ID. To get the job results (or print out job status), pass this job ID into the command bq wait <job_id>.

The following example starts an asynchronous load job, lists the datasets in the current project, and later makes a call to get the job results.

$ bq --nosync load mydataset.names2010 gs://bigquerybucket/yob2010.txt name:string,gender:string,count:integer
Successfully started load job job_f0afc92f7a9043849e5cae7b2e7ea2f9
$ bq ls
   datasetId
 -------------
  olddataset
  mydataset
$ bq wait job_f0afc92f7a9043849e5cae7b2e7ea2f9

Running bq in Interactive Mode

You can run bq in an interactive mode, where you don't need to prefix the commands with "bq". To start interactive mode, call bq shell. The prompt is the ID of the default project. To exit interactive mode, type "exit".

$ bq shell
Welcome to BigQuery! (Type help for more information.)
10000000021> ls
   datasetId
 -------------
  mydataset
10000000021> exit
Goodbye.

Useful Common Flags

Common flags are used between bq and the command. For a full list of flags, call bq --helpshort.

Here are some of the most useful flags:

  • --apilog - Turn on logging of all server requests and responses. If no value is provided, log to stdout; if a file path is provided, log to that file.
  • --format - [none|json|prettyjson|csv|sparse|pretty] The output format.

Listing Projects, Datasets, and Tables

To list objects, use the bq ls command with the following syntax:

bq ls [<project_id>:][<dataset_id>]

<project_id> and <dataset_id> are optional if those values are defined in your .bigqueryrc file; values passed to bq ls override any defaults in .bigqueryrc. Note that a project ID is not the friendly project name; it is usually just a string of digits.
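The [<project_id>:][<dataset_id>] form, with either part optional, splits on the colon. As a sketch (this helper is hypothetical, not part of bq):

```python
def split_identifier(ident):
    """Split "[project_id:][dataset_id]" into (project_id, dataset_id).

    Either part may be empty, mirroring the optional pieces in the
    bq ls syntax. Hypothetical helper for illustration only.
    """
    project, sep, dataset = ident.partition(":")
    if not sep:  # no colon: the whole string names a dataset
        return "", project
    return project, dataset

print(split_identifier("myproject:mydataset"))  # → ('myproject', 'mydataset')
print(split_identifier("mydataset"))            # → ('', 'mydataset')
print(split_identifier("myproject:"))           # → ('myproject', '')
```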

Working With Projects

List Projects

To list all projects:

bq ls -p

Set a Default Project

When you first run bq it will ask you for a default project ID and save that in the .bigqueryrc file. You can edit this file to change the value of your default project.

See Your Default Project

Run cat ~/.bigqueryrc to see your default values. Alternatively, when you run in interactive mode, the prompt lists the default project ID.

Working With Datasets

Set a Default Dataset

You can specify a default dataset, so that you don't have to qualify your tables or other actions with a dataset ID when using your default dataset. To do so, add the following line to your .bigqueryrc file:

dataset_id=<dataset_id>

A quick way to do this:

$ echo dataset_id=mydataset >> ~/.bigqueryrc

List Datasets

The syntax differs, depending on whether you have a default project or dataset defined:

# List datasets in the default project:
bq ls

# List datasets in another project:
bq ls my_project_id:

# List datasets when you have a default dataset defined:
bq ls -d
  OR
bq ls :

Working With Tables

Get Table Information

bq show <project_id>:<dataset_id>.<table_id>

Example

$ bq show publicdata:samples.shakespeare
    tableId      Last modified                  Schema
 ------------- ----------------- ------------------------------------
  shakespeare   01 Sep 13:46:28   |- word: string (required)
                                  |- word_count: integer (required)
                                  |- corpus: string (required)
                                  |- corpus_date: integer (required)

Previewing Table Data

bq head [-n <rows>] <project_id>:<dataset_id>.<table_id>

Example

$ bq head -n 10 publicdata:samples.shakespeare
+--------------+------------+--------------+-------------+
|     word     | word_count |    corpus    | corpus_date |
+--------------+------------+--------------+-------------+
| brave        |          6 | 1kinghenryiv |        1597 |
| profession   |          1 | 1kinghenryiv |        1597 |
| treason      |          2 | 1kinghenryiv |        1597 |
| Ned          |          9 | 1kinghenryiv |        1597 |
| answered     |          1 | 1kinghenryiv |        1597 |
| Perceived    |          1 | 1kinghenryiv |        1597 |
| 'You         |          1 | 1kinghenryiv |        1597 |
| degenerate   |          1 | 1kinghenryiv |        1597 |
| neighbouring |          1 | 1kinghenryiv |        1597 |
| grandam      |          1 | 1kinghenryiv |        1597 |
+--------------+------------+--------------+-------------+

Note that this operation is only intended for previewing the contents of a table and is not an efficient way to extract a large portion of a table. If not explicitly specified, the command defaults to returning 100 rows.

List Tables

The syntax differs, depending on whether you have a default project or dataset defined:

# In the default project:
bq ls <dataset_id>

# In the default project and dataset:
bq ls

# In another project or dataset:
bq ls [<project_id>:][<dataset_id>]

Create a Table from a File

You can load an uncompressed or gzipped CSV or JSON file from Google Cloud Storage or an uncompressed file from your disk to create a new table. You can create a new table by creating a new schema and then loading the files in separate calls, or you can combine both actions into a single call with the following syntax:

bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] <destination_table> <data_source_uri> <table_schema>

destination_table
The fully-qualified name of the table to create, or to append to if the table already exists.

--source_format
BigQuery accepts both CSV and JSON files. By default, if you do not explicitly specify the type of file, BigQuery expects a CSV file. If you load a JSON file, you must provide the --source_format=NEWLINE_DELIMITED_JSON flag. Your source file and schema must also follow the proper JSON structure.

data_source_uri
The source data file used to populate the table. This can be an uncompressed local file or a fully-qualified Google Cloud Storage URI referring to an uncompressed or gzipped file, in the format gs://bucket/file. For example, the following are all valid: my_file.csv, my_file.csv.gz, or gs://my_bucket/my_file.csv.gz.

table_schema
A description of the table schema to use. This can be either a file name or a comma-separated list of column_name:datatype pairs. This example uses the comma-separated list: try the schema descriptor name:string,gender:string,count:integer, where "name", "gender", and "count" are the labels assigned to the columns in the new table.
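The comma-separated schema descriptor has a simple structure. As an illustrative sketch (this helper is hypothetical, not part of bq), it breaks apart like so:

```python
def parse_schema(schema):
    """Parse a comma-separated "column:type" schema string, as passed
    to bq load, into (name, type) pairs. Illustrative sketch only."""
    columns = []
    for field in schema.split(","):
        name, _, col_type = field.partition(":")
        columns.append((name.strip(), col_type.strip()))
    return columns

print(parse_schema("name:string,gender:string,count:integer"))
# → [('name', 'string'), ('gender', 'string'), ('count', 'integer')]
```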

If you would like, you can also specify your schema in a separate file and provide that file as the table schema. Your schema file must contain a single array object with entries that provide the following properties:

  • "name": Name of the column
  • "type": Type of data, e.g. string
  • "mode" (optional): Whether this field can be null

A sample schema file might look like the following:

[
  {"name": "name", "type": "string", "mode": "required"},
  {"name": "gender", "type": "string", "mode": "nullable"},
  {"name": "count", "type": "integer", "mode": "required"}
]
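Since the schema file is ordinary JSON, it can be generated and sanity-checked with standard JSON tooling. For example (a sketch, not a bq API):

```python
import json

# The three properties described above: "name", "type", and optional "mode".
schema = [
    {"name": "name", "type": "string", "mode": "required"},
    {"name": "gender", "type": "string", "mode": "nullable"},
    {"name": "count", "type": "integer", "mode": "required"},
]

# Serialize to the single-array form a bq load schema file uses.
text = json.dumps(schema, indent=2)

# Round-trip check: every entry has a name and a type.
loaded = json.loads(text)
assert all("name" in col and "type" in col for col in loaded)
print(len(loaded))  # → 3
```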

See the bq command-line tool quickstart for a detailed walkthrough of creating and populating a table using bq.

When using the bq load command, you can specify the following optional flags:

--source_format (string)
  The source file type, either CSV or NEWLINE_DELIMITED_JSON.
  Usage: --source_format=NEWLINE_DELIMITED_JSON

-F, --field_delimiter (string)
  The character that indicates the boundary between columns in the input file. By default, this is a comma. Valid values: any ASCII character.
  Usage: -F '|' or --field_delimiter='|'

-E, --encoding (string)
  The character encoding used by the input file: UTF-8 or ISO-8859-1.
  Usage: -E ISO-8859-1 or --encoding=UTF-8

--max_bad_records (integer)
  The maximum number of bad rows to skip before the load job is aborted and no updates are performed. If this value is larger than 0, the job succeeds as long as the number of bad records does not exceed it. This is useful for loading files that may contain bad records. The default is 0 (all rows must be valid).
  Usage: --max_bad_records=3

--skip_leading_rows (integer)
  The number of rows to skip at the top of the file, which is useful for skipping header rows in a source CSV file. The default is 0 (all rows are treated as data rows).
  Usage: --skip_leading_rows=1
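Putting the flags together, a load invocation can also be assembled programmatically before handing it to a process launcher. The helper below is an illustrative sketch, not part of bq; the table, bucket, and schema names are placeholders:

```python
def build_load_command(table, source, schema, **flags):
    """Assemble a bq load argv list from the optional flags above.
    Hypothetical helper for illustration only."""
    argv = ["bq", "load"]
    for name, value in sorted(flags.items()):
        argv.append("--%s=%s" % (name, value))
    argv += [table, source, schema]
    return argv

cmd = build_load_command(
    "mydataset.names2010",
    "gs://bigquerybucket/yob2010.txt",
    "name:string,gender:string,count:integer",
    skip_leading_rows=1,
    max_bad_records=3,
)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run for execution.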

Character Encodings

By default, the BigQuery service expects all source data to be UTF-8 encoded. If you have CSV files with data encoded in ISO-8859-1 format, you should explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process. Currently, it is only possible to import data that is ISO-8859-1 or UTF-8 encoded. Keep in mind the following when you specify the character encoding of your data:

  • If you don't specify an encoding, or explicitly specify that your data is UTF-8 but then provide a CSV file that is not UTF-8 encoded, BigQuery attempts to convert your CSV file to UTF-8.

    Generally, your data will be imported successfully but may not match byte-for-byte what you expect. To avoid this, specify the correct encoding and try your import again.

  • Delimiters must be encoded as ISO-8859-1.

    Generally, it is best practice to use a standard delimiter, such as a tab, pipe, or comma.

  • If BigQuery cannot convert a character, it is converted to the standard Unicode replacement character: �.

JSON files must always be encoded in UTF-8.
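The conversion behavior described above can be reproduced with Python's built-in codecs: decoding bytes that are not valid UTF-8 with errors='replace' yields the same Unicode replacement character.

```python
# ISO-8859-1 bytes for "café": the é is the single byte 0xE9,
# which is not a valid UTF-8 sequence on its own.
raw = "café".encode("iso-8859-1")

# Declaring the correct encoding converts the data cleanly.
assert raw.decode("iso-8859-1") == "café"

# Treating the same bytes as UTF-8 forces the Unicode replacement
# character U+FFFD, mirroring what BigQuery does on unconvertible input.
print(raw.decode("utf-8", errors="replace"))  # → caf�
```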

Copy an Existing Table

To copy a table, run the bq cp command:

bq cp <source_table> <destination_table>

For example, a sample copy command might look like the following:

bq cp dataset.mytable dataset2.mynewtable

You can also copy tables across different projects by specifying the project ID in the source and destination paths:

bq cp 123456789123:dataset.mytable 0987654321098:dataset2.mynewtable

Note that when you perform a table copy, the destination table must not already exist. For example, if dataset2 already contains a table called mynewtable, the above command will fail and the command-line tool throws an exception. If you would like to append to or overwrite an existing table, you can programmatically start a copy job that does so using the API.

Create a Table From a Query

Run a query, and specify the --destination_table flag:

bq query --destination_table=mydataset.happyhalloween "SELECT name,count FROM mydataset.babynames WHERE gender = 'M' ORDER BY count DESC LIMIT 6"

Working With Queries

Run a Query

Run a query using bq query "query_string". A trailing ; (semicolon) is not required. You do not need to specify the project ID or dataset ID if you have defined default values. You can use the common --format flag to choose other output formats, such as JSON or CSV. If a query will take a long time to run, you can run it asynchronously from the command line.

To save the results to a table, use the --destination_table flag.

$ bq ls -d
   datasetId
 -------------
  mydataset

$ bq ls :mydataset
     tableId
 ---------------
  babynames

$ bq query "SELECT name,count FROM mydataset.babynames WHERE gender = 'M' ORDER BY count DESC LIMIT 6"
Waiting on job_a4e77f793e7b4d5bbc1fd69244d9e792 ... (0s) Current status: DONE
+----------+-------+
|   name   | COUNT |
+----------+-------+
| Zachary  | 22731 |
| Alfred   | 20477 |
| Gregory  | 17179 |
| Ned      | 16860 |
| Ulrich   | 15300 |
| Thomas   | 14995 |
+----------+-------+

You can also run batch queries by using the --batch flag.

The following example shows how to start an asynchronous batch query:

bq --nosync query --batch "SELECT name,count FROM mydataset.babynames WHERE gender = 'M' ORDER BY count DESC LIMIT 6"
