Google BigQuery

Codelab: Getting Started with BigQuery and the Google Java API Client library

Author: Michael Manoochehri, Updated: 12/04/2012, Tags: BigQuery, Java

Google's BigQuery Service features a REST-based API that allows developers to create applications to run ad-hoc queries on massive datasets. This article describes how to build a sample Java application that accesses the BigQuery API using the Google Java API Client Libraries.

If you haven't done so already, please read the BigQuery Getting Started and BigQuery Main Concept guides, which provide information about how the Google BigQuery service works, as well as typical use cases.

Contents

  1. Prerequisites for BigQuery Java Development
    1. Access to the BigQuery Service
    2. Using Downloaded .jar Files
    3. Using Maven to Manage Dependencies
  2. Creating and Authorizing a Java BigQuery API Client
    1. Create a Client ID and Client Secret
    2. Import Required Client Classes
    3. Prompt the User to Authorize Access to BigQuery
    4. Instantiate a BigQuery Client Object
  3. Making Calls to the BigQuery API
  4. Starting a Query Job and Polling for a Result
    1. Insert a Query Job
    2. Poll the BigQuery API for the Query Result
    3. Retrieve and Display Query Results
  5. Complete Source Code and Further Steps

Prerequisites for BigQuery Java Development

Before you begin to build anything with the BigQuery API, make sure that you have access to Google's BigQuery Service.

Access to the BigQuery Service

You'll need to create, or have access to, a Google API Console project that has the BigQuery API service activated.

To create a new project, visit the Google API Console and find the project name dropdown menu on the left side. Select "Create" and give your new project a name.

Activate the BigQuery API by navigating to the Services option, and toggling the status to On. Each Google developer project has an associated ID. Many of the BigQuery API methods require that a project ID be included as a parameter. The default ID of your current project can be found in the URL of the Google API Console, or by hovering over the name of the project in the BigQuery Browser Tool.

Now that we have created a Google APIs Console project with access to the BigQuery API, let's take a look at the additional libraries we need to access BigQuery from our Java development environment using the Google API Java Client library. At minimum, your project needs to access the classes contained in these files:

  1. The Generated Java client library for BigQuery
  2. The Google HTTP Client Library for Java
  3. The Google OAuth Client Library for Java

These libraries include classes for JSON manipulation and HTTP requests, along with classes to help with OAuth 2.0 authorization and token management. The generated BigQuery Java client provides classes specific to the BigQuery API.

Using Downloaded .jar Files

You can add the Generated BigQuery Java Client .jar files, as well as the HTTP and OAuth client library .jar files to your Java application directly, by download the adding them to the build path of your project.

Using Maven to Manage Dependencies

If your Java project is managed using Apache Maven, add the following information to your Maven project's pom.xml file:

1. First, add the following information to the <repositories> section of your pom.xml file.

<repositories>
    ...
    <repository>
      <id>google-api-services</id>
      <url>https://google-api-client-libraries.appspot.com/mavenrepo</url>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
</repositories>

2. Next, in the <dependencies> section of your pom.xml file, add the following dependency information with the google-api-services-bigquery artifact and version information.

<dependencies>
    ...
    <dependency>
        <groupId>com.google.apis</groupId>
        <artifactId>google-api-services-bigquery</artifactId>
        <version>v2-rev42-1.12.0-beta</version>
    </dependency>
    <dependency>
         <groupId>com.google.oauth-client</groupId>
         <artifactId>google-oauth-client</artifactId>
         <version>1.12.0-beta</version>
     </dependency>
    <dependency>
        <groupId>com.google.http-client</groupId>
        <artifactId>google-http-client-jackson2</artifactId>
        <version>1.12.0-beta</version>
    </dependency>
</dependencies>

Note: if you are using the Maven plugin for Eclipse (M2Eclipse), ensure that the options to Download Artifact Sources and Download Artifact JavaDoc are selected in the Preferences->Maven panel.

Creating and Authorizing a Java BigQuery API Client

All calls to the BigQuery API must be authorized by an authenticated user who is listed as a member of your Google APIs Console project's team. We recommend using the OAuth 2.0 protocol to authorize requests to the BigQuery API.

Create a Client ID and Client Secret

In order for Google to verify that your application should be allowed to access the BigQuery API, we need to create a Client ID and Client Secret. Since our example application runs locally on the desktop, (rather than as a web application), create a Client ID for "installed applications."

In the Google API Console (through which we created our project), navigate to the API Access panel.

  1. Click on the Create an OAuth 2.0 client ID button.
  2. From there, give your application a name ("BigQuery Java Demo" is not a bad choice).
  3. Choose the Installed Application type, then complete the by Client ID process by clicking the Create client ID button.

If this were a web application, the Callback URL would be an actual web URL that would handle OAuth responses from Google. However, since this sample represents a locally-run desktop application, we use the special redirect URI value "urn:ietf:wg:oauth:2.0:oob" to indicate that we will be authorizing an access token via a web browser, and manually entering this token into our application.

We reccomend using the extensible client_secrets.json format to store OAuth 2.0 Client IDs and secrets, as well as redirect and authoraztion URIs. The GoogleClientSecrets class can be used to load information from a client_secrets.json file.

{
    "installed": {
        "client_id": "XXXXXXXXXXXXXXXXXXXX.apps.googleusercontent.com",
        "client_secret": "XXXXXXXXXXXXXXXXXXXX",
        "redirect_uris": ["urn:ietf:wg:oauth:2.0:oob"],
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token"
    }
}

Provide the ID number specific to your project, as well as the location of your client_secrets.json file:

private static final String PROJECT_ID = "XXXXXXXXX";
private static final String CLIENTSECRETS_LOCATION = "client_secrets.json";

static GoogleClientSecrets clientSecrets = loadClientSecrets();

static GoogleClientSecrets loadClientSecrets() {
  try {
    GoogleClientSecrets clientSecrets =
      GoogleClientSecrets.load(new JacksonFactory(),
        BigQueryJavaGettingStarted.class.getResourceAsStream(CLIENTSECRETS_LOCATION));
    return clientSecrets;
  } catch (Exception e) {
    System.out.println("Could not load client_secrets.json");
    e.printStackTrace();
  }
}

All Google APIs that require authorization provide an OAuth "scope" value that determines the level of data access that an application can have. In this case, we are requesting a single scope that allows our application to access the BigQuery datasets available to the user who authorized access. We could also add multiple scopes if we wanted to authorize this application to have access to additional Google API services.

import com.google.api.services.bigquery.BigqueryScopes;

private static final List SCOPES = Arrays.asList(BigqueryScopes.BIGQUERY);

Import Required Client Classes

Now that we have set up a Project, Client ID/Secret, and defined our Redirect URI, let's import some OAuth helper classes available in the Google Java API client distribution. These classes help us generate a URL that allows us to authorize use of the API, retrieve an Access Token, and specify that the Access Token be used as a parameter in HTTP requests to the BigQuery API.

import com.google.api.client.auth.oauth2.Credential;
import com.google.api.client.googleapis.auth.oauth2.GoogleAuthorizationCodeFlow;
import com.google.api.client.googleapis.auth.oauth2.GoogleAuthorizationCodeRequestUrl;
import com.google.api.client.googleapis.auth.oauth2.GoogleClientSecrets;
import com.google.api.client.googleapis.auth.oauth2.GoogleTokenResponse;

Since the BigQuery API features a RESTful JSON-RPC API, we'll also import classes to help us parse and serialize JSON, and other classes to handle HTTP requests.

import com.google.api.client.http.HttpTransport;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;

Prompt the User to Authorize Access to BigQuery

The build method of the GoogleAuthorizationRequestUrl class provides an authorization URL based on the scope, callback URI, and Client ID that we have generated. Since this sample is a desktop application, we print out this authorization URL on the command line, and ask the user to navigate to the address with a web browser. The user is presented a webpage that requests permission to allow the application to access the BigQuery API. If permission is granted, an authorization code is displayed to the user, and the application pauses until the user pastes this code at the command line prompt.

String authorizeUrl =
   new GoogleAuthorizationCodeRequestUrl(clientSecrets,
                                         REDIRECT_URI,
                                         SCOPES).setState("").build();

System.out.println("Paste this URL into a web browser to authorize BigQuery Access:\n" + authorizeUrl);

System.out.println("... and type the code you received here: ");
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String authorizationCode = in.readLine();

Once the authorization code is provided by the user, the GoogleAuthorizationCodeGrant class allows our application to obtain an OAuth 2.0 access token from Google. This access token is included in subsequent calls to the API.

static Credential exchangeCode(String authorizationCode) throws IOException {
    GoogleAuthorizationCodeFlow flow = getFlow();
    GoogleTokenResponse response =
        flow.newTokenRequest(authorizationCode).setRedirectUri(REDIRECT_URI).execute();
    return flow.createAndStoreCredential(response, null);
}

Credential credential = exchangeCode(authorizationCode);

Instantiate a BigQuery Client Object

Now that we have provided a mechanism to authorize access to the BigQuery API, let's instantiate a class Google Java API client to make calls to the BigQuery API. We can now import the Bigquery class from the BigQuery Java client library.

import com.google.api.services.bigquery.Bigquery;

The "builder" method of the Bigquery class creates an instance of a BigQuery API client. While creating a new Bigquery object, it's both a useful and compact design pattern to also call methods to handle HTTP requests and handle JSON object serialization.

private static final HttpTransport TRANSPORT = new NetHttpTransport();
private static final JsonFactory JSON_FACTORY = new JacksonFactory();

Bigquery bigquery = new Bigquery(TRANSPORT, JSON_FACTORY, credential);

Making Calls to the BigQuery API

At this point, our application has created a BigQuery client, and we have authorized our application to make calls to the BigQuery API. Let's start with a simple call to list all available datasets in our project. We first need to import additional classes that help us describe the data model of our request responses.

import com.google.api.services.bigquery.Bigquery.Datasets;
import com.google.api.services.bigquery.model.DatasetList;

With these classes imported, we can now use the bigquery.datasets().list() method to retrieve a list of BigQuery datasets available in the public sample data.

Our public sample dataset is stored in a project whose project ID is "publicdata." If you want the application to list the datasets contained in your project, you would use the project ID of that project instead.

public static void listDatasets(Bigquery bigquery, String projectId)
      throws IOException {
  Datasets.List datasetRequest = bigquery.datasets().list(projectId);
  DatasetList datasetList = datasetRequest.execute();
  if (datasetList.getDatasets() != null) {
    List datasets = datasetList.getDatasets();
    System.out.println("Available datasets:\n");
    for (DatasetList.Datasets dataset : datasets) {
      System.out.format("%s\n", dataset.getDatasetReference().getDatasetId());
    }
  }
}

Starting an Asynchronous Query Job and Polling for a Result

The BigQuery API provides an asynchronous model for making queries and returning the query result. In order to conform to this model, our sample application first "inserts" a query job, consisting of a SQL statement and a project ID.

Once the query Job has been successfully inserted, the application enters a loop, during which it polls the BigQuery API and check on the query job's status. If the status of the job is "PENDING" or "RUNNING," then the application waits for a short time, and polls BigQuery API again.

During our poll loop, if a call to the BigQuery API reveals that the status of the job is "DONE," we request the query result and display it to the user. In addition, the application displays the elapsed time during each polling event.

Insert a Query Job

Before we can use our authorized BigQuery client to insert query jobs, let's import some classes to help us configure our query job requests and make insertion calls to the API.

import com.google.api.services.bigquery.Bigquery.Jobs.Insert;
import com.google.api.services.bigquery.model.GetQueryResultsResponse;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.JobReference;

Let's start with the code for making an API call to insert a query job. We'll first set up everything we need for our query request, including information about our project and dataset, as well as the actual query statement. Then, using our authorized BigQuery service class, we'll make a call to Jobs.insert().

String querySql = "SELECT word, word_count FROM publicdata:samples.shakespeare LIMIT 10";

Job job = new Job();
JobConfiguration config = new JobConfiguration();
JobConfigurationQuery queryConfig = new JobConfigurationQuery();
config.setQuery(queryConfig);

job.setConfiguration(config);
queryConfig.setQuery(querySql);

Insert insert = bigquery.jobs().insert(projectId, job);
insert.setProjectId(projectId);
JobReference jobId = insert.execute().getJobReference();

System.out.format("\nJob ID of Query Job is: %s\n", jobId.getJobId());

Poll the BigQuery API for the Query Result

Now that we have inserted a query job, let's take a look at our polling loop.

To check on the status of our query job, we'll check the value of the Job.getStatus().getState() method. If our query job has not started running yet, or is currently being run, the status returns "PENDING" or "RUNNING", respectively.

If the query job has completed, the Job.getStatus().getState() method returns a value of "DONE." At this point, our method exits the polling loop and return a reference to the completed job ID.

long startTime = System.currentTimeMillis();
long elapsedTime = 0;

while (true) {
 Job pollJob = bigquery.jobs().get(projectId, jobId.getJobId()).execute();
 elapsedTime = System.currentTimeMillis() - startTime;
 System.out.format("Job status (%dms) %s: %s\n", elapsedTime,
   jobId.getJobId(), pollJob.getStatus().getState());
 if (pollJob.getStatus().getState().equals("DONE")) {
   return pollJob;
 }
 // Pause execution for one second before polling job status again
 Thread.sleep(1000);
}

Retrieve and Display Query Results

Now that our query job has completed, we can finally display the result set to the user. The Jobs.getQueryResults() method returns information about our result set, including a table containing the actual result values.

The data from our query result is returned in an list of "rows" represented by the TableRow class. Each TableRow object contains a list of TableCells, and each TableCell contains a single value. Let's import classes that help us retrieve values from our query response.

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableCell;

In order to display the complete result of our query, we iterate over the list of TableRows in the result set, and then iterate over the list of fields contained in each individual TableRow.

GetQueryResultsResponse queryResult = bigquery.jobs()
  .getQueryResults(
    projectId, completedJob
      .getJobReference()
      .getJobId()
    ).execute();
  List rows = queryResult.getRows();
  System.out.print("\nQuery Results:\n------------\n");
  for (TableRow row : rows) {
    for (TableCell field : row.getF()) {
      System.out.printf("%-50s", field.getV());
    }
  System.out.println();
}

Complete Source Code and Further Steps

The entire source code for this sample is available here. This sample is meant to describe the basic steps of getting started with using the Google BigQuery API and the Google Java API Client library. A more complete Java sample, covering all available API methods, is available here.

It's also possible to use the Google Java API Client library on other platforms, such as the Java runtime for Google App Engine. We'll cover this use case in a future article.

Authentication required

You need to be signed in with Google+ to do that.

Signing you in...

Google Developers needs your permission to do that.