Nick Mihailovski, Google Analytics API Team – October 2009
This article discusses how to detect and backfill missing time series values in data returned from the Google Analytics Data Export API.
Before You Begin
The article assumes you know how the Google Analytics Data Export API works. The sample code is in Java but you can use the concepts in your language of choice. The code for this article is provided as open source and can be downloaded from project hosting.
After reading this article, you will learn:
- How the Google Analytics Data Export API treats date dimensions.
- How to structure your queries to group results and detect missing dates.
- How to fill in the missing values using Java.
Introduction
Comparing data across a period of time provides context.
For example, stating a website generated $1 million in revenue
doesn’t mean much. But stating a website increased revenue by 10x quarter
over quarter or year over year is indeed impressive. With the Google
Analytics API, it’s easy to plot data over time by using the
ga:date
, ga:day
and ga:month
dimensions.
If your query only uses a date dimension, if any days in the date range collected zero
data, the Google Analytics API will backfill dates and 0
values for metrics.
ga:date | ga:sessions |
---|---|
2010-03-01 | 101 |
2010-03-02 | 0 |
2010-03-03 | 69 |
However it becomes tricky if you query date along with other dimensions. If one of the dates has no data, the API will NOT return an entry for that date. It will just skip to the next available date that contains data.
ga:keyword | ga:date | ga:sessions |
---|---|---|
chair | 2010-03-01 | 55 |
chair | 2010-03-03 | 48 |
Ideally analysts would want to have the missing dates for a particular keyword filled in like the first example above
This article describe some best practices for pragmatically backfilling data.
Background
Let's first look at why this problem exists. There are 2 reasons.
- Google Analytics only processes data that is collected. If nobody came to a site on a particular day, then there is no data to process, so no data is returned.
- It's very difficult to determine how many additional dimensions and what values should be used for dates that have no data.
So instead of trying to define one process to rule them all, the Google Analytics API leaves the exercise of filling in data for queries that have multiple dimensions up to the developer. Lucky You :)
Program Overview
Here are the steps to backfill the data in the chart above.
- Modify the query to ensure dimensions are sorted opportunistically.
- Determine the expected dates from the date range.
- Iterate and backfill any missing dates.
- Fill in any remaining missing values.
Modify The Query
To backfill dates, we need to make sure the data returned from the API
is in a format that makes it easy to detect when a date is missing.
Here’s an example query to retrieve both ga:keyword
and ga:date
for the first 5 days in March:
DataQuery dataQuery = new DataQuery(new URL(BASE_URL)); dataQuery.setIds(TABLE_ID); dataQuery.setStartDate("2010-03-01"); dataQuery.setEndDate("2010-03-05"); dataQuery.setDimensions("ga:keyword,ga:date"); dataQuery.setMetrics("ga:entrances");
Once the query is sent to the API, the results will contain a list
of DataEntry
objects. Each entry object represents a row of
data and includes names and values for dimensions/metrics. Since no
sort parameter was used, the results are returned in an arbitrary order.
ga:keyword | ga:date | ga:entrances |
---|---|---|
chair | 2010-03-04 | 14 |
chair | 2010-03-01 | 23 |
table | 2010-03-04 | 18 |
table | 2010-03-02 | 24 |
chair | 2010-03-03 | 13 |
To make it easy to identify which dates are missing, we need to first group all dimensions together. This can be done by setting the sort parameter of the query to the dimensions used in the original query.
dataQuery.setSort("ga:keyword,ga:date");
Adding the sort parameter will make the API return the results in the desired order.
ga:keyword | ga:date | ga:entrances |
---|---|---|
chair | 2010-03-01 | 23 |
chair | 2010-03-03 | 13 |
chair | 2010-03-04 | 14 |
table | 2010-03-02 | 24 |
table | 2010-03-04 | 18 |
The second step is to make sure that for every dimension, all dates are returned
in ascending order. While the Google Analytics API provides a number of date
dimensions, only ga:date
can be sorted accurately across date
boundaries (i.e., days, months, years). So if you want to backfill dates, make sure
your query uses the ga:date
dimension in both
the dimensions and sort query parameters.
Once the sorted query is executed, all of the same landing pages will be returned next to each other and the dates will be in sequential order. The list of dates for a single landing page can be thought of as a time series and because they are in order, it is much easier to identify missing dates.
Determine Expected Dates
To detect missing dates, we need to compare the actual dates returned from the API to the expected dates in every time series. We can figure out what is expected by:
- Determining the expected start date from the API query.
- Counting the number expected days in the query date range.
Both values can be used together to determine each expected date by incrementing the start date by 1 for each day in the date range.
Determining the Expected Start Date
We can use the start-date
query parameter as the expected start date
of the series. Because the date format returned in the API response yyyyMMdd
is different than the format of the query parameter yyyy-MM-dd
,
we need to first convert the date format before we can use it.
The setExpectedStartDate
method converts the formats of the dates.
private static SimpleDateFormat queryDateFormat = new SimpleDateFormat("yyyy-MM-dd"); private static SimpleDateFormat resultDateFormat = new SimpleDateFormat("yyyyMMdd"); public void setExpectedStartDate(String startDate) { try { calendar.setTime(queryDateFormat.parse(startDate)); expectedStartDate = resultDateFormat.format(calendar.getTime()); } catch (ParseException e) { handleException(e); } }
Counting the Number of Expected Days
To get the number of days in the date range, the program parses the start and
end dates into Java Date
objects. Then uses a Calendar
object to figure out the time between both dates. One day is added to the difference
in dates to make the count inclusive.
private static final long millisInDay = 24 * 60 * 60 * 1000; public void setNumberOfDays(DataQuery dataQuery) { long startDay = 0; long endDay = 0; try { calendar.setTime(queryDateFormat.parse(dataQuery.getStartDate())); startDay = calendar.getTimeInMillis() / millisInDay; calendar.setTime(queryDateFormat.parse(dataQuery.getEndDate())); endDay = calendar.getTimeInMillis() / millisInDay; } catch (ParseException e) { handleException(e); } numberOfDays = (int) (endDay - startDay + 1); }
Now we have all the data we need to figure out which dates are missing.
Identify Each Time Series in the Results
Once the query is executed, the program goes through each
DataEntry
object in the API response. Because the query was initially sorted,
the response will have a partial time series for each keyword. So we need
to find the start of each time series, then go through each date and fill
in missing data not returned by the API.
This program uses the dimensionValue
and tmpDimensionValue
variables to detect the start of each series.
Here is the entire code to handle the response. Filling in missing data is discussed below.
public void printBackfilledResults(DataFeed dataFeed) { String expectedDate = ""; String dimensionValue = ""; List<Integer> row = null; for (DataEntry entry : dataFeed.getEntries()) { String tmpDimValue = entry.getDimensions().get(0).getValue(); // Detect beginning of a series. if (!tmpDimValue.equals(dimensionValue)) { if (row != null) { forwardFillRow(row); printRow(dimensionValue, row); } // Create a new row. row = new ArrayList<Integer>(numberOfDays); dimensionValue = tmpDimValue; expectedDate = expectedStartDate; } // Backfill row. String foundDate = entry.getDimension("ga:date").getValue(); if (!foundDate.equals(expectedDate)) { backFillRow(expectedDate, foundDate, row); } // Handle the data. Metric metric = entry.getMetrics().get(0); row.add(new Integer(metric.getValue())); expectedDate = getNextDate(foundDate); } // Handle the last row. if (row != null) { forwardFillRow(row); printRow(dimensionValue, row); } }
Backfill Any Missing Dates
For each entry in a series, the program stores the metric values (entrances) in an
ArrayList
called row
. When a new time series is detected, a new row is created and
the expected date is set to the expected start date.
Then for each entry, the program checks to see if the date value in the entry equals the expected date. If they are equal, the metric in the entry is added to the row. Otherwise, the program has detected missing dates which need to be backfilled.
The backfillRow
method handles backfilling data. It accepts
as parameters the expected and found dates as well as the current row.
It then determines the number of days between the two dates
(non-inclusive) and adds that many 0s to the row.
public void backFillRow(String startDate, String endDate, List<Integer> row) { long d1 = 0; long d2 = 0; try { calendar.setTime(resultDateFormat.parse(startDate)); d1 = calendar.getTimeInMillis() / millisInDay; calendar.setTime(resultDateFormat.parse(endDate)); d2 = calendar.getTimeInMillis() / millisInDay; } catch (ParseException e) { handleException(e); } long differenceInDays = d2 - d1; if (differenceInDays > 0) { for (int i = 0; i < differenceInDays; i++) { row.add(0); } } }
When the method is done, the row has been backfilled with data and the
current data can be added. The expected date is then incremented to one day after
the found date using the getNextDate
method.
public String getNextDate(String initialDate) { try { calendar.setTime(resultDateFormat.parse(initialDate)); calendar.add(Calendar.DATE, 1); return resultDateFormat.format(calendar.getTime()); } catch (ParseException e) { handleException(e); } return ""; }
Fill In Any Remaining Values
Once the series data has been processed into a row
,
we have to check that there are no more missing dates at the end of the series.
The forwardFillRow
method simply calculates the difference between the
number of days in the original query to the current size of the row, and adds that
many 0s to the end of the row.
public void forwardFillRow(List<Integer> row) { int remainingElements = numberOfDays - row.size(); if (remainingElements > 0) { for (int i = 0; i < remainingElements; i++) { row.add(0); } } }
At this point, the program has filled in any missing values in the time series. Now that we have all the data, the program prints the dimension and metric values as a comma separated list.
Conclusion
Using this sample, you can easily backfill data on dates not returned by the API. As mentioned above, this solution can be adapted to any programming language. Developers can even adapt these techniques and apply them to handle multiple dimensions and multiple metrics. Now it's even easier than ever to start doing advanced analysis on times series returned by the Google Analytics API.