Editor’s Note: This article is written by guest author Pradeep Kumar. Pradeep is a technical architect at redBus, an online travel agency in India which provides a unified online bus ticketing service. We published a business case study (PDF) for redBus and wanted to follow up with more technical detail for developers.
Our company has been providing Internet bus ticketing for India since 2006. There are more than 10,000 bus routes available for booking, and we have dozens of machines processing booking requests.
One of the most common problems that both startups and big organizations are trying to solve is making sense of ever-growing collections of data and translating them into a business advantage. This requires a data crunching platform where they can dump enormous amounts of data and analyze it. The advent of open source big data frameworks like Hadoop, Hive and Pig has enabled even small development teams to set up infrastructure capable of crunching terabytes of data and to build a data analysis framework around it.
We were faced with a similar need — we wanted to collect each and every event happening across our applications so we could correlate and analyze them as easily and efficiently as possible. We wanted to analyze customer activity to monitor which routes needed more buses, and whether decreases in customer bookings on specific routes were the result of server problems or simply lower demand. We also wanted to monitor our web application and server logs to identify other types of issues.
We explored solutions based on Hadoop, Hive and Pig, following recommendations from colleagues at other companies. We found they would work well, but would still require a considerable amount of customization on our end to make it simple for our engineering, MIS and support teams to query the data. Although the software was freely available, we would still have needed a couple of new engineers to manage the software and systems, and these systems would have given us fairly long query times (around 10-30 minutes) for 100GB of data. We wanted to be able to expand to more than 500GB of data and get results in less than a minute. Although this was theoretically achievable with Hadoop, Hive, Pig and other open source technologies, it wasn’t practical: it would have required considerable in-house expertise with these frameworks and huge up-front infrastructure investments, and it would have meant a significant deviation from our core expertise.
But at this point we remembered a discussion we had with a Googler about their BigQuery data analytics service and got in touch with them for trusted tester access.
How BigQuery works
In order to store and analyze your data in BigQuery, you first create a table. This process is similar to that of creating a table in a relational database, as you define a schema to describe your data when you create a BigQuery table. You then upload your data in CSV format and kick off the import process, with few limits on the amount of data that can be loaded into the table. We had a table which contained 2TB of data but still returned query results in under 30 seconds for most queries.
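As an illustration, a BigQuery table schema can be supplied as a JSON list of field definitions when creating the table or loading data. The field names below are made up for illustration, not our production schema:

```json
[
  {"name": "event_time",  "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "event_name",  "type": "STRING",  "mode": "REQUIRED"},
  {"name": "app_name",    "type": "STRING",  "mode": "NULLABLE"},
  {"name": "seats_shown", "type": "INTEGER", "mode": "NULLABLE"}
]
```

Once the table exists, each CSV upload just has to match this column layout.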
When we first started loading data into BigQuery, we didn’t know all of the different types of data we’d want to analyze. This was a challenge because BigQuery didn’t allow us to alter a table’s schema, and we didn’t want to create new tables every time we wanted to push something new.
Because of BigQuery’s columnar data storage, the number of columns in the table doesn’t have any effect on the data crunching performance. We ended up creating a table with thousands of columns, and whenever we came up with a new key to be tracked we simply mapped the key to one of the columns.
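A minimal sketch of that key-to-column mapping (the class and method names here are hypothetical, not our production code): each newly tracked key is assigned the next free generic column, and the assignment is remembered so a given key always lands in the same column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Maps arbitrary event keys onto a fixed pool of generic BigQuery columns
// (col_1 .. col_N), so tracking a new key never requires a schema change.
class ColumnMapper {
    private final Map<String, String> keyToColumn = new LinkedHashMap<>();
    private final int maxColumns;

    ColumnMapper(int maxColumns) {
        this.maxColumns = maxColumns;
    }

    // Returns the generic column assigned to this key, allocating one
    // the first time the key is seen.
    synchronized String columnFor(String key) {
        return keyToColumn.computeIfAbsent(key, k -> {
            if (keyToColumn.size() >= maxColumns) {
                throw new IllegalStateException("out of generic columns");
            }
            return "col_" + (keyToColumn.size() + 1);
        });
    }
}
```

A persistent store for the key-to-column assignments (rather than an in-memory map) would be needed so the mapping survives restarts.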
After only a few days of coding we had our application set up and running.
Client and sample use cases
After we did our initial data load, we needed to configure our services to push data in an automated process. Most of our apps are in Java or .NET so we wrote up two small clients which our applications could use to start pumping events into BigQuery.
One type of event we tracked was a customer search. In Java, the search would be stored in a HashMap like this:
Map<String, String> map = new HashMap<String, String>();
map.put(Keys.EVENT_TIME, String.valueOf(System.currentTimeMillis()));
map.put(Keys.EVENT_NAME, "BUS_SEARCH");
map.put(Keys.APP_LOCAL_IP, getLocalIp());
map.put(Keys.APP_NAME, "REDBUS_WEBSERVER");
map.put(Keys.SEATS_SHOWN, String.valueOf(seatsCount));
The event would then be queued for importing into BigQuery using the custom client.
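The custom client’s code isn’t reproduced here, but its shape can be sketched as follows. The names are hypothetical, and an in-memory queue stands in for the actual RabbitMQ publish:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Thin event client: applications call track() and carry on; a background
// publisher drains the queue and forwards events to the message broker.
class EventClient {
    private final BlockingQueue<Map<String, String>> queue =
            new LinkedBlockingQueue<>();

    // Non-blocking from the application's point of view: the event is
    // queued locally and published asynchronously.
    void track(Map<String, String> event) {
        queue.offer(event);
    }

    int pendingEvents() {
        return queue.size();
    }
}
```

Keeping the client this thin meant instrumenting an application was a one-line change wherever an event needed tracking.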
The custom client then serializes these events using a RabbitMQ client which pushes them to a central RabbitMQ server. The server transforms the events into a CSV file which is periodically uploaded to Google Cloud Storage and then loaded into BigQuery. We currently upload the queued events every ten minutes so we have the latest production data available for very fast analysis.
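The server-side CSV transformation can be sketched like this (a simplification with hypothetical names; production code would also need to handle the generic column mapping and file batching):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Flattens an event map into one CSV row, following a fixed column order
// that matches the BigQuery table schema. Missing keys become empty fields.
class CsvRowWriter {
    private final List<String> columnOrder;

    CsvRowWriter(List<String> columnOrder) {
        this.columnOrder = columnOrder;
    }

    String toCsvRow(Map<String, String> event) {
        return columnOrder.stream()
                .map(col -> escape(event.getOrDefault(col, "")))
                .collect(Collectors.joining(","));
    }

    // Minimal CSV escaping: quote fields containing commas or quotes.
    private static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}
```

Because the column order is fixed to the table schema, the resulting file can be loaded into BigQuery without any per-file schema information.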
After implementing tracking for all of our internal API calls, we then used the resulting BigQuery tables to build custom system monitoring dashboards.
Having all the data in one place meant we could correlate multiple metrics with ease.
Sample business queries
We have a large number of third-party data sources, so the speed of discovering new buses is essential to displaying accurate inventory on our site. The following query gives new buses discovered in a specified time period:
SELECT
  FORMAT_UTC_USEC(UTC_USEC_TO_DAY((event_time)*1000)) EventTime,
  lis_route_id, lis_vservice_id, travels_name
FROM yyy.proddata
WHERE app_name = 'LIS'
  AND event_name = 'LIS_NEW_ROUTE_DISCOVERY'
  AND (event_time)*1000 >= PARSE_UTC_USEC('2012-05-31 00:00:00')
  AND (event_time)*1000 < PARSE_UTC_USEC('2012-06-01 00:00:00')
The query below returns the number of times we ran out of buses on particular routes during a several-day period, giving us an idea of how well we are serving the demand:
SELECT
  FORMAT_UTC_USEC(UTC_USEC_TO_DAY((event_time)*1000)) EventTime,
  FORMAT_UTC_USEC(UTC_USEC_TO_DAY((date_of_journey)*1000)) doj,
  source, destination, count(*) Count
FROM yyyy.proddata
WHERE app_name = 'LIS'
  AND event_name = 'LIS_PUBLIC_SEARCH'
  AND number_of_routes = 0
  AND (event_time)*1000 >= PARSE_UTC_USEC('2012-05-25 00:00:00')
  AND (event_time)*1000 < PARSE_UTC_USEC('2012-05-28 00:00:00')
  AND (date_of_journey)*1000 >= PARSE_UTC_USEC('2012-05-29 00:00:00')
  AND (date_of_journey)*1000 < PARSE_UTC_USEC('2012-06-29 00:00:00')
GROUP BY EventTime, doj, source, destination
ORDER BY Count DESC
Getting to the root of problems used to be really time-consuming. By the time we figured it out, customers might have given up. Now if there are booking problems, BigQuery helps us understand the reason right away.
BigQuery helped us focus on deriving business and other insights from our data quickly by making big data crunching and analysis ridiculously simple. Queries that would have required a day to analyze on a Hadoop framework take less than 30 seconds on BigQuery. BigQuery also provides this speed and simplicity at a fraction of the cost of maintaining a complex Hadoop infrastructure.
If you want a big data setup that is ready for analysis in no time, go ahead and give BigQuery a try.