Monday, 1 October 2012

Got big JSON? BigQuery expands data import for large scale web apps

By Ryan Boyd, Developer Advocate

JSON is the data format of the web. It powers most modern websites, is a native format for many NoSQL databases hosting top web applications, and is the primary data format in many REST APIs. Google BigQuery, our cloud service for ad-hoc analytics on big data, now supports JSON and the nested/repeated structure inherent in the format.

JSON opens the door to a more object-oriented view of your data compared to CSV, the original data format supported by BigQuery. It removes the duplication of data that is required when you flatten records into CSV. Here are some examples of data for which you might find JSON useful:
  • Log files, with multiple headers and other name-value pairs.
  • User session activities, with information about each activity occurring nested beneath the session record.
  • Sensor data, with variable attributes collected in each measurement.
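
To make the duplication point concrete, here is a minimal sketch in Python that writes the same user session once as a single nested JSON line and once as flattened CSV rows. The record, its field names (`session_id`, `user`, `activities`), and the helper functions are all hypothetical, and the one-object-per-line layout is an assumption about the file format rather than something this post specifies.

```python
import csv
import io
import json

# Hypothetical session record; the field names are illustrative only.
session = {
    "session_id": "s-001",
    "user": "alice",
    "activities": [
        {"action": "login", "ts": 1349049600},
        {"action": "search", "ts": 1349049660},
        {"action": "logout", "ts": 1349049900},
    ],
}


def to_json_line(record):
    """Serialize one record as a single line of JSON, keeping the nesting."""
    return json.dumps(record, sort_keys=True)


def to_csv_rows(record):
    """Flatten into one row per activity, repeating the session-level fields."""
    return [
        [record["session_id"], record["user"], activity["action"], activity["ts"]]
        for activity in record["activities"]
    ]


def to_csv_text(record):
    """Render the flattened rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(to_csv_rows(record))
    return buf.getvalue()


json_line = to_json_line(session)
csv_text = to_csv_text(session)

print(json_line)  # session fields appear once, activities stay nested
print(csv_text)   # session fields are repeated on every activity row
```

In the JSON form the session-level fields appear once and the activities stay grouped beneath them. In the CSV form every activity row has to carry its own copy of the session fields.
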
Nested/repeated data support is one of our most requested features. And while BigQuery's underlying infrastructure supports it, we'd only enabled it in a limited fashion through M-Lab's test data. Today, however, developers can use JSON to get any nested/repeated data into and out of BigQuery.

For more information on importing JSON and nested/repeated data into BigQuery, check out the new guide in our documentation. You should also see the Dealing with Data section for details on the new querying syntax available for this type of data.

Improvements to Data Loading Pipeline

We’ve made it much easier to ingest data into BigQuery – up to 1TB of data per load job, with each file up to 100GB uncompressed JSON or CSV. We’ve also eliminated the 2 imports per minute rate limit, enabling you to submit all your ingestion jobs and let us handle the queuing as necessary. In a recent project I’ve been working on, import jobs for 3TB of data that previously took me 12 hours to run now take me only 36 minutes – a 20x improvement!
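
As a rough illustration of working within these limits, the sketch below groups a list of files into load jobs so that no file exceeds 100GB and no job exceeds 1TB. It is only a planning helper under stated assumptions: the function name `plan_load_jobs`, the file names, and the use of decimal units (1GB = 10^9 bytes) are mine, and it does not call BigQuery at all. It also checks the arithmetic behind the 20x figure.

```python
GB = 10**9   # assumption: decimal units; the post does not say which it means
TB = 10**12

MAX_FILE_BYTES = 100 * GB  # per-file limit quoted above (uncompressed)
MAX_JOB_BYTES = 1 * TB     # per-load-job limit quoted above


def plan_load_jobs(files):
    """Group (name, size_in_bytes) pairs into batches within the per-job limit.

    Simple greedy pass in input order; it does not try to minimize job count.
    """
    jobs = []
    current, current_bytes = [], 0
    for name, size in files:
        if size > MAX_FILE_BYTES:
            raise ValueError(f"{name} exceeds the per-file limit")
        # Start a new job when adding this file would exceed the job limit.
        if current and current_bytes + size > MAX_JOB_BYTES:
            jobs.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        jobs.append(current)
    return jobs


# Hypothetical 3TB dataset split into thirty 100GB files.
files = [(f"part-{i:03d}.json", 100 * GB) for i in range(30)]
jobs = plan_load_jobs(files)
print(len(jobs), "load jobs")

# The speedup quoted above: 12 hours down to 36 minutes.
speedup = (12 * 60) / 36
print(speedup)
```

With the rate limit gone, all of the planned jobs could be submitted at once and left to queue, rather than being paced by the client.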

We’ve published a new Ingestion Cookbook that explains how to take advantage of these new limits.

We’re initiating a small trusted tester program aimed at making it easier to move your data from the App Engine Datastore to BigQuery for analysis. If you store a lot of data in Datastore and are also using BigQuery, we’d like to hear from you. Please sign up now to be considered for the trusted tester program.

Learn more this week

Michael Manoochehri, Siddartha Naidu and I are in London this week talking about BigQuery and these new features at the Strata big data conference. Ju-kay Kwek will also be talking about BigQuery at the Interop NYC conference tomorrow. Please stop by, say hi, and let us know what you’re doing with big data.

We’ll also be producing a Google Developers Live session from Campus London on Friday at 16:00 BST (15:00 GMT).


Ryan Boyd is a Developer Advocate, focused on big data. He's been at Google for 6 years and previously helped build out the Google Apps ISV ecosystem. He published his first book, "Getting Started with OAuth 2.0", with O'Reilly.

Posted by Scott Knaster, Editor
