Thursday, 14 March 2013
Facebook StumbleUpon Twitter Google+ Pin It

BigQuery gets big new features to make data analysis even easier

Author Photo
By Michael Manoochehri, Developer Programs Engineer, Cloud Platform

Google BigQuery is designed to make it easy to analyze large amounts of data quickly. Overwhelmingly, developers have asked us for features to help simplify their work even further. Today we are launching a collection of updates that gives BigQuery a greater range of query and data types, more flexibility with table structure, and better tools for collaborative analysis.

Big JOIN and Big Group Aggregations

Extracting insights from multiple datasets can be challenging and time-consuming. This is especially true when datasets become too large to query with a traditional database system. With traditional databases, SQL functions like joining and grouping are typically used to bring together data for analysis. What happens when your data is too large to fit into a conventional database? Working with multi-terabyte datasets often requires developing complicated MapReduce workflows, investing in expensive infrastructure, and great deal of time. Very often, it's a combination of all three.

In response to developer feedback, we're launching new features that enable analysts and developers to run fast SQL-like join and aggregate queries on datasets without the need for batch-based processing. Our new Big JOIN feature gives users the ability to produce a result set by merging data from two large tables by a common key. Big JOIN simplifies data analysis that would otherwise require a data transformation step, by allowing users to specify JOIN operations using SQL.

Popular web applications produce user activity logs that can grow by billions of rows each week. Dividing users into smaller groups is a key step for analysis. However, each group of users can number in the millions. To handle this for such large volumes, we've enabled Big Group Aggregations, which significantly increases the number of distinct values that can be grouped in a result set.

To use these new features, simply add the EACH modifier to JOIN or GROUP BY clauses.


/* JOIN EACH example
* Selects the top 10 most edited Wikipedia pages
* of words that appear in works of Shakespeare.
*/

SELECT
 TOP(wiki.title, 10), COUNT(*)
FROM
 [publicdata:samples.wikipedia] AS wiki
JOIN EACH
 [publicdata:samples.shakespeare] AS shakespeare
ON
 shakespeare.word = wiki.title;


For more information, including best practices, when using JOIN EACH and GROUP EACH BY, visit the BigQuery query reference.

Native support for TIMESTAMP data type

We are also adding a new TIMESTAMP data type, in response to one of our most frequent requests from developers. This new data type lets you import date and time values in formats familiar to users of databases such as MySQL, while preserving timezone offset information.

Along with the new data type come new functions for converting TIMESTAMP fields into other formats, calculating intervals, and extracting components such as the hour, day of week, and quarter.


/* TIMESTAMP example
* Which hours in the day are the most popular for GitHub actions?
* This query converts github_timeline "created_at" date time   
* strings to BigQuery TIMESTAMP, and extracts the hour from each.
*/

SELECT
 HOUR(TIMESTAMP(created_at)) AS event_create_hour,
 COUNT(*) AS event_count
FROM
 [publicdata:samples.github_timeline]
GROUP BY
 event_create_hour
ORDER BY
 event_count DESC;


Read more about the available TIMESTAMP functions in our query reference guide.

Add columns to existing BigQuery tables

When working with large amounts of fast moving data, it's not uncommon to find out that you need to add additional fields to your tables. In response to developer feedback, we have added the ability to add new columns to existing BigQuery tables.

To take advantage of this feature, simply provide a new schema with additional columns using either the "Tables: update" or "Tables: patch" BigQuery API methods.

For more information on this feature, visit the BigQuery API reference.

BigQuery Web UI: Dataset links and dataset sharing notifications

BigQuery has always provided project owners with very fine-grained control of how their datasets are shared. To make it easier for teams to work on collaborative data analysis, we've added direct links to individual datasets in the BigQuery Web UI. This provides a convenient way for authorized users to quickly access a dataset, and allows for bookmarking and sharing.

In addition, we've also added email notifications to inform users when they've been given dataset access privileges. When a dataset has been shared with another user via the sharing control panel, BigQuery sends a notification email containing a direct link to the dataset.


The BigQuery UI features a collection of public datasets for you to use when trying out these new features. To get started, visit our sign up page and Quick Start guide. You should take a look at our API docs, and ask questions about BigQuery development on Stack Overflow. Finally, don't forget to give us feedback and join the discussion on our Cloud Platform Developers Google+ page.


Michael Manoochehri is a Developer Programs Engineer supporting the Google Cloud Platform. His goal is to help make cloud computing and data analysis universally accessible and useful.

Posted by Scott Knaster, Editor

No comments: