Thursday, May 3, 2012

Using Google BigQuery to learn from GitHub data

Author Photo
By Ilya Grigorik, Web Performance Engineer

Open-source developers all over the world contribute to millions of projects every day: writing and reviewing code, filing and discussing bug reports, updating documentation and project wikis, and so forth. The data generated from this activity can reveal interesting trends across many industries, including popularity of programming languages over time, defect rates, contribution metrics, and popularity of specific frameworks and libraries.

The challenge in extracting these trends is gathering the data. Each project has its own distributed workflow, code repositories, and conventions. Having hosted dozens of my own projects on GitHub, I've long wanted to analyze the developer activity from the 2.6M+ public projects hosted on GitHub. Hence, earlier this year GitHub Archive was born!

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. Each day it archives over 120,000 public activities, ranging from new commits and fork events to opening and closing tickets, each with detailed metadata.

Once I collected the data, I needed a tool to analyze it, and that is when I found Google BigQuery. Based on the research behind Dremel, a popular internal tool at Google for analyzing web-scale datasets, BigQuery allowed me to easily import the entire dataset and use a familiar SQL like syntax to comb through the gigabytes of data in seconds. Plus the tool will scale to terabyte datasets, so there is plenty of room to grow!

The best news is that thanks to collaboration from the GitHub and BigQuery teams, the GitHub dataset is now public and available for you to slice and dice in any way you like. No need to worry about data gathering or database schemas: BigQuery will do all the heavy lifting, and you can just compose your queries to be executed in realtime.

Here's a real-world example. What are the most popular programming languages on GitHub over the past month?


chart showing number of commits by language

If you are curious for more, sign up for BigQuery and follow the instructions on githubarchive.org to access the GitHub dataset. You can use the free 100GB query quota to run your analysis and perhaps even win some of the prizes from the GitHub Data Challenge!


Ilya Grigorik is a Web Performance Engineer and Advocate at Google, an open-source evangelist, and an analytics geek. You can find him on GitHub under igrigorik, and blogging about web performance at igvita.com.

Posted by Scott Knaster, Editor

No comments:

Post a Comment