Using Google BigQuery to learn from GitHub data

Blog post from Google Cloud

Post Details
Company
Date Published
Author
-
Word Count: 496
Language: English
Summary

GitHub Archive, a project led by Ilya Grigorik, records, archives, and analyzes public GitHub activity, offering insight into trends such as programming language popularity and contribution metrics. GitHub hosts over 2.6 million public projects, and the archive captures more than 120,000 public activities each day, including new commits, fork events, and ticket updates, each with detailed metadata. To analyze a dataset of this size, Grigorik used Google BigQuery, a service based on Google's internal Dremel system that runs fast queries over very large datasets using SQL-like syntax. Through a collaboration between GitHub and BigQuery, the dataset is publicly available, so users can run their own analyses without having to gather the data or manage a database. Beyond trend analysis, the project also encourages participation through efforts such as the GitHub Data Challenge.
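To give a feel for the kind of SQL-like aggregation described above, here is a minimal sketch that counts push events per programming language. It does not connect to BigQuery: an in-memory SQLite database stands in for it so the example runs anywhere. The `events` table, its `type` and `language` columns, and the sample rows are all hypothetical and do not reflect the actual GitHub Archive schema.

```python
import sqlite3

# Hypothetical sample rows standing in for archived public GitHub events.
# Each tuple is (event type, repository language); the real dataset carries
# far more metadata and uses its own schema.
SAMPLE_EVENTS = [
    ("PushEvent", "Python"),
    ("PushEvent", "JavaScript"),
    ("ForkEvent", "Python"),
    ("PushEvent", "Python"),
    ("IssuesEvent", "Go"),
    ("PushEvent", "JavaScript"),
    ("PushEvent", "Python"),
]

# A language-popularity query: count push events per language.
# Table and column names are assumptions made for this illustration.
QUERY = """
SELECT language, COUNT(*) AS pushes
FROM events
WHERE type = 'PushEvent'
GROUP BY language
ORDER BY pushes DESC, language ASC
"""


def pushes_by_language(rows):
    """Load rows into an in-memory SQLite table and run the aggregation."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (type TEXT, language TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for language, pushes in pushes_by_language(SAMPLE_EVENTS):
        print(f"{language}: {pushes}")
```

Against the public dataset, a query of the same shape would be submitted to BigQuery itself rather than to a local database, with table and field names taken from the published schema.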