Using Google BigQuery to learn from GitHub data
Blog post from Google Cloud
GitHub Archive, a project led by Ilya Grigorik, records, archives, and analyzes public GitHub activity, offering insight into trends such as programming language popularity and contribution metrics. GitHub hosts over 2.6 million public projects, and the archive captures more than 120,000 activities each day, including new commits, fork events, and ticket updates, each with detailed metadata. To analyze this dataset, Grigorik used Google BigQuery, a tool based on Google's internal Dremel system that runs fast queries over large-scale datasets using SQL-like syntax. Thanks to a collaboration between GitHub and BigQuery, the dataset is publicly available, so users can run their own analyses without the overhead of gathering the data or managing a database. Beyond trend analysis, the project also encourages participation through efforts like the GitHub Data Challenge.
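As a rough illustration of the SQL-style querying described above, the Python sketch below builds a query that counts one day's events by type. The table path `githubarchive.day.YYYYMMDD`, the `type` column, and both helper names are assumptions made for illustration and are not taken from the post. Running the query also assumes the `google-cloud-bigquery` client library is installed and Google Cloud credentials are configured.

```python
import re


def build_event_count_query(day: str, max_rows: int = 10) -> str:
    """Return SQL that counts one day's events by type (day is YYYYMMDD)."""
    # Table names cannot be passed as query parameters, so validate the
    # value before interpolating it into the SQL text.
    if not re.fullmatch(r"\d{8}", day):
        raise ValueError("day must look like YYYYMMDD")
    limit = int(max_rows)
    if limit < 1:
        raise ValueError("max_rows must be positive")
    # Assumed layout: one table per day in a public `githubarchive` project.
    return (
        "SELECT type, COUNT(*) AS events\n"
        f"FROM `githubarchive.day.{day}`\n"
        "GROUP BY type\n"
        "ORDER BY events DESC\n"
        f"LIMIT {limit}"
    )


def run_event_count(day: str, max_rows: int = 10):
    """Run the query and return (event type, count) pairs."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    rows = client.query(build_event_count_query(day, max_rows)).result()
    return [(row["type"], row["events"]) for row in rows]


if __name__ == "__main__":
    # Print the SQL only; call run_event_count(...) to execute it.
    print(build_event_count_query("20150101"))
```

Keeping the query builder separate from the client call means the SQL can be inspected or pasted into the BigQuery console before any billed query is run.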