Company
Date Published
Author
Pete Cheslock
Word count
1103
Language
English
Hacker News points
None

Summary

The author gained access to a large dataset of GitHub users and projects through the GHTorrent project, an effort to build an offline copy of all data available through the GitHub APIs. The data is available for free on Google BigQuery, and after the author uploaded it to Amazon S3, the CHAOSSEARCH platform quickly indexed it, allowing the author to analyze and visualize trends and patterns. Looking at user creation across the entire dataset, the author found fairly steady growth with some large spikes in new accounts, especially toward the end of the year. Most users did not fill out their location details, but those who did were mostly from the US, India, and China. Graphing user growth over time for these countries showed that user creation started growing around 2011-2012. Within the US, Massachusetts ranked low, which the author attributes to its high number of large enterprises with limited software development activity, while cities like San Francisco and Austin were more popular among GitHub users. Using prefix and postfix wildcard queries on the CHAOSSEARCH platform, the author found both of their own user accounts within seconds. The next post will continue analyzing the dataset to learn more about the projects these users are creating on GitHub.
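
The wildcard search mentioned above can be illustrated with a minimal sketch. This is not the CHAOSSEARCH query syntax or API; it uses Python's standard `fnmatch` module as a stand-in, and the `logins` list and the `wildcard_search` helper are hypothetical names invented for this example. It shows one pattern with the wildcard at the end of a known fragment and one with the wildcard at the start.

```python
from fnmatch import fnmatchcase

# Hypothetical sample of GitHub logins; not taken from the GHTorrent dataset.
logins = ["pete", "petecheslock", "cheslock-dev", "octocat", "mrpete"]


def wildcard_search(values, pattern):
    """Return the values matching a shell-style wildcard pattern (case-sensitive)."""
    return [value for value in values if fnmatchcase(value, pattern)]


# Wildcard at the end: logins that start with "pete".
prefix_hits = wildcard_search(logins, "pete*")

# Wildcard at the start: logins that end with "cheslock".
postfix_hits = wildcard_search(logins, "*cheslock")

print(prefix_hits)
print(postfix_hits)
```

Running both patterns is useful when only part of a login is remembered: a fragment at the start and a fragment at the end can each surface a different account. A real search platform would evaluate such patterns against an index rather than scanning a list in memory.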