Building an End-to-End Deep Learning GitHub Discovery Feed
Blog post from Stream
A data scientist at Stream built a GitHub repository recommendation system using deep learning and big data tooling, chiefly Dask and PyTorch. The project processed over 600 million events from the GitHub Archive, a public dataset of activity across a huge number of repositories (private information is excluded). Since GitHub has no explicit ratings, the recommendations rely on implicit feedback derived from the various event types.

The modeling approach paired a neural matrix factorization model with sequence-based models to account for users' diverse tastes. The pipeline covered downloading the raw archive data, converting it into a training-ready format, and fitting a recommendation model that can produce real-time suggestions from a user's past interactions.

The model was trained with Bayesian Personalized Ranking (BPR) loss. To keep response times low in production, the system separates efficient candidate generation from ranking, eventually adopting Spotify's Annoy library for approximate nearest neighbor search. The finished recommender was deployed behind a Django REST Framework API, serving recommendations based on each user's recent GitHub activity and demonstrating how large-scale data processing and machine learning can improve user experience.
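The core training objective can be sketched in a few lines. The snippet below is a minimal matrix factorization trained with the BPR loss in plain Python rather than the PyTorch used in the post; the function names (`train_bpr`, `score`) and all hyperparameters are illustrative assumptions, not the post's actual code. BPR works on implicit feedback: for each observed (user, item) pair it samples an item the user has not interacted with and pushes the observed item's score above the sampled one's.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_bpr(interactions, n_users, n_items, dim=8, lr=0.05, reg=0.01, epochs=300):
    """Learn user/item latent factors with the BPR objective:
    minimize -ln sigmoid(x_ui - x_uj) over (user u, positive i, negative j)."""
    U = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_items)]
    seen = {}
    for u, i in interactions:
        seen.setdefault(u, set()).add(i)
    for _ in range(epochs):
        for u, i in interactions:
            # Sample a negative item the user has not interacted with.
            j = random.randrange(n_items)
            while j in seen[u]:
                j = random.randrange(n_items)
            x_ui = sum(U[u][k] * V[i][k] for k in range(dim))
            x_uj = sum(U[u][k] * V[j][k] for k in range(dim))
            # Gradient scale of -ln sigmoid(x_ui - x_uj).
            g = sigmoid(x_uj - x_ui)
            for k in range(dim):
                du = g * (V[i][k] - V[j][k]) - reg * U[u][k]
                di = g * U[u][k] - reg * V[i][k]
                dj = -g * U[u][k] - reg * V[j][k]
                U[u][k] += lr * du
                V[i][k] += lr * di
                V[j][k] += lr * dj
    return U, V

def score(U, V, u, i):
    """Predicted affinity of user u for item i (dot product of factors)."""
    return sum(U[u][k] * V[i][k] for k in range(len(U[u])))
```

The same pairwise objective carries over directly to a neural matrix factorization model in PyTorch, where the dot product is replaced or augmented by learned layers and the gradients come from autograd.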
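The candidate-generation step can be illustrated with a brute-force nearest-neighbor lookup. The post uses Spotify's Annoy to make this approximate and fast at scale, but the underlying idea is the same: find the item vectors closest to a user's vector and drop items the user has already interacted with. The helper names (`cosine`, `top_k_candidates`) here are hypothetical, not from the post.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_candidates(user_vec, item_vecs, k=10, exclude=()):
    """Return ids of the k items most similar to the user vector,
    skipping items the user has already interacted with."""
    scored = ((cosine(user_vec, v), i)
              for i, v in enumerate(item_vecs) if i not in exclude)
    return [i for _, i in heapq.nlargest(k, scored)]
```

In production this exact scan would be replaced by an Annoy index built over the item embeddings, trading a little accuracy for sublinear query time; the shortlist it returns is then passed to the heavier ranking model.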