C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Blog post from GitHub

Post Details
Company: GitHub
Author: Kavita Ganesan
Word Count: 1,660
Language: English
Summary

GitHub hosts code written in more than 300 programming languages, so reliable language recognition matters for search, security alerts, and syntax highlighting. File extensions alone are an unreliable signal because they can be ambiguous, inconsistent, or missing. GitHub initially relied on Linguist, a Ruby-based tool that uses naming conventions and heuristics; it achieves 84% accuracy but struggles with non-standard naming conventions and missing extensions. To improve detection, GitHub developed OctoLingua, a machine learning classifier built on an Artificial Neural Network (ANN) architecture that outperforms Linguist by prioritizing code vocabulary over file extensions. OctoLingua is trained on data from Rosetta Code and a set of quality repositories, using features such as special characters and tokens. It classifies programming languages more robustly across scenarios, including when file extensions are scrambled or missing. The goal is a reliable source code language detection service that works at multiple levels of granularity, with future plans to expand language support and potentially open-source the model to engage the wider community.
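
The idea of classifying a file by its code vocabulary instead of its extension can be sketched in a few lines of Python. This is an illustration only, not OctoLingua: it swaps the ANN for a much simpler multinomial naive Bayes model, and the tokenizer, the `VocabularyClassifier` class, and the four training snippets are all hypothetical stand-ins for the Rosetta Code and repository data mentioned above.

```python
import math
import re
from collections import Counter, defaultdict

# Identifier-like words and single special characters become tokens.
# Numeric literals and whitespace are skipped.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z0-9_]")


def tokenize(code):
    return TOKEN_RE.findall(code)


class VocabularyClassifier:
    """Multinomial naive Bayes over code tokens (a stand-in, not OctoLingua's ANN)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token frequencies
        self.doc_counts = Counter()               # language -> number of training snippets
        self.vocabulary = set()

    def train(self, samples):
        for language, code in samples:
            tokens = tokenize(code)
            self.token_counts[language].update(tokens)
            self.doc_counts[language] += 1
            self.vocabulary.update(tokens)

    def predict(self, code):
        tokens = tokenize(code)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, -math.inf
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(self.doc_counts[language] / total_docs)
            for token in tokens:
                # Laplace smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Hypothetical toy training data; a real model needs far more examples.
TRAINING_SAMPLES = [
    ("Python", "def add(a, b):\n    return a + b\n"),
    ("Python", "import sys\nfor line in sys.stdin:\n    print(line)\n"),
    ("Java", "public class Main {\n    public static void main(String[] args) {\n"
             "        System.out.println(1);\n    }\n}\n"),
    ("Java", "public int add(int a, int b) {\n    return a + b;\n}\n"),
]

classifier = VocabularyClassifier()
classifier.train(TRAINING_SAMPLES)
```

Because `predict` only ever sees file contents, a snippet such as `classifier.predict("def greet(name):\n    print(name)\n")` is classified the same way whether its file extension is correct, scrambled, or absent, which is the robustness property the summary attributes to a vocabulary-based approach.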