Introducing the CodeSearchNet challenge

Post Details

Company

GitHub

Date Published

Sept. 26, 2019

Author

Hamel Husain

Word Count

660

Language

English

Hacker News Points

-

Source URL

github.blog/engineering/infrastructure/introducing-the-codesearchnet-challenge

Summary

Researchers have introduced the CodeSearchNet Challenge to advance code search with machine learning, motivated by the lack of standardized benchmarks for evaluating such models. In collaboration with Weights & Biases, they have released a large dataset, the CodeSearchNet Corpus, containing millions of functions with associated documentation across Go, Java, JavaScript, PHP, Python, and Ruby. The dataset pairs code with natural language descriptions to support the development of machine learning models for code search, including modern Transformer architectures based on self-attention. An evaluation environment and leaderboard have been established around a set of code search queries whose results were annotated for relevance by programmers, data scientists, and researchers. The project, supported by contributors from Microsoft Research, GitHub, and the broader community, aims to expand its evaluation dataset with more languages and queries in future iterations, and presents code search as one of many potential applications of learned representations of code and natural language.
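To make the idea of pairing code with natural language descriptions concrete, the following is a minimal sketch of how (function, documentation) pairs could be pulled out of Python source using the standard library `ast` module. It is an illustration only, not the pipeline used to build the CodeSearchNet Corpus; the helper name `extract_pairs` and the `SAMPLE` source are hypothetical.

```python
import ast


def extract_pairs(source: str):
    """Return (name, docstring, code) triples for every documented function.

    Hypothetical helper for illustration; not part of CodeSearchNet.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # skip functions that have no documentation
                code = ast.get_source_segment(source, node)
                pairs.append((node.name, doc, code))
    return pairs


# Made-up sample input: one documented and one undocumented function.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b


def undocumented(x):
    return x
'''

pairs = extract_pairs(SAMPLE)
for name, doc, code in pairs:
    print(name, "->", doc)
```

Only documented functions are kept, since an undocumented function offers no natural language text to pair with its code.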