GitHub Copilot research recitation

Post Details

Company

GitHub

Date Published

June 30, 2021

Author

Albert Ziegler

Word Count

2,124

Language

English

Source URL

github.blog/ai-and-ml/github-copilot/github-copilot-research-recitation

Summary

This post evaluates how often GitHub Copilot, an AI tool trained on a large body of public code, suggests snippets copied verbatim from its training data. In an internal trial, nearly 300 GitHub employees used the tool in their daily work. Most of Copilot's suggestions were unique. A small percentage were direct recitations from the training set, and these appeared mainly in generic contexts or near the start of a new file. Recitations occurred roughly once every ten user weeks and mostly involved text that appears many times in public repositories, such as the GNU General Public License. The post proposes a UI feature that flags when a suggestion is a direct quote, so that users can decide on appropriate attribution. Planned work aims to reduce the recitation rate and make detection more precise, with the improvements intended for GitHub Copilot's technical preview.
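To make the idea of flagging a "direct quote" concrete, here is a minimal sketch of one way such a check could work. It indexes fixed-length token windows from known code and reports whether a suggestion contains any of them. This is an illustration only, not the method used in the study: the tokenizer, the window length of 10 tokens, and the names `tokenize`, `build_index`, and `is_recitation` are all assumptions made for this example.

```python
import re
from typing import Iterable


def tokenize(code: str) -> list[str]:
    # Crude lexer (an assumption for this sketch): runs of word characters,
    # or single non-space symbols. Whitespace is ignored.
    return re.findall(r"\w+|[^\w\s]", code)


def build_index(corpus: Iterable[str], window: int) -> set[tuple[str, ...]]:
    # Collect every run of `window` consecutive tokens seen in the corpus.
    index: set[tuple[str, ...]] = set()
    for doc in corpus:
        toks = tokenize(doc)
        for i in range(len(toks) - window + 1):
            index.add(tuple(toks[i:i + window]))
    return index


def is_recitation(suggestion: str, index: set[tuple[str, ...]], window: int) -> bool:
    # A suggestion is flagged if any of its token windows appears in the index.
    # Suggestions shorter than `window` tokens are never flagged.
    toks = tokenize(suggestion)
    return any(
        tuple(toks[i:i + window]) in index
        for i in range(len(toks) - window + 1)
    )


# Hypothetical one-document "training set" for illustration.
WINDOW = 10
corpus = ["def add(a, b):\n    return a + b\n"]
index = build_index(corpus, WINDOW)

copied = "x = 1\ndef add(a, b):\n    return a + b"
similar = "def mul(a, b):\n    return a * b"
flag_copied = is_recitation(copied, index, WINDOW)
flag_similar = is_recitation(similar, index, WINDOW)
```

The window length controls the trade-off the summary alludes to: a short window flags common boilerplate shared by unrelated code, while a long window only flags extended verbatim copies. A UI built on a check like this could mark flagged suggestions and leave the attribution decision to the user.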