GitHub Copilot research recitation
Blog post from GitHub
GitHub Copilot, an AI tool trained on a large body of public code, was evaluated for how often it suggests code copied verbatim from its training data. In an internal trial, nearly 300 GitHub employees used the tool in their daily work. Most of Copilot's suggestions were unique, but a small percentage were direct recitations from the training set, typically appearing in generic contexts or near the start of a new file. The study found that such recitations occur roughly once every ten user-weeks and mostly involve text that appears many times across public repositories, such as the GNU General Public License. The research proposes a UI feature that flags when a suggestion is a direct quote, so that users can decide on appropriate attribution. Future work aims to reduce the recitation rate and make detection more precise, with the intention of integrating these improvements into the GitHub Copilot technical preview.
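To make the idea of flagging a direct quote concrete, here is a minimal sketch of one way a verbatim-overlap check could work. It is not the method described in the blog post: the token-window approach, the window size, and the names `tokenize`, `build_index`, and `is_recitation` are all assumptions chosen for illustration. A suggestion counts as a recitation here if any run of `window` consecutive tokens also appears in a reference corpus.

```python
import re


def tokenize(text):
    # Split into word-like tokens and single punctuation characters,
    # so that whitespace differences do not affect matching.
    return re.findall(r"\w+|[^\w\s]", text)


def build_index(corpus_docs, window):
    # Hypothetical index: the set of every run of `window` consecutive
    # tokens found in the reference corpus.
    index = set()
    for doc in corpus_docs:
        tokens = tokenize(doc)
        for i in range(len(tokens) - window + 1):
            index.add(tuple(tokens[i:i + window]))
    return index


def is_recitation(suggestion, index, window):
    # Flag the suggestion if any of its token windows occurs in the index.
    # Suggestions shorter than `window` tokens are never flagged.
    tokens = tokenize(suggestion)
    return any(
        tuple(tokens[i:i + window]) in index
        for i in range(len(tokens) - window + 1)
    )


# Illustrative corpus: a sentence from the GPL notice, a text the summary
# names as a typical source of recitations.
corpus = [
    "This program is free software: you can redistribute it and/or "
    "modify it under the terms of the GNU General Public License"
]
index = build_index(corpus, window=8)

quoted = "# This program is free software: you can redistribute it and/or modify"
original = "for i in range(10): print(i)"
```

With this setup, `is_recitation(quoted, index, 8)` is true and `is_recitation(original, index, 8)` is false. The window size trades precision for recall: a small window flags common idioms as quotes, while a large one misses short copied fragments.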