
How we evaluate AI models and LLMs for GitHub Copilot

Blog post from GitHub

Post Details

Company: GitHub
Date Published:
Author: Connor Adams, Klint Finley
Word Count: 1,287
Language: English
Hacker News Points: -
Summary

GitHub has expanded the AI models available in GitHub Copilot by adding Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.5 Pro, and OpenAI's o1-preview and o1-mini. Before any model reaches production, GitHub evaluates it offline for performance, quality, and safety, combining automated tests (which scale across many scenarios) with manual testing (which captures subjective quality). GitHub runs more than 4,000 offline tests plus internal evaluations covering relevance, toxicity, and safety, reflecting its emphasis on responsible AI development. These evaluations assess, among other things, a model's ability to modify codebases and answer technical questions accurately, with another LLM used to verify the responses. The results inform whether a model should be adopted, weighing factors such as acceptance rates and latency. The GitHub Models platform supports using and comparing the various models, in service of the goal of building a high-quality, responsible AI coding assistant.
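To make the "use another LLM to verify responses" step concrete, here is a minimal Python sketch of that LLM-as-judge pattern: a candidate model answers each test prompt, and a second model grades the answer against a reference. All names (TestCase, call_model, judge, evaluate, and the sample test case) are hypothetical illustrations; GitHub's internal evaluation harness and its actual prompts are not public.

```python
# Hypothetical sketch of an offline LLM-as-judge evaluation loop.
# Not GitHub's actual harness; names and prompts are illustrative only.

from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str      # e.g. a technical question or code-modification task
    reference: str   # known-good answer used to ground the judge

TEST_CASES = [
    TestCase(
        prompt="How do I undo the last commit but keep its changes staged?",
        reference="git reset --soft HEAD~1",
    ),
]

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real model API call (e.g. via the GitHub Models platform)."""
    raise NotImplementedError

def judge(candidate_answer: str, case: TestCase, judge_model: str) -> bool:
    """Ask a second LLM to verify the candidate's answer against the reference."""
    verdict = call_model(
        judge_model,
        f"Question: {case.prompt}\n"
        f"Reference answer: {case.reference}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply PASS if the candidate answer is technically correct, else FAIL.",
    )
    return verdict.strip().upper().startswith("PASS")

def evaluate(candidate_model: str, judge_model: str) -> float:
    """Return the fraction of test cases the candidate model passes."""
    passed = sum(
        judge(call_model(candidate_model, case.prompt), case, judge_model)
        for case in TEST_CASES
    )
    return passed / len(TEST_CASES)
```

In a pipeline like the one the post describes, a score from a loop such as this would be only one input to the adoption decision, weighed alongside metrics like suggestion acceptance rates and response latency.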