How we evaluate AI models and LLMs for GitHub Copilot
Blog post from GitHub
GitHub has expanded the AI models available in GitHub Copilot, adding Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview and o1-mini. Before any model reaches production, GitHub evaluates it offline for performance, quality, and safety. These evaluations pair automated tests, which scale across thousands of scenarios, with manual testing for the subjective dimensions of answer quality.

Responsible AI development is a core requirement: candidate models are screened for relevance, toxicity, and safety using more than 4,000 offline tests plus additional internal evaluations. The tests assess a model’s ability to modify codebases and to answer technical questions accurately, with a separate LLM used to verify the candidate’s responses at scale.

The results feed an adoption decision that balances quality signals, such as acceptance rates, against operational factors like latency. The GitHub Models platform makes it straightforward to run and compare the various models, supporting GitHub’s goal of a high-quality, responsible AI coding assistant.
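The post doesn’t share the evaluation harness itself, but a minimal sketch of an automated offline evaluation could look like the following. Everything here is an assumption for illustration: the `complete()` client, the `TestCase` shape, and the metrics are stand-ins, not GitHub’s pipeline.

```python
"""Minimal offline-evaluation sketch (illustrative, not GitHub's code)."""
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str                      # e.g. a code-edit task or technical question
    check: Callable[[str], bool]     # returns True if the response is acceptable

def complete(model: str, prompt: str) -> str:
    """Hypothetical client for the candidate model (placeholder)."""
    raise NotImplementedError("wire up a real model client here")

def run_offline_eval(model: str, cases: list[TestCase]) -> dict:
    """Run every automated test against one candidate model,
    collecting the pass rate and mean response latency."""
    passed, latencies = 0, []
    for case in cases:
        start = time.perf_counter()
        response = complete(model, case.prompt)
        latencies.append(time.perf_counter() - start)
        if case.check(response):
            passed += 1
    return {
        "model": model,
        "pass_rate": passed / len(cases),
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```

Running the same case list against each candidate keeps the comparison apples-to-apples, which is the point of an offline suite like this.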
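For the step where another LLM verifies responses, a common pattern is LLM-as-a-judge: a second model grades each candidate answer against a rubric. The prompt wording and the PASS/FAIL protocol below are assumptions for illustration, not GitHub’s actual prompts.

```python
# LLM-as-a-judge sketch: a second model grades the candidate's answer.
def complete(model: str, prompt: str) -> str:
    """Same hypothetical model client as in the previous sketch."""
    raise NotImplementedError("wire up a real model client here")

# Illustrative rubric prompt; the exact wording is an assumption.
JUDGE_PROMPT = """You are grading an AI coding assistant.
Question: {question}
Candidate answer: {answer}
Reply with exactly PASS if the answer is technically accurate and
relevant, or FAIL otherwise."""

def judge(judge_model: str, question: str, answer: str) -> bool:
    """Ask the judge model for a verdict and parse the PASS/FAIL reply."""
    verdict = complete(judge_model, JUDGE_PROMPT.format(
        question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```

A constrained output format (a single PASS or FAIL token) is what makes this kind of verification cheap to parse and easy to run over thousands of test cases.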
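The adoption decision then weighs quality against latency. A hedged sketch of such a gate follows, using the offline pass rate as a stand-in for the user-facing acceptance rate the post mentions; the regression threshold is invented for illustration.

```python
# Gating sketch: compare a candidate's offline metrics against the
# incumbent model before recommending adoption. The 10% latency
# budget is an assumption, not a documented GitHub policy.
def should_adopt(candidate: dict, incumbent: dict,
                 max_latency_regression: float = 1.10) -> bool:
    """Adopt only if quality improves and latency stays within budget."""
    better_quality = candidate["pass_rate"] > incumbent["pass_rate"]
    acceptable_latency = (candidate["mean_latency_s"]
                          <= incumbent["mean_latency_s"] * max_latency_regression)
    return better_quality and acceptable_latency
```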