How we evaluate AI models and LLMs for GitHub Copilot
Blog post from GitHub
GitHub has expanded the AI models available in GitHub Copilot, adding Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview and o1-mini. Before any model reaches production, GitHub evaluates it offline for performance, quality, and safety. These evaluations pair automated tests, which scale across thousands of scenarios, with manual testing for the subjective dimensions of answer quality.

Responsible AI development is a core requirement: candidate models are screened for relevance, toxicity, and safety using more than 4,000 offline tests plus additional internal evaluations. The tests assess a model’s ability to modify codebases and to answer technical questions accurately, with a separate LLM used to verify the candidate’s responses at scale.

The results feed an adoption decision that balances quality signals, such as acceptance rates, against operational factors like latency. The GitHub Models platform makes it straightforward to run and compare the various models, supporting GitHub’s goal of a high-quality, responsible AI coding assistant.
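The post doesn’t share the evaluation harness itself, but a minimal sketch of an automated offline evaluation could look like the following. Everything here is an assumption for illustration: the `complete()` client, the `TestCase` shape, and the metrics are stand-ins, not GitHub’s pipeline.

```python
"""Minimal offline-evaluation sketch (illustrative, not GitHub's code)."""
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str                      # e.g. a code-edit task or technical question
    check: Callable[[str], bool]     # returns True if the response is acceptable

def complete(model: str, prompt: str) -> str:
    """Hypothetical client for the candidate model (placeholder)."""
    raise NotImplementedError("wire up a real model client here")

def run_offline_eval(model: str, cases: list[TestCase]) -> dict:
    """Run every automated test against one candidate model,
    collecting the pass rate and mean response latency."""
    passed, latencies = 0, []
    for case in cases:
        start = time.perf_counter()
        response = complete(model, case.prompt)
        latencies.append(time.perf_counter() - start)
        if case.check(response):
            passed += 1
    return {
        "model": model,
        "pass_rate": passed / len(cases),
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```

Running the same case list against each candidate keeps the comparison apples-to-apples, which is the point of an offline suite like this.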
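For the step where another LLM verifies responses, a common pattern is LLM-as-a-judge: a second model grades each candidate answer against a rubric. The prompt wording and the PASS/FAIL protocol below are assumptions for illustration, not GitHub’s actual prompts.

```python
# LLM-as-a-judge sketch: a second model grades the candidate's answer.
def complete(model: str, prompt: str) -> str:
    """Same hypothetical model client as in the previous sketch."""
    raise NotImplementedError("wire up a real model client here")

# Illustrative rubric prompt; the exact wording is an assumption.
JUDGE_PROMPT = """You are grading an AI coding assistant.
Question: {question}
Candidate answer: {answer}
Reply with exactly PASS if the answer is technically accurate and
relevant, or FAIL otherwise."""

def judge(judge_model: str, question: str, answer: str) -> bool:
    """Ask the judge model for a verdict and parse the PASS/FAIL reply."""
    verdict = complete(judge_model, JUDGE_PROMPT.format(
        question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```

A constrained output format (a single PASS or FAIL token) is what makes this kind of verification cheap to parse and easy to run over thousands of test cases.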
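The adoption decision then weighs quality against latency. A hedged sketch of such a gate follows, using the offline pass rate as a stand-in for the user-facing acceptance rate the post mentions; the regression threshold is invented for illustration.

```python
# Gating sketch: compare a candidate's offline metrics against the
# incumbent model before recommending adoption. The 10% latency
# budget is an assumption, not a documented GitHub policy.
def should_adopt(candidate: dict, incumbent: dict,
                 max_latency_regression: float = 1.10) -> bool:
    """Adopt only if quality improves and latency stays within budget."""
    better_quality = candidate["pass_rate"] > incumbent["pass_rate"]
    acceptable_latency = (candidate["mean_latency_s"]
                          <= incumbent["mean_latency_s"] * max_latency_regression)
    return better_quality and acceptable_latency
```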