Mastering AI quality: How we use language model evaluations to improve large language model output quality
Blog post from Webflow
Webflow's AI teams have found it challenging to build features on large language models (LLMs) because the models are complex and their outputs are probabilistic. Traditional software engineering practices, such as unit and integration testing, expect deterministic behavior and therefore fall short when applied to LLMs. This creates a need for model evaluations, which assess output quality probabilistically rather than deterministically. Evaluations can be subjective or objective: subjective evaluations rely on human judgment to rate the model's output on a scale, while objective evaluations check for specific conditions. Automating evaluations, particularly subjective ones, means using AI to grade AI outputs, which is efficient but does not always align with human judgment. A combination of automated and manual evaluations is therefore recommended to ensure accuracy and reliability. Webflow has incorporated evaluations extensively into its development lifecycle, using them to align stakeholders and improve the quality of AI features. Evaluations give teams a structured way to discuss and address quality issues, in contrast to traditional bug-tracking methods.
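To make the distinction concrete, here is a minimal Python sketch of the two evaluation styles and of a check on how often an automated grader agrees with human ratings. It is an illustration under stated assumptions, not Webflow's implementation: the names `Sample`, `objective_eval`, `stub_ai_grader`, `pass_rate`, and `agreement_rate`, the 1-5 scale, the length limit, and the sample data are all hypothetical. The grader is a simple keyword heuristic standing in for a real LLM call so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    output: str
    human_score: int  # manual rating on a 1-5 scale (illustrative data)


def objective_eval(output: str, max_chars: int = 80) -> bool:
    """Objective evaluation: check specific, verifiable conditions."""
    text = output.strip()
    return 0 < len(text) <= max_chars and "lorem ipsum" not in text.lower()


def stub_ai_grader(prompt: str, output: str) -> int:
    """Placeholder for an LLM-based grader (hypothetical heuristic).

    A real grader would send the prompt and output to a model and parse
    a 1-5 score from its reply; this stub only keeps the sketch runnable.
    """
    prompt_words = {w.lower().strip(".,") for w in prompt.split() if len(w) > 3}
    output_words = {w.lower().strip(".,") for w in output.split()}
    return 4 if prompt_words & output_words else 2


def pass_rate(samples: List[Sample], check: Callable[[str], bool]) -> float:
    """Fraction of outputs that pass an objective check."""
    return sum(check(s.output) for s in samples) / len(samples)


def agreement_rate(
    samples: List[Sample],
    grader: Callable[[str, str], int],
    tolerance: int = 1,
) -> float:
    """Fraction of samples where the automated score is within
    `tolerance` points of the human score."""
    agree = sum(
        abs(grader(s.prompt, s.output) - s.human_score) <= tolerance
        for s in samples
    )
    return agree / len(samples)


# Made-up samples for illustration only.
SAMPLES = [
    Sample("Write a headline for a bakery",
           "Fresh bread from our bakery, baked daily", 5),
    Sample("Write a headline for a gym", "Lorem ipsum dolor sit amet", 1),
    Sample("Write a tagline for a florist", "Flowers for every occasion", 4),
]

if __name__ == "__main__":
    print("objective pass rate:", pass_rate(SAMPLES, objective_eval))
    print("grader/human agreement:", agreement_rate(SAMPLES, stub_ai_grader))
```

Because results are reported as rates over a set of samples instead of a single pass or fail, they fit the probabilistic framing described above. An agreement rate below 1.0 is a signal that the automated grader should be complemented with manual review.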