LLM Evaluation vs Software testing: why your existing QA process doesn't work | Galtea Blog
Blog post from Galtea
Traditional software testing methodologies and mental models do not transfer well to evaluating large language models (LLMs), because these models behave differently from typical software systems. Unlike deterministic software, LLMs produce probabilistic outputs that can vary even for the same input, which breaks the assumption that identical inputs yield identical outputs. Quality in LLMs is multidimensional and cannot be captured by binary pass/fail tests: a response may be partially correct yet flawed in subtle ways that degrade the user experience. LLMs can also change behavior without any code change, for example when a provider updates the model weights, and their performance can shift as the distribution of inputs shifts, so continuous monitoring is needed rather than static test coverage. Defining quality for an LLM often requires domain expertise from outside the engineering team, which makes rubric-based scoring and domain-specific evaluation criteria essential. LLM evaluation is therefore a distinct process that combines rubric-based scoring, reference comparison, and ongoing production monitoring to check that the model's outputs meet the application's needs across diverse real-world inputs.
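To make the contrast with a binary unit test concrete, here is a minimal Python sketch of rubric-based scoring over repeated samples. It is an illustration under stated assumptions, not Galtea's method: the `Criterion` class, the `score_response` and `evaluate` helpers, the keyword-based checks, the 0.8 threshold, and the stub model are all hypothetical. In a real setup the stub would be replaced by an actual model call, and the checks would likely be written with domain experts.

```python
import random
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Criterion:
    """One rubric dimension (hypothetical structure for this sketch)."""
    name: str
    weight: float
    check: Callable[[str], float]  # returns a score in [0, 1]


def score_response(response: str, rubric: list[Criterion]) -> float:
    """Weighted average of the per-criterion scores, in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * c.check(response) for c in rubric) / total_weight


def evaluate(model: Callable[[str], str], prompt: str, rubric: list[Criterion],
             n_samples: int = 20, threshold: float = 0.8) -> dict:
    """Sample the model several times and summarize the score distribution.

    Outputs can differ between calls, so the result is a set of statistics
    rather than a single pass/fail verdict.
    """
    scores = [score_response(model(prompt), rubric) for _ in range(n_samples)]
    return {
        "mean_score": mean(scores),
        "pass_rate": sum(s >= threshold for s in scores) / n_samples,
        "min_score": min(scores),
    }


def make_stub_model(seed: int = 0) -> Callable[[str], str]:
    """Assumed stand-in for a real LLM call: picks one of several answers."""
    rng = random.Random(seed)
    candidates = [
        "Refunds are available within 30 days of purchase. Contact support to start one.",
        "Refunds are available within 30 days.",  # partially correct
        "We do not offer refunds.",               # wrong
    ]
    return lambda prompt: rng.choice(candidates)


# Illustrative rubric for a refund-policy question; the keyword checks are
# deliberately simplistic placeholders for real domain criteria.
rubric = [
    Criterion("states_30_day_window", 0.6, lambda r: 1.0 if "30 days" in r else 0.0),
    Criterion("gives_next_step", 0.4, lambda r: 1.0 if "support" in r.lower() else 0.0),
]

if __name__ == "__main__":
    report = evaluate(make_stub_model(), "What is the refund policy?", rubric)
    print(report)
```

The same `evaluate` call can be run on a schedule against sampled production prompts, so that a drop in `mean_score` or `pass_rate` surfaces a provider-side model update or an input distribution shift even when no application code has changed.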