Offline vs. Online LLM evaluation: what each catches, what each misses | Galtea Blog
Blog post from Galtea
Offline and online evaluation serve distinct purposes in assessing large language models (LLMs), and each catches a different class of problem. Offline evaluation runs before deployment against a fixed dataset. It catches regressions caused by internal changes, such as a change you made to the system yourself, and so acts as a quality gate. It cannot detect external changes that happen in the live system, such as silent model updates or input drift. Online evaluation monitors performance on real user traffic and catches problems that arise from those external factors, but it is constrained by scale and by the absence of ground truth for live inputs. The two are complementary rather than interchangeable, because they address different failure modes, and both are needed to maintain production quality. Used together, the offline gate catches regressions before deployment while ongoing monitoring detects shifts in the live environment, so degradation does not go unnoticed until users report it. Techniques such as canary evaluation and shadow scoring bridge the two by testing a change on a subset of real traffic. Effective sampling strategies and embedding-based drift detection make online evaluation more reliable.
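To make the division of labour concrete, here is a minimal Python sketch of three of the pieces mentioned above: an offline quality gate, traffic sampling, and embedding-based drift detection. It is an illustration under stated assumptions, not Galtea's implementation. All function names, the `max_drop` and `threshold` defaults, and the choice of centroid cosine distance as the drift signal are hypothetical, and the embeddings are assumed to be precomputed by whatever model you already use.

```python
import math
import random
from statistics import mean


def offline_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Offline check: pass only if the candidate's mean score on the fixed
    dataset is no more than max_drop below the baseline (assumed tolerance)."""
    return mean(candidate_scores) >= mean(baseline_scores) - max_drop


def sample_traffic(requests, rate, seed=None):
    """Online check, step 1: keep each live request with probability `rate`,
    so only a manageable fraction of traffic is scored."""
    rng = random.Random(seed)
    return [r for r in requests if rng.random() < rate]


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(reference, live, threshold=0.1):
    """Online check, step 2: compare the centroid of live input embeddings
    with the centroid of the reference (offline dataset) embeddings.
    Returns (distance, drifted). The threshold is an assumed value that
    would need tuning on real data."""
    distance = cosine_distance(centroid(reference), centroid(live))
    return distance, distance > threshold


if __name__ == "__main__":
    # Toy 2-D "embeddings" purely for illustration.
    reference = [[1.0, 0.0], [0.9, 0.1]]
    live = [[0.1, 0.9], [0.0, 1.0]]
    print(offline_gate([0.90, 0.85], [0.89, 0.86]))
    print(embedding_drift(reference, live))
```

Comparing centroids is only one simple drift signal: it is cheap, but it can miss changes that leave the mean embedding in place, such as inputs spreading out in several directions at once. The gate and the drift check are deliberately separate functions, since the first runs before deployment on fixed data and the second runs continuously on sampled traffic.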