How to ship a local LLM that matches frontier LLMs with evals and prompt engineering
Blog post from Arize
To reduce costs and improve efficiency, the author explores smaller, local language models (SLMs) as alternatives to frontier models for specific AI tasks within a social and news app named Mima. Using capability evaluations (evals) and prompt engineering, they deploy a local 3B model that matches the performance of the Claude Sonnet model on summarization tasks while running faster and incurring no additional per-call costs. The text details how to select a Small And Good Enough (SAGE) model through a four-step framework: prototype with a state-of-the-art model, set success criteria, test a range of models, and choose the most suitable one by balancing accuracy against latency. The author highlights the importance of evals in assessing model capabilities and emphasizes the role of prompt engineering in mitigating issues such as hallucination. The post also discusses deterministic solutions and engineering techniques that improve model performance without additional inference costs, as well as regression evals that keep output quality from degrading over time.
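The last two steps of the framework (test a range of models, then choose one by balancing accuracy and latency) can be sketched as a small selection routine. This is a minimal illustration, not the author's implementation: the `EvalResult` type, the `pick_sage_model` function, the model names, the thresholds, and every number below are hypothetical placeholders, and the tie-breaking rule (prefer the smallest qualifying model) is an assumption about what "small and good enough" means in practice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalResult:
    """Eval outcome for one candidate model (all fields are illustrative)."""
    model: str            # hypothetical model label
    params_b: float       # parameter count, in billions
    accuracy: float       # fraction of eval cases passed, 0.0 to 1.0
    p50_latency_s: float  # median latency per request, in seconds


def pick_sage_model(
    results: List[EvalResult],
    min_accuracy: float,
    max_latency_s: float,
) -> Optional[EvalResult]:
    """Return the smallest model that meets both success criteria, or None."""
    qualifying = [
        r for r in results
        if r.accuracy >= min_accuracy and r.p50_latency_s <= max_latency_s
    ]
    if not qualifying:
        return None
    # Prefer fewer parameters; break ties on lower latency.
    return min(qualifying, key=lambda r: (r.params_b, r.p50_latency_s))


# Made-up eval results, for illustration only.
candidates = [
    EvalResult("model-1b", 1.0, 0.71, 0.4),
    EvalResult("model-3b", 3.0, 0.90, 0.8),
    EvalResult("model-8b", 8.0, 0.93, 2.5),
]

# Success criteria would come from the prototype built on the frontier model.
choice = pick_sage_model(candidates, min_accuracy=0.88, max_latency_s=2.0)
print(choice.model if choice else "no candidate meets the criteria")
```

Returning `None` when nothing qualifies keeps the decision explicit: the options at that point are to improve the prompt and rerun the evals, relax the criteria, or stay on the larger model.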