RAG, or Retrieval-Augmented Generation, improves AI agent accuracy by grounding responses in an external knowledge base, so the model does not need to hold all domain knowledge in its context. Before retrieval, the system runs a query rewriting step that transforms vague user requests into precise search queries. This improves retrieval quality, but the extra model call adds latency to every request.

To address this, the team implemented a "model racing" approach: the rewrite request is sent to multiple models simultaneously, and the first valid response is used. This reduced median latency from 326ms to 155ms, roughly halving the latency the rewrite step adds to the RAG pipeline. Racing also makes the system more resilient: if one external model provider has an outage, another racer still answers, so the system keeps serving real-time conversational traffic over large knowledge bases without compromising performance.