AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators
Blog post from Together AI
Together AI is enhancing the inference performance of large language models with its AdapTive-LeArning Speculator System (ATLAS), part of the Together Turbo inference suite. Unlike traditional static or custom-trained speculators, ATLAS is designed to improve automatically, without manual tuning, by adapting to real-time usage patterns.

The system employs two cooperating speculators: a static speculator trained on a broad corpus, and a lightweight adaptive speculator that updates continuously from live traffic. A confidence-aware controller coordinates the two, choosing how far ahead to speculate: when confidence is high, a longer lookahead amortizes more of the target model's cost per verification step; when confidence is low, a shorter lookahead avoids wasting drafts that would be rejected.

ATLAS has demonstrated significant performance gains, reaching up to 500 tokens per second on DeepSeek-V3.1 and, per Together AI, outperforming specialized hardware by dynamically aligning with evolving workloads. This advancement underscores Together AI's commitment to delivering scalable, efficient AI systems that are continuously optimized for speed and adaptability, reducing latency while maintaining output quality in varied and rapidly changing environments.
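The post does not publish ATLAS's internals, but the described mechanism can be sketched in miniature. The code below is a toy illustration, not Together AI's implementation: `StaticSpeculator`, `AdaptiveSpeculator`, and `choose_lookahead` are hypothetical names, real speculators are small neural models rather than bigram tables, and the confidence-to-lookahead mapping here is an assumed linear rule.

```python
class StaticSpeculator:
    """Stand-in for a speculator trained offline on a broad corpus.

    Toy behavior: always drafts by repeating the last token, with a
    fixed, middling confidence.
    """

    def draft(self, prefix, k):
        return [prefix[-1]] * k, 0.5  # (draft tokens, confidence)


class AdaptiveSpeculator:
    """Stand-in for a lightweight speculator updated from live traffic.

    Toy behavior: memorizes bigrams from accepted output and reports
    confidence as the fraction of draft steps backed by a known bigram.
    """

    def __init__(self):
        self.bigrams = {}

    def update(self, accepted_tokens):
        # "Runtime learning": fold freshly accepted tokens back in.
        for a, b in zip(accepted_tokens, accepted_tokens[1:]):
            self.bigrams[a] = b

    def draft(self, prefix, k):
        out, cur, hits = [], prefix[-1], 0
        for _ in range(k):
            hits += cur in self.bigrams
            cur = self.bigrams.get(cur, cur)
            out.append(cur)
        return out, (hits / k if k else 0.0)


def choose_lookahead(confidence, lo=2, hi=8):
    """Confidence-aware controller (assumed linear rule):
    speculate further when the speculator is more confident."""
    return lo + int(confidence * (hi - lo))


def speculate_step(prefix, static_spec, adaptive_spec):
    """One decoding step: draft with whichever speculator is more
    confident, sized by the controller's chosen lookahead."""
    s_draft, s_conf = static_spec.draft(prefix, 4)  # probe drafts
    a_draft, a_conf = adaptive_spec.draft(prefix, 4)
    spec, conf = (
        (adaptive_spec, a_conf) if a_conf >= s_conf else (static_spec, s_conf)
    )
    k = choose_lookahead(conf)
    draft, _ = spec.draft(prefix, k)
    return draft  # would next be verified in parallel by the target model
```

In a real system, the returned draft is verified in a single batched forward pass of the target model; accepted tokens extend the context and feed `update`, so the adaptive speculator tracks the live workload that the static one cannot.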