Benchmarking inference at scale: coding agents
Blog post from Together AI
The Together Inference Engine demonstrates significant performance advantages over TensorRT-LLM and SGLang on high-concurrency coding-agent workloads, delivering over 50% more tokens per second (TPS) and roughly half the time to first token (TTFT) at saturation on comparable hardware. The benchmark is designed to simulate real production conditions, with long inputs and little tolerance for latency, and it highlights how each engine behaves under load: Together's engine keeps serving at traffic levels where the competing engines degrade.

The Kimi K2.6 model, available on the Together platform, matches Claude Opus 4.6 on coding benchmarks at a substantially lower cost, roughly 76% cheaper per request, making it a cost-effective option for large-scale agent deployments.

The study emphasizes the importance of realistic benchmarks and of low-level optimizations such as the ThunderMLA kernel, which improves performance by cutting launch overhead and tightening execution efficiency, making Together's engine a robust choice for high-demand environments.
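To make the measurement concrete, here is a minimal async load generator that streams completions from an OpenAI-compatible endpoint and records TTFT and aggregate TPS. This is a sketch only, not Together's benchmark harness: the endpoint URL, model id, concurrency level, and prompt are placeholder assumptions, the auth header is a stub, and counting one token per SSE chunk is an approximation.

```python
# Hypothetical load-test sketch: measures TTFT and aggregate TPS for N
# concurrent streaming requests against an OpenAI-compatible endpoint.
import asyncio
import time

import httpx

BASE_URL = "https://api.together.xyz/v1/chat/completions"  # assumed endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}         # auth placeholder
MODEL = "moonshotai/Kimi-K2.6"                             # hypothetical model id
CONCURRENCY = 64                                           # simulated agent sessions
# Long input, mimicking a coding agent sending a large context.
PROMPT = "Review this diff and suggest fixes:\n" + "print('x')\n" * 2000


async def one_request(client: httpx.AsyncClient, results: list) -> None:
    payload = {
        "model": MODEL,
        "stream": True,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": PROMPT}],
    }
    start = time.perf_counter()
    ttft, chunks = None, 0
    async with client.stream("POST", BASE_URL, headers=HEADERS, json=payload) as resp:
        async for line in resp.aiter_lines():
            # SSE data lines look like "data: {...}"; skip keepalives and [DONE].
            if not line.startswith("data: ") or line.endswith("[DONE]"):
                continue
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            chunks += 1  # ~1 SSE chunk per output token (approximation)
    results.append((ttft, chunks, time.perf_counter() - start))


async def main() -> None:
    results: list = []
    async with httpx.AsyncClient(timeout=120) as client:
        await asyncio.gather(*(one_request(client, results) for _ in range(CONCURRENCY)))
    ttfts = sorted(r[0] for r in results if r[0] is not None)
    total_tokens = sum(r[1] for r in results)
    wall = max(r[2] for r in results)  # batch finishes when the slowest request does
    print(f"p50 TTFT: {ttfts[len(ttfts) // 2]:.2f}s")
    print(f"aggregate TPS: {total_tokens / wall:.1f}")


asyncio.run(main())
```

Sweeping `CONCURRENCY` upward until TPS plateaus and TTFT blows up is what "saturation" means in this kind of benchmark.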
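The per-request cost comparison is simple to reproduce. The sketch below shows the shape of the calculation; the token counts and per-million-token prices are illustrative assumptions chosen so the output lands near the post's figure, not the actual pricing behind it.

```python
# Back-of-envelope per-request cost comparison. All prices and token counts
# are illustrative assumptions, not the actual figures behind the 76% claim.
def cost_per_request(in_tokens: int, out_tokens: int,
                     in_price: float, out_price: float) -> float:
    """Prices are USD per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


req = dict(in_tokens=40_000, out_tokens=2_000)  # long coding-agent context
kimi = cost_per_request(**req, in_price=1.20, out_price=6.00)   # assumed prices
opus = cost_per_request(**req, in_price=5.00, out_price=25.00)  # assumed prices
print(f"Kimi: ${kimi:.4f}  Opus: ${opus:.4f}  savings: {1 - kimi / opus:.0%}")
# -> Kimi: $0.0600  Opus: $0.2500  savings: 76%
```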
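One way to see why a fused megakernel approach like ThunderMLA pays off is a back-of-envelope latency model: launching many small kernels means paying a fixed launch overhead on every step, while a single persistent kernel pays it once. The overhead and work figures below are assumed round numbers for illustration, not measured ThunderMLA data.

```python
# Toy latency model: per-launch overhead vs. one persistent "megakernel".
# All microsecond figures are assumptions, not measurements.
LAUNCH_OVERHEAD_US = 5.0   # assumed cost to launch + tear down one kernel
WORK_US = 12.0             # assumed useful compute per attention step
STEPS = 1_000              # decode steps in one generation

separate = STEPS * (LAUNCH_OVERHEAD_US + WORK_US)   # overhead paid every step
fused = LAUNCH_OVERHEAD_US + STEPS * WORK_US        # overhead paid once
print(f"separate launches: {separate / 1e3:.1f} ms")
print(f"fused megakernel:  {fused / 1e3:.1f} ms "
      f"({1 - fused / separate:.0%} less time)")
```

Under these assumptions the fused path saves roughly 30% of wall-clock time on overhead alone, which is the kind of win the post attributes to this class of optimization at high concurrency.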