
Announcing Together Inference Engine 2.0 with new Turbo and Lite endpoints

Blog post from Together AI

Post Details

Company: Together AI
Date Published:
Author: Together AI
Word Count: 1,802
Language: English
Hacker News Points: 3
Summary

The Together Inference Engine 2.0 introduces new Turbo and Lite endpoints, delivering faster decoding throughput than leading commercial solutions while preserving model quality. The endpoints span performance, quality, and price tiers, so enterprises can scale their applications without compromising on any of the three. With Together Turbo and Together Lite, developers can build Generative AI applications at production scale using the fastest engine for NVIDIA GPUs and a highly accurate, cost-efficient alternative. The engine achieves over 400 tokens per second on Meta Llama 3 8B by combining advanced techniques such as FlashAttention-3, faster GEMM and MHA kernels, quality-preserving quantization, and speculative decoding. The new endpoints are available starting today for Llama 3 models, with rollout across other models planned soon.
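One of the techniques the summary mentions, speculative decoding, can be illustrated with a toy greedy sketch: a cheap "draft" model proposes several tokens ahead, and the expensive "target" model verifies them, accepting the longest agreeing prefix. The two model functions below are hypothetical stand-ins over integer token IDs, not Together's actual engine or the Llama 3 models.

```python
def draft_model(context):
    # Cheap proposer (hypothetical rule): next token = last token + 1, mod 50.
    return (context[-1] + 1) % 50

def target_model(context):
    # Expensive verifier (hypothetical rule): agrees with the draft
    # except that it maps multiples of 10 to token 0.
    t = (context[-1] + 1) % 50
    return t if t % 10 != 0 else 0

def speculative_decode(context, n_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, then verify them."""
    out = list(context)
    while len(out) - len(context) < n_tokens:
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_model(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) Verify: accept the longest prefix the target model agrees with,
        #    then substitute the target's own token at the first mismatch.
        for tok in proposal:
            expected = target_model(out)
            out.append(expected)
            if expected != tok:
                break  # reject the remainder of the draft
    return out[len(context):len(context) + n_tokens]

print(speculative_decode([1], 8))  # several draft tokens accepted per pass
print(speculative_decode([8], 3))  # draft rejected at the multiple of 10
```

When the draft and target models agree often, each expensive verification pass commits multiple tokens, which is the source of the decoding-throughput gain this technique provides.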