Company: Together AI
Date Published:
Author: Together AI
Word count: 880
Language: English
Hacker News points: 2

Summary

The Together Inference Engine is a fast inference stack that outperforms other inference services by up to 3x on the same hardware, reaching 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat. It is built on CUDA, runs on NVIDIA Tensor Core GPUs, and uses optimizations such as FlashAttention-2, Flash-Decoding, and Medusa to speed up inference while matching the output quality of the reference Hugging Face implementation. New features include Serverless Endpoints with automatically managed capacity and scaling, Dedicated Instances for custom models, auto-scaling, and expanded model availability. The efficiency gains have allowed pricing to be lowered, making the service 20% cheaper than some competitors while remaining faster.
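
Medusa belongs to the family of speculative-decoding methods: cheap draft heads propose several future tokens, and the main model verifies them in one pass, keeping only the prefix it agrees with. The toy sketch below shows just that accept/verify control flow; target_next and draft_propose are hypothetical stand-ins for real model heads, and real Medusa additionally uses tree attention to verify several candidate continuations at once.

    # Toy illustration of the speculative-decoding loop behind Medusa-style
    # methods. `target_next` stands in for the main model's greedy next-token
    # choice; `draft_propose` stands in for the cheap draft heads. Both are
    # hypothetical placeholders, not part of any real API.

    def target_next(context: list[int]) -> int:
        # Stand-in "main model": a fixed deterministic rule over the context.
        return (sum(context) * 31 + 7) % 100

    def draft_propose(context: list[int], k: int) -> list[int]:
        # Stand-in "draft heads": agree with the target except on the last guess.
        out, ctx = [], list(context)
        for i in range(k):
            guess = target_next(ctx) if i < k - 1 else 0
            out.append(guess)
            ctx.append(guess)
        return out

    def speculative_decode(context: list[int], steps: int, k: int = 4) -> list[int]:
        ctx = list(context)
        while steps > 0:
            proposal = draft_propose(ctx, k)
            # Verify: accept the longest prefix the main model agrees with,
            # then append one token from the main model itself.
            accepted = []
            for tok in proposal:
                if tok == target_next(ctx + accepted):
                    accepted.append(tok)
                else:
                    break
            accepted.append(target_next(ctx + accepted))
            accepted = accepted[:steps]
            ctx.extend(accepted)
            steps -= len(accepted)
        return ctx

    print(speculative_decode([1, 2, 3], steps=10))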
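
The Serverless Endpoints are reachable over a plain HTTP API. Below is a minimal sketch of a completion request using Python's requests library; the endpoint URL, model identifier, and payload fields are assumptions modeled on Together's public API documentation and should be checked against the current reference.

    # Minimal sketch of calling a Together Serverless Endpoint.
    # Endpoint URL, model name, and payload fields are assumptions based on
    # Together's public docs; verify against the current API reference.
    import os
    import requests

    resp = requests.post(
        "https://api.together.xyz/v1/completions",  # assumed endpoint
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json={
            "model": "togethercomputer/llama-2-70b-chat",  # assumed model id
            "prompt": "[INST] Explain speculative decoding in one sentence. [/INST]",
            "max_tokens": 128,
            "temperature": 0.7,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["text"])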