Company: Together AI
Date Published:
Author: Together AI
Word count: 880
Language: English
Hacker News points: 2

Summary

The Together Inference Engine is a fast inference stack that outperforms other inference services by up to 3x on the same hardware, reaching 117 tokens per second on Llama-2-70B-Chat and 171 tokens per second on Llama-2-13B-Chat. It is built on CUDA, runs on NVIDIA Tensor Core GPUs, and uses optimizations such as FlashAttention-2, Flash-Decoding, and Medusa to speed up inference while matching the output quality of the reference Hugging Face implementation. New features include Serverless Endpoints with automatically managed capacity and scaling, Dedicated Instances for custom models, auto-scaling, and expanded model availability. The efficiency gains have allowed pricing to be lowered, making the service 20% cheaper than some competitors while remaining faster.
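
Medusa belongs to the family of speculative-decoding methods: cheap draft heads propose several future tokens, and the main model verifies them in one pass, keeping only the prefix it agrees with. The toy sketch below shows just that accept/verify control flow; target_next and draft_propose are hypothetical stand-ins for real model heads, and real Medusa additionally uses tree attention to verify several candidate continuations at once.

    # Toy illustration of the speculative-decoding loop behind Medusa-style
    # methods. `target_next` stands in for the main model's greedy next-token
    # choice; `draft_propose` stands in for the cheap draft heads. Both are
    # hypothetical placeholders, not part of any real API.

    def target_next(context: list[int]) -> int:
        # Stand-in "main model": a fixed deterministic rule over the context.
        return (sum(context) * 31 + 7) % 100

    def draft_propose(context: list[int], k: int) -> list[int]:
        # Stand-in "draft heads": agree with the target except on the last guess.
        out, ctx = [], list(context)
        for i in range(k):
            guess = target_next(ctx) if i < k - 1 else 0
            out.append(guess)
            ctx.append(guess)
        return out

    def speculative_decode(context: list[int], steps: int, k: int = 4) -> list[int]:
        ctx = list(context)
        while steps > 0:
            proposal = draft_propose(ctx, k)
            # Verify: accept the longest prefix the main model agrees with,
            # then append one token from the main model itself.
            accepted = []
            for tok in proposal:
                if tok == target_next(ctx + accepted):
                    accepted.append(tok)
                else:
                    break
            accepted.append(target_next(ctx + accepted))
            accepted = accepted[:steps]
            ctx.extend(accepted)
            steps -= len(accepted)
        return ctx

    print(speculative_decode([1, 2, 3], steps=10))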
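
The Serverless Endpoints are reachable over a plain HTTP API. Below is a minimal sketch of a completion request using Python's requests library; the endpoint URL, model identifier, and payload fields are assumptions modeled on Together's public API documentation and should be checked against the current reference.

    # Minimal sketch of calling a Together Serverless Endpoint.
    # Endpoint URL, model name, and payload fields are assumptions based on
    # Together's public docs; verify against the current API reference.
    import os
    import requests

    resp = requests.post(
        "https://api.together.xyz/v1/completions",  # assumed endpoint
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json={
            "model": "togethercomputer/llama-2-70b-chat",  # assumed model id
            "prompt": "[INST] Explain speculative decoding in one sentence. [/INST]",
            "max_tokens": 128,
            "temperature": 0.7,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["text"])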