Content Deep Dive

Together AI delivers fastest inference for the top open-source models

Blog post from Together AI

Post Details
Company: Together AI
Date Published:
Author: Jue Wang, Wai Tong Chung, Chenxi Li, Chandra Mourya, John Heo, Shirley Wu, Alaskar Alizada, Rupert Wu, Roy Yuan, Pragaash Ponnusamy, Ben Athiwaratkun, Leon Song
Word Count: 870
Language: English
Hacker News Points: -
Summary

The team has focused on making their inference platform the fastest for running top open-source models, achieving significant performance improvements validated by benchmarks from Artificial Analysis. The platform now ranks first in output speed for several demanding models, including GPT-OSS-20B, GPT-OSS-120B, and Qwen-3-235B-Instruct, delivering up to 2.75 times faster output than competitors. These gains come from an integrated approach spanning full use of advanced GPU hardware, kernel optimization, low-bit quantization, and a scalable speculative decoding algorithm. The team re-engineered the entire system architecture to maximize the potential of NVIDIA Blackwell GPUs, pairing high-performance kernels with FP8 and FP4 quantization strategies that preserve model accuracy while boosting speed. A scalable draft-model training pipeline further supports this performance leap, producing efficient speculative decoders and high-acceptance draft models. The company remains committed to advancing open-source AI model performance and scalability, with ongoing research into faster generation strategies and hybrid quantization approaches.
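
To make the speculative decoding idea mentioned above concrete, here is a minimal, illustrative sketch of the general draft-and-verify technique: a cheap draft model proposes a few tokens and the large target model accepts or rejects them. The toy vocabulary and the `draft_probs`/`target_probs` stand-in models below are hypothetical and are not Together AI's implementation or decoder.

```python
# Minimal sketch of draft-and-verify speculative decoding (illustrative only).
# The two "models" are hypothetical fixed distributions over a toy vocabulary.
import random

VOCAB = ["the", "fast", "open", "model", "runs", "."]

def draft_probs(context):
    # Hypothetical cheap draft model: returns a distribution over VOCAB.
    return dict(zip(VOCAB, [0.30, 0.20, 0.20, 0.15, 0.10, 0.05]))

def target_probs(context):
    # Hypothetical large target model: a different distribution over VOCAB.
    return dict(zip(VOCAB, [0.25, 0.15, 0.25, 0.20, 0.10, 0.05]))

def sample(probs):
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

def speculative_step(context, k=4):
    """Propose k tokens with the draft model, then verify them with the target.

    Each drafted token x is accepted with probability min(1, p_target(x) / p_draft(x)).
    On the first rejection, a replacement token is resampled from the residual
    distribution max(p_target - p_draft, 0), which keeps the overall output
    distributed as if sampled from the target model alone.
    """
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verification phase: the target model checks each drafted token in order.
    accepted, ctx = [], list(context)
    for tok in drafted:
        p, q = target_probs(ctx), draft_probs(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # target agrees often enough: keep the draft token
            ctx.append(tok)
        else:
            # Rejected: resample from the renormalized residual distribution and stop.
            residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
            total = sum(residual.values()) or 1.0
            accepted.append(sample({t: v / total for t, v in residual.items()}))
            break
    return accepted

if __name__ == "__main__":
    random.seed(0)
    print(speculative_step(["the"], k=4))
```

The appeal of this scheme is that the target model can verify all k drafted tokens in a single forward pass, so every accepted token costs roughly one large-model step instead of k, which is why the high-acceptance draft models described in the summary matter for end-to-end speed.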