Content Deep Dive

Together AI delivers fastest inference for the top open-source models

Blog post from Together AI

Post Details
Company: Together AI
Date Published:
Author: Jue Wang, Wai Tong Chung, Chenxi Li, Chandra Mourya, John Heo, Shirley Wu, Alaskar Alizada, Rupert Wu, Roy Yuan, Pragaash Ponnusamy, Ben Athiwaratkun, Leon Song
Word Count: 870
Language: English
Hacker News Points: -
Summary

The team has focused on making their inference platform the fastest for running top open-source models, achieving significant performance improvements validated by benchmarks from Artificial Analysis. The platform now ranks first in output speed for several demanding models, including GPT-OSS-20B, GPT-OSS-120B, and Qwen-3-235B-Instruct, delivering up to 2.75 times faster output than competitors. These gains come from an integrated approach spanning full use of advanced GPU hardware, kernel optimization, low-bit quantization, and a scalable speculative decoding algorithm. The team re-engineered the entire system architecture to maximize the potential of NVIDIA Blackwell GPUs, pairing high-performance kernels with FP8 and FP4 quantization strategies that preserve model accuracy while boosting speed. A scalable draft-model training pipeline further supports this performance leap, producing efficient speculative decoders and high-acceptance draft models. The company remains committed to advancing open-source AI model performance and scalability, with ongoing research into faster generation strategies and hybrid quantization approaches.
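
To make the speculative decoding idea mentioned above concrete, here is a minimal, illustrative sketch of the general draft-and-verify technique: a cheap draft model proposes a few tokens and the large target model accepts or rejects them. The toy vocabulary and the `draft_probs`/`target_probs` stand-in models below are hypothetical and are not Together AI's implementation or decoder.

```python
# Minimal sketch of draft-and-verify speculative decoding (illustrative only).
# The two "models" are hypothetical fixed distributions over a toy vocabulary.
import random

VOCAB = ["the", "fast", "open", "model", "runs", "."]

def draft_probs(context):
    # Hypothetical cheap draft model: returns a distribution over VOCAB.
    return dict(zip(VOCAB, [0.30, 0.20, 0.20, 0.15, 0.10, 0.05]))

def target_probs(context):
    # Hypothetical large target model: a different distribution over VOCAB.
    return dict(zip(VOCAB, [0.25, 0.15, 0.25, 0.20, 0.10, 0.05]))

def sample(probs):
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

def speculative_step(context, k=4):
    """Propose k tokens with the draft model, then verify them with the target.

    Each drafted token x is accepted with probability min(1, p_target(x) / p_draft(x)).
    On the first rejection, a replacement token is resampled from the residual
    distribution max(p_target - p_draft, 0), which keeps the overall output
    distributed as if sampled from the target model alone.
    """
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verification phase: the target model checks each drafted token in order.
    accepted, ctx = [], list(context)
    for tok in drafted:
        p, q = target_probs(ctx), draft_probs(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # target agrees often enough: keep the draft token
            ctx.append(tok)
        else:
            # Rejected: resample from the renormalized residual distribution and stop.
            residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
            total = sum(residual.values()) or 1.0
            accepted.append(sample({t: v / total for t, v in residual.items()}))
            break
    return accepted

if __name__ == "__main__":
    random.seed(0)
    print(speculative_step(["the"], k=4))
```

The appeal of this scheme is that the target model can verify all k drafted tokens in a single forward pass, so every accepted token costs roughly one large-model step instead of k, which is why the high-acceptance draft models described in the summary matter for end-to-end speed.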