Together AI delivers the fastest inference for the top open-source models
Blog post from Together AI
Together AI has re-engineered its inference platform to be the fastest place to run the top open-source models, and independent benchmarks from Artificial Analysis validate the result. The platform now ranks first in output speed for several demanding models, including GPT-OSS-20B, GPT-OSS-120B, and Qwen-3-235B-Instruct, with throughput up to 2.75x higher than competing providers.

These gains come from an integrated approach rather than any single optimization: full exploitation of NVIDIA Blackwell GPUs, high-performance custom kernels, low-bit FP8 and FP4 quantization that boosts speed while preserving model accuracy, and a scalable speculative decoding algorithm. A scalable draft-model training pipeline rounds out the stack, producing small, high-acceptance draft models that let the speculative decoder emit several tokens per expensive target-model pass; both techniques are sketched below. Together AI remains committed to advancing the performance and scalability of open-source AI models, with ongoing research into faster generation strategies and hybrid quantization approaches.
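The post describes the speculative decoder only at a high level, so the following is a minimal, self-contained sketch of the standard draft-then-verify speculative sampling rule, not Together's production implementation. The `toy_model`, `draft_probs`, and `target_probs` functions are placeholders for real model forward passes, and `k=4` is an assumed draft length; the accept/reject logic is the standard construction that keeps the output distribution identical to the target model's.

```python
import numpy as np

VOCAB = 16  # toy vocabulary (assumption; real vocabularies are ~100k tokens)
rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_model(prefix, temperature):
    """Deterministic stand-in for a model's next-token distribution.

    A sharp (low-temperature) variant plays the large target model; a
    blurrier variant plays the small draft model, so they mostly agree.
    """
    seed = hash(tuple(int(t) for t in prefix)) % (2**32)
    logits = np.random.default_rng(seed).standard_normal(VOCAB)
    return softmax(logits / temperature)

def target_probs(prefix):
    return toy_model(prefix, temperature=1.0)

def draft_probs(prefix):
    # Same logits, fuzzier: a high-acceptance draft of the target model.
    return toy_model(prefix, temperature=1.5)

def speculative_step(prefix, k=4):
    """One round of draft-then-verify speculative sampling.

    The draft model proposes k tokens autoregressively; the target model
    scores them all in one (batched) pass and accepts each proposal x with
    probability min(1, p(x) / q(x)). On the first rejection we resample
    from the residual max(0, p - q), which makes the overall output
    distribution exactly the target model's.
    """
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = rng.choice(VOCAB, p=q)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok, q in zip(drafted, q_dists):
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)            # target agrees: keep the cheap draft token
            ctx.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            return accepted                 # stop at the first rejection
    # All k drafts accepted: the target's verification pass yields a bonus token.
    accepted.append(rng.choice(VOCAB, p=target_probs(ctx)))
    return accepted

tokens = [1, 2, 3]
for _ in range(5):
    tokens += speculative_step(tokens)
print(tokens)
```

The key property is that a higher-acceptance draft model means more of the k proposals survive verification, which is why the draft-model training pipeline matters: every accepted draft token is one target-model decode step avoided.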
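The FP8/FP4 kernels themselves are likewise not shown in the post. As rough intuition for why low-bit quantization can preserve accuracy, here is a toy numpy illustration of per-block symmetric quantization; the integer levels stand in for the hardware FP8/FP4 formats, and `block=32` is an assumed block size, not a detail from the post.

```python
import numpy as np

def quantize_blockwise(w, bits=4, block=32):
    """Toy per-block symmetric quantizer (illustrative only; real FP8/FP4
    inference uses hardware float formats and fused Blackwell kernels).

    Each block of `block` weights shares one scale, chosen so the block's
    largest magnitude maps onto the top representable integer level.
    """
    levels = 2 ** (bits - 1) - 1                       # e.g. 7 for signed 4-bit
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0                            # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_blockwise(w, bits=4)
w_hat = dequantize_blockwise(q, s)
print(f"mean abs error at 4 bits: {np.abs(w - w_hat).mean():.4f}")
```

Sharing one scale per small block keeps rounding error low even at 4 bits, while shrinking weight storage and memory traffic by roughly 4-8x versus FP16/BF16, which is where much of the decode-time speedup comes from.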