Company: Baseten
Date Published:
Author: Abu Qader and 2 others
Word count: 1,520
Language: English
Hacker News points: None

Summary

The blog post covers the launch of Kimi K2 Thinking on the Baseten Model API and the technical work behind it. The open-source model achieves state-of-the-art performance with 300 milliseconds of latency and over 140 tokens per second, as measured by Artificial Analysis. Built on the Baseten Inference Stack, the deployment runs on a single 8xB200 node in NVFP4 and combines tensor parallelism, expert parallelism, and KV-aware routing to improve performance and cache reuse rates.

Kimi K2 Thinking is comparable to leading models such as GPT-5 and Claude Sonnet 4.5, and it stands out for speed and cost-effectiveness, narrowing the intelligence gap between open-source and closed models. The post highlights NVIDIA Blackwell GPUs as crucial to the combination of high throughput and low latency, despite the challenge of converting the model's weights from INT4 to NVFP4. Advanced parallelism strategies and KV cache reuse optimize processing for complex AI tasks such as code generation and long-context queries, and the post notes ongoing performance improvements and planned enhancements to the model's functionality and quality.
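To make the parallelism strategy concrete: tensor parallelism splits every attention computation across all GPUs in the node, while expert parallelism gives each GPU a disjoint subset of the MoE experts. The sketch below shows one such placement; the GPU, head, and expert counts are illustrative assumptions rather than Baseten's published configuration.

```python
# Minimal sketch of tensor-parallel + expert-parallel placement on one node.
# All counts are illustrative assumptions, not Baseten's actual configuration.
NUM_GPUS = 8        # one 8xB200 node
NUM_HEADS = 64      # attention heads, sharded tensor-parallel
NUM_EXPERTS = 384   # MoE experts, sharded expert-parallel

def shard(total: int, gpu: int) -> range:
    """Contiguous slice of `total` items owned by `gpu`."""
    per = total // NUM_GPUS
    return range(gpu * per, (gpu + 1) * per)

for gpu in range(NUM_GPUS):
    heads = shard(NUM_HEADS, gpu)      # every GPU computes a slice of every attention op
    experts = shard(NUM_EXPERTS, gpu)  # but owns its subset of experts outright
    print(f"GPU {gpu}: heads {heads.start}-{heads.stop - 1}, "
          f"experts {experts.start}-{experts.stop - 1}")
```

The practical trade-off between the two: tensor parallelism needs an all-reduce after each sharded matmul, while expert parallelism only exchanges the tokens that get routed to experts living on other GPUs.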
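The INT4-to-NVFP4 conversion can be pictured as a two-step requantization: dequantize the INT4 checkpoint back to floating point, then requantize to FP4 (E2M1) values with one scale per 16-element block, the layout Blackwell's tensor cores consume. Here is a NumPy sketch under stated assumptions: symmetric group quantization on the INT4 side, and float32 scales where real NVFP4 stores FP8 (E4M3) scales.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
NVFP4_BLOCK = 16  # NVFP4 uses one scale factor per 16 elements

def dequant_int4(q: np.ndarray, scales: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Dequantize symmetric group-quantized INT4 weights (codes in [-8, 7])."""
    groups = q.astype(np.float32).reshape(-1, group_size)
    return (groups * scales[:, None]).reshape(-1)

def quant_nvfp4(w: np.ndarray):
    """Requantize to FP4 grid values plus one scale per 16-element block.
    Returns snapped float values rather than packed 4-bit codes, for clarity."""
    blocks = w.reshape(-1, NVFP4_BLOCK)
    scales = np.abs(blocks).max(axis=1) / FP4_GRID[-1]  # map block max onto 6.0
    scales[scales == 0] = 1.0
    scaled = np.abs(blocks) / scales[:, None]
    nearest = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(blocks) * FP4_GRID[nearest], scales

# Round trip on random data standing in for a real checkpoint:
# INT4 codes -> float weights -> NVFP4 values + block scales.
rng = np.random.default_rng(0)
codes = rng.integers(-8, 8, size=4096).astype(np.int8)
g_scales = rng.uniform(0.01, 0.05, size=4096 // 128).astype(np.float32)
w = dequant_int4(codes, g_scales)
q, s = quant_nvfp4(w)
recon = (q * s[:, None]).reshape(-1)
print("max abs conversion error:", float(np.abs(w - recon).max()))
```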
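KV-aware routing steers each request toward the replica most likely to already hold KV cache for its prompt prefix, so long shared prefixes (system prompts, agentic histories) are not re-prefilled. Below is a minimal sketch assuming hash-chained cache blocks and a greedy longest-cached-prefix policy; the class, block size, and tie-breaking rule are hypothetical, not Baseten's implementation.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 64  # tokens per KV-cache block (assumed)

def block_hashes(token_ids: list[int]) -> list[bytes]:
    """Hash the prompt in fixed-size blocks, chaining each hash on the
    previous so one hash identifies the entire prefix up to that block."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        chunk = repr(token_ids[i:i + BLOCK_SIZE]).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

class KVAwareRouter:
    """Send each request to the replica with the longest cached prefix,
    breaking ties by current load."""

    def __init__(self, replicas: list[str]):
        self.cached = {r: set() for r in replicas}  # replica -> known block hashes
        self.load = defaultdict(int)

    def route(self, token_ids: list[int]) -> str:
        hashes = block_hashes(token_ids)

        def score(replica: str) -> tuple[int, int]:
            hits = 0
            for h in hashes:  # count only the contiguous cached prefix
                if h not in self.cached[replica]:
                    break
                hits += 1
            return (hits, -self.load[replica])

        best = max(self.cached, key=score)
        self.cached[best].update(hashes)  # serving the request caches this prefix
        self.load[best] += 1
        return best

router = KVAwareRouter(["replica-0", "replica-1"])
system_prompt = list(range(256))                 # stand-in token IDs
first = router.route(system_prompt + [1, 2, 3])
again = router.route(system_prompt + [9, 9, 9])  # shares the 256-token prefix
print(first, again)  # the second request follows the warm cache
```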