How to serve Kimi-K2-Instruct on Lambda with vLLM
Blog post from Lambda
Deploying Kimi-K2-Instruct, Moonshot AI's one-trillion-parameter Mixture-of-Experts (MoE) language model, on Lambda with vLLM enables efficient multi-GPU inference for a model whose memory footprint exceeds a terabyte, far beyond any single-GPU or home setup. Running on an 8× NVIDIA Blackwell GPU instance, users can take advantage of the model's strengths in fast reasoning, long-context understanding, and robust tool use. The deployment walks through spinning up a GPU instance, launching a vLLM server, and running benchmarks to collect metrics such as time-to-first-token and throughput. It also covers configurations that optimize performance, such as enabling automatic tool choice and using sleep mode to free GPU resources between workloads. The result is a replicable framework for serving other large models that do not fit on a single GPU, with scalable and robust performance.
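To make the benchmarking step concrete, the metrics above boil down to timing token arrivals on a streamed completion against vLLM's OpenAI-compatible endpoint. The sketch below is illustrative, not from the post: the endpoint URL, model name, and prompt are assumptions, and the `stream_metrics` helper is a hypothetical function introduced here for clarity.

```python
import time


def stream_metrics(start, arrivals):
    """Compute time-to-first-token and decode throughput (tokens/s)
    from a request start time and per-token arrival timestamps."""
    ttft = arrivals[0] - start
    decode = arrivals[-1] - arrivals[0]
    tps = (len(arrivals) - 1) / decode if decode > 0 else 0.0
    return ttft, tps


if __name__ == "__main__":
    # Hypothetical usage against a running vLLM server; the base URL,
    # model name, and API key below are assumptions, not from the post.
    from openai import OpenAI  # vLLM exposes an OpenAI-compatible API

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    start = time.perf_counter()
    arrivals = []
    stream = client.chat.completions.create(
        model="kimi-k2-instruct",
        messages=[{"role": "user", "content": "Explain MoE routing briefly."}],
        stream=True,
    )
    for chunk in stream:
        # Record an arrival timestamp for each chunk that carries text.
        if chunk.choices and chunk.choices[0].delta.content:
            arrivals.append(time.perf_counter())
    ttft, tps = stream_metrics(start, arrivals)
    print(f"TTFT: {ttft:.3f}s  throughput: {tps:.1f} tok/s")
```

Timing chunk arrivals client-side slightly overstates TTFT (it includes network latency), but it is a serviceable first-pass measurement before reaching for a dedicated benchmark harness.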