How to serve Kimi-K2-Instruct on Lambda with vLLM
Blog post from Lambda
Deploying Kimi-K2-Instruct, Moonshot AI's one-trillion-parameter Mixture-of-Experts (MoE) language model, on Lambda with vLLM enables efficient multi-GPU inference for a model whose memory footprint exceeds a terabyte, far beyond any single-GPU or home setup. Running on an 8× NVIDIA Blackwell GPU instance, users can take advantage of the model's strengths in fast reasoning, long-context understanding, and robust tool use. The deployment walks through spinning up a GPU instance, launching a vLLM server, and running benchmarks to collect metrics such as time-to-first-token and throughput. It also covers configurations that optimize performance, such as enabling automatic tool choice and using sleep mode to free GPU resources between workloads. The result is a replicable framework for serving other large models that do not fit on a single GPU, with scalable and robust performance.
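To make the benchmarking step concrete, the metrics above boil down to timing token arrivals on a streamed completion against vLLM's OpenAI-compatible endpoint. The sketch below is illustrative, not from the post: the endpoint URL, model name, and prompt are assumptions, and the `stream_metrics` helper is a hypothetical function introduced here for clarity.

```python
import time


def stream_metrics(start, arrivals):
    """Compute time-to-first-token and decode throughput (tokens/s)
    from a request start time and per-token arrival timestamps."""
    ttft = arrivals[0] - start
    decode = arrivals[-1] - arrivals[0]
    tps = (len(arrivals) - 1) / decode if decode > 0 else 0.0
    return ttft, tps


if __name__ == "__main__":
    # Hypothetical usage against a running vLLM server; the base URL,
    # model name, and API key below are assumptions, not from the post.
    from openai import OpenAI  # vLLM exposes an OpenAI-compatible API

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    start = time.perf_counter()
    arrivals = []
    stream = client.chat.completions.create(
        model="kimi-k2-instruct",
        messages=[{"role": "user", "content": "Explain MoE routing briefly."}],
        stream=True,
    )
    for chunk in stream:
        # Record an arrival timestamp for each chunk that carries text.
        if chunk.choices and chunk.choices[0].delta.content:
            arrivals.append(time.perf_counter())
    ttft, tps = stream_metrics(start, arrivals)
    print(f"TTFT: {ttft:.3f}s  throughput: {tps:.1f} tok/s")
```

Timing chunk arrivals client-side slightly overstates TTFT (it includes network latency), but it is a serviceable first-pass measurement before reaching for a dedicated benchmark harness.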