Company
Date Published
Author
Modular Team
Word count
1042
Language
English
Hacker News points
None

Summary

MAX 25.2 is a significant update aimed at improving the performance and deployment of large language models (LLMs) without relying on CUDA. It supports over 500 GenAI models and runs across multiple GPUs on NVIDIA H100 and H200 hardware. Improved scheduling, batching, and caching lower total cost of ownership (TCO) and raise throughput, making MAX 12% faster than previous benchmarks. Its ultra-slim containers are 80% smaller than traditional NVIDIA containers, which shortens deployment times, and the integration of Mojo allows custom, high-performance GPU programming. The update also introduces GPTQ quantization, which significantly reduces the memory needed to run large models. By rebuilding the AI stack from scratch, MAX aims to provide an intuitive "it just works" experience that eliminates CUDA-related issues. This makes it accessible for diverse AI applications and flexible for developers and researchers who want to fully leverage GPU capabilities.
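
To give a rough sense of why weight quantization such as GPTQ reduces memory usage, the sketch below estimates the memory taken by model weights alone at different bit widths. It is illustrative arithmetic, not MAX's implementation: the function name `weight_memory_gb`, the 70-billion-parameter model size, and the 4-bit width are assumptions chosen for the example, and the estimate ignores quantization scales, activations, and the KV cache.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes).

    Hypothetical helper for illustration; ignores quantization metadata
    (scales, zero points), activations, and the KV cache.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    params = 70_000_000_000  # assumed model size, not taken from the source

    fp16 = weight_memory_gb(params, 16)  # 16-bit floating-point weights
    int4 = weight_memory_gb(params, 4)   # 4-bit weights, a common GPTQ setting

    print(f"16-bit weights: {fp16:.1f} GB")  # 140.0 GB
    print(f"4-bit weights:  {int4:.1f} GB")  # 35.0 GB
    print(f"reduction:      {fp16 / int4:.0f}x")
```

Under these assumptions the weights shrink by a factor of four, which is the kind of saving that lets a large model fit on fewer GPUs. Actual savings depend on the group size and metadata the quantization scheme stores.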