Achieving State-of-the-Art Performance on AMD MI355 — in Just 14 Days
Blog post from Modular
In preparation for a demonstration at AMD's Media Tech Day, Modular collaborated with AMD and TensorWave to bring up and optimize its AI software stack on the AMD Instinct MI355 in just two weeks, even though the team initially had no access to the hardware. Because the stack is architecture-agnostic, Modular could adapt it quickly and reach state-of-the-art performance by taking advantage of hardware features such as larger tensor-core tile sizes and increased shared memory. The effort involved only two engineers, and MAX outperformed AMD's optimized vLLM fork by up to 2.2x across multiple workloads while remaining portable across different AMD hardware. The result showed that Modular's architecture enables rapid hardware bringup without excessive work hours or crunch time, and the demonstration at the event reflected a successful collaboration.