Stride and prejudice: How a 32-bit overflow corrupted a CUDA kernel (and stayed hidden for weeks)

Post Details

Company

AI21 Labs

Date Published

March 25, 2026

Author

Tamer Ghattas, DL Engineer

Word Count

2,000

Company Posts That Month

3

Language

English

Hacker News Points

-

Source URL

www.ai21.com/blog/vllm-cuda-integer-overflow

Summary

While training the Jamba 3B model with GRPO, a log-probability mismatch between rollout and training surfaced and was eventually traced to a silent integer overflow deep inside a vLLM CUDA kernel. The overflow occurred once cache slots exceeded approximately 47,935: pointer arithmetic in the Mamba-1 selective scan kernel was performed in 32 bits, which corrupted cache slots and sent writes to the wrong memory. Debugging meant isolating the issue from the complex RL training system, spotting structured patterns in the error spikes, and testing various configurations until the problem was pinned to the inference path. The fix required changing just two characters in the code. The case highlights the value of targeted debugging and isolation in distributed RL systems, where symptoms often obscure the true source of a problem.
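
To make the failure mode concrete, here is a minimal sketch of how a 32-bit offset computation can go wrong past a slot threshold. It is an illustration, not the vLLM kernel code: the real kernel is CUDA C++, and the summary does not state the per-slot stride. The stride of 44,800 elements is a hypothetical value picked only because it puts the first overflowing slot at 47,935, matching the approximate figure above. The helper names are also made up for this example.

```python
# Models signed 32-bit offset arithmetic in plain Python (whose ints never overflow).
# STRIDE is a hypothetical value, not taken from the source or from vLLM.

INT32_MAX = 2**31 - 1
STRIDE = 44_800  # assumed elements per cache slot, for illustration only


def wrap_int32(value: int) -> int:
    """Reduce an integer to signed 32-bit range using two's-complement wraparound."""
    return (value + 2**31) % 2**32 - 2**31


def offset_32bit(slot: int, stride: int) -> int:
    """Offset as a 32-bit multiply would typically produce it on real hardware."""
    return wrap_int32(slot * stride)


def offset_64bit(slot: int, stride: int) -> int:
    """Offset when the multiply is carried out in a wide enough integer type."""
    return slot * stride


def first_overflowing_slot(stride: int) -> int:
    """Smallest slot index whose offset no longer fits in a signed 32-bit integer."""
    return INT32_MAX // stride + 1


if __name__ == "__main__":
    threshold = first_overflowing_slot(STRIDE)
    for slot in (threshold - 1, threshold):
        print(slot, offset_32bit(slot, STRIDE), offset_64bit(slot, STRIDE))
```

Under these assumptions, slot 47,934 yields the same offset in both versions, while slot 47,935 yields a negative 32-bit offset, so a write indexed by it would land outside the intended slot. Widening the index to 64 bits before the multiply removes the wraparound. Whether the actual two-character fix took that form is not stated in the summary.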

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Reinforcement learning	2	121	52	29	-1%