Stride and prejudice: How a 32-bit overflow corrupted a CUDA kernel (and stayed hidden for weeks)

Blog post from AI21 Labs

Post Details
Company: AI21 Labs
Date Published:
Author: Tamer Ghattas, DL Engineer
Word Count: 2,000
Company Posts That Month: 3
Language: English
Hacker News Points: -
Summary

While training the Jamba 3B model with GRPO, a log-probability mismatch between rollout and training was discovered. It was eventually traced to a silent integer overflow deep inside a vLLM CUDA kernel, which occurred once cache slots exceeded approximately 47,935. The bug was a 32-bit overflow in the pointer arithmetic of the Mamba-1 selective scan kernel: the wrapped offset sent memory writes to the wrong addresses and corrupted cache slots. Debugging proceeded by isolating the problem from the surrounding RL training system, recognizing structured patterns in the error spikes, and testing configurations until the fault was pinned to the inference path. The fix changed just two characters of code. The case shows the value of targeted debugging and isolation in distributed RL systems, where symptoms often obscure the true source of a problem.
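
The overflow mechanism described above can be illustrated with a small simulation. This is a minimal sketch, not the vLLM kernel code: it assumes the offset is computed as slot index times a per-slot stride, and it uses a hypothetical stride of 44,800 elements, chosen only because it places the overflow threshold near the slot figure reported in the summary. The actual stride and the exact two-character fix are not given here. The helper names (`wrap_int32`, `offset_32bit`, `offset_64bit`, `first_overflowing_slot`) are likewise illustrative.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a two's-complement signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def offset_32bit(slot: int, stride: int) -> int:
    """Element offset as it would come out if the product were held in 32 bits."""
    return wrap_int32(slot * stride)


def offset_64bit(slot: int, stride: int) -> int:
    """Element offset with a wide enough integer (Python ints do not overflow)."""
    return slot * stride


def first_overflowing_slot(stride: int) -> int:
    """Smallest slot index whose offset no longer fits in a signed 32-bit int."""
    return INT32_MAX // stride + 1


# Hypothetical stride (assumption, not taken from the post).
STRIDE = 44_800

if __name__ == "__main__":
    threshold = first_overflowing_slot(STRIDE)
    for slot in (threshold - 1, threshold):
        print(slot, offset_32bit(slot, STRIDE), offset_64bit(slot, STRIDE))
```

Below the threshold the 32-bit and wide offsets agree, which is consistent with a bug that stays hidden until the cache grows large enough. At the threshold the 32-bit product wraps negative, so a write based on it would land outside the intended slot.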

Trends Found in this Post
Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM
Reinforcement learning | 2 | 121 | 52 | 29 | -1%