When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug
Blog post from Cloudflare
CUBIC, the default congestion controller in Linux, governs the sending rate of TCP and QUIC connections, including those served by quiche, Cloudflare's QUIC implementation. A bug was discovered in which CUBIC's congestion window stayed stuck at its minimum after a congestion collapse event. The cause was a Linux kernel change intended to align CUBIC with the app-limited exclusions in RFC 9438. The bug surfaced as unexpected test failures, which revealed that the recovery logic in quiche misinterpreted the state of the connection, particularly during periods of high packet loss. It was traced to a miscalculation of the idle period: the congestion recovery start time was adjusted by the wrong amount, which made the connection oscillate rapidly between the recovery and congestion avoidance states. The fix measures idle duration from the last ACK received rather than from the last packet sent, which stabilizes the recovery process and restores normal CUBIC behavior. The change is small and has been incorporated into quiche, and it shows how complex congestion control algorithms are and how much they depend on precise timing.
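To make the idle-period miscalculation described above concrete, here is a minimal sketch in Rust. It is not quiche's actual code: the `Timestamps` struct, the function names, and the example timings are all hypothetical, and timestamps are modeled as plain offsets from the start of the connection. It assumes a sender whose window is at its minimum, so it sends a packet, waits about one round trip for the ACK, and sends again as soon as the ACK arrives.

```rust
use std::time::Duration;

/// Hypothetical, simplified connection timestamps, expressed as offsets
/// from the start of the connection. Not quiche's real data structure.
#[derive(Debug, Clone, Copy)]
struct Timestamps {
    last_sent: Duration,      // when the most recent packet was sent
    last_ack: Duration,       // when the most recent ACK was received
    recovery_start: Duration, // congestion recovery start time
}

/// Idle time measured from the last packet sent. Per the post, this is the
/// measurement that misfires: the time spent waiting for an ACK counts as "idle".
fn idle_since_last_send(ts: &Timestamps, now: Duration) -> Duration {
    now.saturating_sub(ts.last_sent)
}

/// Idle time measured from the last ACK received. Per the post, this is the
/// measurement the fix uses.
fn idle_since_last_ack(ts: &Timestamps, now: Duration) -> Duration {
    now.saturating_sub(ts.last_ack)
}

/// Shift the recovery start time forward by the measured idle period
/// (an assumed, simplified form of the adjustment).
fn shifted_recovery_start(ts: &Timestamps, idle: Duration) -> Duration {
    ts.recovery_start + idle
}

fn main() {
    // Assumed scenario: packet sent at 900 ms, its ACK arrives at 1000 ms,
    // and the next packet goes out immediately at 1000 ms.
    let ts = Timestamps {
        last_sent: Duration::from_millis(900),
        last_ack: Duration::from_millis(1000),
        recovery_start: Duration::from_millis(50),
    };
    let now = Duration::from_millis(1000);

    let from_send = idle_since_last_send(&ts, now);
    let from_ack = idle_since_last_ack(&ts, now);

    println!("idle measured from last send: {:?}", from_send);
    println!("idle measured from last ACK:  {:?}", from_ack);
    println!(
        "recovery start shifted using last send: {:?}",
        shifted_recovery_start(&ts, from_send)
    );
    println!(
        "recovery start shifted using last ACK:  {:?}",
        shifted_recovery_start(&ts, from_ack)
    );
}
```

In this scenario the connection was never idle, it was only waiting for an ACK. Measuring from the last send still reports a full round trip of idle time and moves the recovery start time forward, while measuring from the last ACK reports zero and leaves it alone.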