The Windows DLL loader lock: how a Rust thread can hang your JVM
Blog post from QuestDB
QuestDB, an open-source time-series database, hit a sporadic hang in its Windows continuous integration (CI) pipeline. The cause was a deadlock that spanned four components: the Java Virtual Machine's (JVM) garbage-collection safepoint mechanism, Rust's thread-local storage (TLS) destructors, Java Native Interface (JNI) thread attachment, and the Windows loader lock.

The deadlock arose when Windows held the loader lock while a Rust thread was terminating. That thread then blocked at the JVM's safepoint barrier because a garbage collection was in progress, while new Java threads could not finish initializing because they needed the same loader lock. The result was a lock inversion between the loader lock and the safepoint.

The team diagnosed the hang with ProcDump, WinDbg, and safepoint timeout logging. The fix was to detach threads from the JVM explicitly, before their TLS destructors ran, and the team opened upstream issues against jni-rs to address the underlying library design flaw. The experience shows how hard deadlocks that cross runtime boundaries are to debug, and why JNI code written in Rust needs explicit thread management on Windows.
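
The idea of detaching before TLS destructors run can be sketched with the Rust standard library alone. This is a minimal illustration, not the actual QuestDB or jni-rs code: `Attachment`, `attach`, `detach_explicitly`, and `run_worker` are hypothetical names, and a boolean flag stands in for the real JNI `DetachCurrentThread` call. It only shows where the cleanup happens: in ordinary code before the thread returns, or in a TLS destructor during thread exit (the phase in which, per the post, Windows holds the loader lock).

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a JNI thread attachment; not the jni-rs API.
struct Attachment {
    detached: bool,
    /// Set when the detach happens in the TLS destructor instead of earlier.
    detached_in_destructor: Arc<AtomicBool>,
}

impl Drop for Attachment {
    fn drop(&mut self) {
        if !self.detached {
            // Real code would call the JNI DetachCurrentThread function here.
            // For a value still stored in TLS, this runs during thread exit.
            self.detached_in_destructor.store(true, Ordering::SeqCst);
        }
    }
}

thread_local! {
    // Per-thread slot holding the attachment, if any.
    static ATTACHMENT: RefCell<Option<Attachment>> = RefCell::new(None);
}

/// Simulates attaching the current thread and caching the attachment in TLS.
fn attach(flag: Arc<AtomicBool>) {
    ATTACHMENT.with(|slot| {
        *slot.borrow_mut() = Some(Attachment {
            detached: false,
            detached_in_destructor: flag,
        });
    });
}

/// Takes the attachment out of TLS and detaches in ordinary code,
/// so the TLS destructor has nothing left to do at thread exit.
fn detach_explicitly() {
    let taken = ATTACHMENT.with(|slot| slot.borrow_mut().take());
    if let Some(mut attachment) = taken {
        // Real code would call DetachCurrentThread here.
        attachment.detached = true;
    }
}

/// Runs one worker thread and reports whether the detach was left
/// to the TLS destructor.
fn run_worker(detach_before_exit: bool) -> bool {
    let flag = Arc::new(AtomicBool::new(false));
    let flag_in_thread = Arc::clone(&flag);
    thread::spawn(move || {
        attach(flag_in_thread);
        // ... native work that calls into the JVM would go here ...
        if detach_before_exit {
            detach_explicitly();
        }
    })
    .join()
    .expect("worker thread panicked");
    flag.load(Ordering::SeqCst)
}

fn main() {
    println!("explicit detach, destructor did the work: {}", run_worker(true));
    println!("no explicit detach, destructor did the work: {}", run_worker(false));
}
```

With the explicit detach, the destructor is expected to find an empty slot and do nothing; without it, the simulated detach is deferred to thread exit, which is the pattern the post identifies as unsafe on Windows.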