A Trip Down Memory Lane: How We Resolved a Memory Leak When pprof Failed Us
Blog post from WarpStream
A memory leak in the WarpStream control plane first showed up as a linear increase in the HeapInUse metric. Comparing heap profiles with pprof revealed that FileMetadata objects were being retained unexpectedly. These objects are linked to compaction jobs scheduled by the control plane, and they were being kept alive by a goroutine leak in the deadscanner scheduler. The deadscanner scheduler, which removes untracked files from the object store, shares a job queue with the compaction scheduler and so was inadvertently holding onto job references. The leak occurred because the deadscanner scheduler kept running after the job actor shut down: backpressure from a full queue left it spinning in a loop that did not handle context cancellation properly. The fix was to make the job queue submission function check for context cancellation before proceeding, and the memory leak was resolved once the patch was deployed. The case shows the value of using both broad and detailed diagnostic tools when debugging complex systems, since the initial profiling did not reveal the issue until specific components were scrutinized.