Python Zebra Stacks

Blog post from Polar Signals

Post Details
Word Count: 2,220
Language: English
Summary

This post examines why stack traces were truncated when profiling PyTorch Lightning training workloads with eBPF, and traces the problem to two issues in stack unwinding.

The first issue is read errors. In long-running workloads, memory that the unwinder needs to read may have been paged out, and the read fails. The mitigation is to record the pointers that faulted and continue the stack walk, leaving the page faults to be resolved from user space, where the OS can page the memory back in.

The second issue is the BPF tail call limit. Each transition between Python and native code cost a tail call between the two unwinders, so a stack that switches between the two often enough ran into the limit and was truncated. Combining the Python and native unwinders into a single BPF program removes the tail calls spent on those transitions, so the full stack can be unwound as long as the walk fits within the instruction budget. Loop bounds are tuned through RODATA variables so that the budget is used well in both debug-on and debug-off modes, which allows deeper stacks to be processed.

The fix is not specific to PyTorch: it improves eBPF profiling of large stacks across languages.
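The effect of the tail call limit can be illustrated with a small simulation. This is only a sketch in plain Python, not BPF code and not the profiler's implementation: the function names, the stripe layout, and the `frame_budget` parameter are hypothetical, and the model assumes that every switch between Python and native frames costs exactly one tail call. The limit of 33 is the value of `MAX_TAIL_CALL_CNT` in current Linux kernels.

```python
MAX_TAIL_CALLS = 33  # MAX_TAIL_CALL_CNT in current Linux kernels


def make_alternating_stack(stripes, frames_per_stripe=3):
    """Build a stack whose frames alternate between Python and native runs."""
    stack = []
    for i in range(stripes):
        kind = "python" if i % 2 == 0 else "native"
        stack.extend([kind] * frames_per_stripe)
    return stack


def unwind_separate(stack, max_tail_calls=MAX_TAIL_CALLS):
    """Model two unwinder programs (assumption: one tail call per switch).

    Once the tail call budget is used up, the next switch cannot happen
    and the rest of the stack is lost.
    """
    unwound = []
    tail_calls = 0
    current = None
    for kind in stack:
        if current is not None and kind != current:
            if tail_calls == max_tail_calls:
                break  # limit reached: the stack is truncated here
            tail_calls += 1
        current = kind
        unwound.append(kind)
    return unwound


def unwind_combined(stack, frame_budget=128):
    """Model a single program that handles both frame kinds in one loop.

    No tail calls are spent on switches; only a hypothetical per-sample
    frame budget (standing in for the instruction budget) limits depth.
    """
    return stack[:frame_budget]


if __name__ == "__main__":
    stack = make_alternating_stack(stripes=40)  # 120 frames, 39 switches
    print(len(unwind_separate(stack)))  # 102: truncated after 33 switches
    print(len(unwind_combined(stack)))  # 120: the whole stack fits
```

In this model the separate unwinders lose every frame past the 34th stripe, however shallow each stripe is, while the combined unwinder is limited only by total depth. That mirrors the trade the summary describes: the constraint moves from the number of transitions to the instruction budget of a single program.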