What’s New in Prometheus 2.8: WAL-Based Remote Write
Blog post from Grafana Labs
Prometheus 2.8 reworks the remote_write API around the write-ahead log (WAL) so that metrics are not lost when the network or the remote endpoint has problems. Previously, remote write relied on a small in-memory buffer: if the remote endpoint was unreachable, Prometheus either dropped data once the buffer filled or used excessive memory trying to hold on to it. In the new approach, transactions are written to the WAL before data is committed to long-term storage, and remote write sends from that log. Prometheus can therefore pause and retry sending without losing data or running into memory problems.

The change makes memory usage more predictable, because the amount of data that can be buffered now depends on disk size rather than on memory. It also saves work, since data is encoded once and the encoded form is reused across send attempts.

Although it was initially intended as a short project, the implementation had to address a range of edge cases involving locking, parallelization, and log file integrity. The update has been well received, with users reporting reduced CPU usage and more predictable memory usage. Some issues with WAL corruption have arisen, and solutions are being explored. Feedback is encouraged via GitHub to help refine the update further.
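
To make the pause-and-retry behaviour concrete, here is a minimal Python sketch of the general idea: samples are appended to a log on disk, and a shipper advances its read position only after a send succeeds. This is a toy model, not Prometheus's implementation. The real WAL is a segmented binary format, and the names `WalShipper`, `append_sample`, and the `send` callback are hypothetical, introduced only for illustration. The sketch also re-reads records on every attempt, so it does not model the encode-once optimization.

```python
import json
import os
import tempfile


def append_sample(wal_path, name, value, ts):
    """Append one sample as a JSON line and force it to disk."""
    record = json.dumps({"name": name, "value": value, "ts": ts}).encode() + b"\n"
    with open(wal_path, "ab") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())


class WalShipper:
    """Toy shipper: tails an append-only log and advances its read offset
    only after the (hypothetical) send callback reports success."""

    def __init__(self, wal_path, send):
        self.wal_path = wal_path
        self.send = send    # assumed signature: list of dicts -> bool
        self.offset = 0     # byte position of the next unsent record

    def ship_once(self, max_batch=100):
        """Try to send the next batch; return how many records were shipped."""
        batch, new_offset = [], self.offset
        with open(self.wal_path, "rb") as f:
            f.seek(self.offset)
            for _ in range(max_batch):
                line = f.readline()
                if not line.endswith(b"\n"):  # end of file or partial record
                    break
                batch.append(json.loads(line))
                new_offset += len(line)
        if not batch or not self.send(batch):
            return 0  # offset unchanged: the data is still on disk for a retry
        self.offset = new_offset
        return len(batch)


def demo():
    received = []
    endpoint = {"up": False}

    def send(batch):
        # Stand-in for a remote endpoint that may be unreachable.
        if not endpoint["up"]:
            return False
        received.extend(batch)
        return True

    with tempfile.TemporaryDirectory() as d:
        wal = os.path.join(d, "wal.log")
        for ts in range(3):
            append_sample(wal, "up", 1.0, ts)
        shipper = WalShipper(wal, send)
        first = shipper.ship_once()    # endpoint down: nothing shipped
        endpoint["up"] = True
        second = shipper.ship_once()   # retry succeeds with the same records
        third = shipper.ship_once()    # caught up: nothing left to send
    return first, second, third, received


if __name__ == "__main__":
    print(demo())
```

In this sketch the memory held by the shipper is bounded by `max_batch` regardless of how long the endpoint is down; the backlog lives in the file, which mirrors the point that buffering capacity depends on disk rather than memory.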