How Cortex uses the Prometheus Write-Ahead Log (WAL) to prevent data loss
Blog post from Grafana Labs
Cortex initially had a significant flaw in its ingester service: each ingester held incoming series data in memory before writing it to long-term storage, so a crashed ingester could lose that data. To address this, Cortex added a Write-Ahead Log (WAL) similar to the one in Prometheus' TSDB. Ingesters log incoming events to a file and, on crash recovery, replay them to restore their in-memory state. Under heavy load, however, the WAL alone was insufficient, which prompted the addition of checkpoints that store compressed data chunks on disk for faster replay. Originally experimental, these features have since been thoroughly tested and improved, including optimized disk writes and a switch to the Prometheus TSDB WAL format, and are poised to lose their experimental status. Those interested in adopting the WAL with Cortex can refer to the production guide, and a Grafana Labs webinar featuring key contributors offers further insight into Cortex.
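
To make the log, replay, and checkpoint idea concrete, here is a minimal sketch in Python. It is not Cortex's implementation and does not use the Prometheus TSDB WAL format: the class name TinyIngester, the file names, and the JSON record layout are all hypothetical, and the checkpoint is plain JSON rather than compressed chunks. It only illustrates the general pattern of logging each write before applying it in memory, periodically snapshotting state so the log can be shortened, and rebuilding memory from checkpoint plus log after a restart.

```python
import json
import os
import tempfile


class TinyIngester:
    """Toy in-memory series store backed by a WAL and a checkpoint.

    Illustrative only: names and on-disk layout are assumptions,
    not Cortex's or Prometheus' actual formats.
    """

    def __init__(self, directory):
        self.wal_path = os.path.join(directory, "wal.log")
        self.checkpoint_path = os.path.join(directory, "checkpoint.json")
        self.series = {}  # series name -> list of [timestamp, value]
        self._recover()
        self._wal = open(self.wal_path, "a", encoding="utf-8")

    def _recover(self):
        # 1. Load the most recent checkpoint, if one exists.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path, encoding="utf-8") as f:
                self.series = json.load(f)
        # 2. Replay WAL records written after that checkpoint.
        if os.path.exists(self.wal_path):
            with open(self.wal_path, encoding="utf-8") as f:
                for line in f:
                    try:
                        rec = json.loads(line)
                    except json.JSONDecodeError:
                        break  # torn final record from a crash mid-write
                    self.series.setdefault(rec["series"], []).append(
                        [rec["ts"], rec["value"]]
                    )

    def append(self, series, ts, value):
        # Log the event durably first, then apply it in memory.
        record = {"series": series, "ts": ts, "value": value}
        self._wal.write(json.dumps(record) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self.series.setdefault(series, []).append([ts, value])

    def checkpoint(self):
        # Snapshot in-memory state to a temp file, then rename atomically.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.series, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.checkpoint_path)
        # The checkpoint now covers everything in the WAL, so truncate it.
        # Simplification: a crash between the rename and this truncate
        # would replay records twice; this sketch does not guard against it.
        self._wal.close()
        self._wal = open(self.wal_path, "w", encoding="utf-8")

    def close(self):
        self._wal.close()


def demo():
    with tempfile.TemporaryDirectory() as d:
        first = TinyIngester(d)
        first.append("cpu", 1, 0.5)
        first.checkpoint()            # state so far goes to the checkpoint
        first.append("cpu", 2, 0.7)   # this sample lives only in the WAL
        first.close()                 # stand-in for a crash: memory is gone

        second = TinyIngester(d)      # recovery: checkpoint + WAL replay
        recovered = second.series
        second.close()
        return recovered


if __name__ == "__main__":
    print(demo())
```

In this sketch, recovery after a checkpoint only has to replay the records appended since the snapshot, which is the intuition behind using checkpoints to speed up replay under heavy write load.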