Company
Date Published
Author
Danny Kopping
Word count
3809
Language
English
Hacker News points
None

Summary

Grafana Cloud Logs ran into trouble with its Grafana Loki logs database when traffic exceeded the limits of cloud vendors' object storage services, prompting a comprehensive redesign of its caching strategy. The service initially used memcached clusters sized for recent data, but diverging access patterns and excessive cache churn made the cache inefficient and forced costly retrievals from object storage. To address these issues, the team moved the cache to local SSDs, which increased capacity from 1.2TB to 50TB while reducing costs by 98%. This change, made possible by memcached's extstore feature, significantly improved cache hit rates, cut requests to object storage by 65%, and effectively eliminated rate-limiting issues. Although introducing disks increased latency, the solution proved effective: a deliberate engineering trade-off that lowered operational costs and improved overall performance.