Company
Date Published
Author
Colin Steele
Word count
934
Language
English
Hacker News points
None

Summary

A power outage rendered the sole data center hosting Dropbox's Grafana Loki deployment inaccessible, prompting the company to overhaul its logging infrastructure. The result is a petabyte-scale, multi-region logging platform that handles up to 6 GB of logs per second with a 30-day retention policy and stays available even when a data center fails. During the transition, the team addressed high cardinality and memory crashes by imposing strict label control and by adding stream-level controls that prevent any one service from overwhelming the system. Gradual deployment and testing were crucial, as was collaboration with Grafana Labs to improve performance by switching to a Prometheus-style database. The new system has become an integral part of Dropbox's observability stack, allowing the company to phase out its legacy logging system while maintaining high availability and reliability across its operations.