Company
Date Published
Author
Matt Dupree
Word count
1590
Language
English
Hacker News points
None

Summary

Heap, a product analytics platform, encountered a significant operational incident while attempting to roll out over one million indexes across its distributed Postgres cluster, which led to increased ingestion latency—delays in data visibility for users. The incident occurred when the creation of a new index type for their Effort Analysis feature inadvertently triggered the creation of additional core indexes, straining the system beyond its capacity. Despite precautions like scheduling and small-scale testing, the unexpected system stress highlighted deficiencies in their risk assessment approach, particularly the need for more comprehensive monitoring of index sync queue sizes and the impact of ambient syncs. The incident underscored critical lessons for managing large, distributed systems at scale, such as the importance of aligning testing conditions with real-world scenarios and understanding system load changes over time. As Heap continues to grow, it aims to refine its process by marking certain indexes for "backfill only" to prevent similar issues and to continuously adapt its practices to accommodate the evolving demands of its infrastructure.