Lessons From Rolling Out One Million Postgres Indexes

Post Details

Company

Heap

Date Published

Sept. 30, 2021

Author

Matt Dupree

Word Count

1,590

Language

English

Hacker News Points

-

Source URL

www.heap.io/blog/the-million-indexes-incident

Summary

Heap, a product analytics platform, encountered a significant operational incident while attempting to roll out over one million indexes across its distributed Postgres cluster, which led to increased ingestion latency—delays in data visibility for users. The incident occurred when the creation of a new index type for their Effort Analysis feature inadvertently triggered the creation of additional core indexes, straining the system beyond its capacity. Despite precautions like scheduling and small-scale testing, the unexpected system stress highlighted deficiencies in their risk assessment approach, particularly the need for more comprehensive monitoring of index sync queue sizes and the impact of ambient syncs. The incident underscored critical lessons for managing large, distributed systems at scale, such as the importance of aligning testing conditions with real-world scenarios and understanding system load changes over time. As Heap continues to grow, it aims to refine its process by marking certain indexes for "backfill only" to prevent similar issues and to continuously adapt its practices to accommodate the evolving demands of its infrastructure.