How we stopped xmin horizon blocking Postgres vacuuming: a deepdive

Post Details

Company

Trigger.dev

Date Published

Aug. 16, 2024

Author

Matt Aitken

Word Count

1,903

Language

English

Hacker News Points

-

Source URL

trigger.dev/blog/stopping-xmin-horizon-blocking-postgres-vacuuming

Summary

On August 15, 2024, a significant spike in CPU usage on the primary database led to prolonged processing times and caused some database transactions to fail. This created a backlog in the queues and a "System failure" status for certain processes. The root cause was a problem with Postgres vacuuming: the xmin horizon was preventing dead tuples from being cleaned up, and the accumulated dead tuples made sequential scans by Graphile Worker progressively more expensive. Terminating transactions and adjusting configuration did not resolve the issue. Normal operations were restored only after a failover to a read replica. The incident highlighted the need for better monitoring of dead tuples and vacuuming, and showed that failover should be considered as a viable fix for similar issues in the future. Steps are being taken to improve transactional guarantees between Redis and Postgres and to enhance monitoring so that the problem does not recur.
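As an illustration of the dead-tuple monitoring mentioned above, here is a minimal Python sketch. It is not taken from the post. The two SQL strings read the standard `pg_stat_user_tables` and `pg_stat_activity` views and would be run with whatever Postgres client is already in use. The helper names (`TableStats`, `dead_ratio`, `tables_needing_attention`), the thresholds, and the sample table names are hypothetical and would need tuning for a real workload.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Dead vs. live tuple counts per table, from the standard statistics view.
DEAD_TUPLE_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
"""

# Sessions whose snapshot may be holding back the xmin horizon, oldest first.
XMIN_HOLDERS_QUERY = """
SELECT pid, state, age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
"""


@dataclass
class TableStats:
    """One row of DEAD_TUPLE_QUERY (hypothetical container type)."""
    relname: str
    n_live_tup: int
    n_dead_tup: int


def dead_ratio(stats: TableStats) -> float:
    """Fraction of the table's tuples that are dead; 0.0 for an empty table."""
    total = stats.n_live_tup + stats.n_dead_tup
    return stats.n_dead_tup / total if total else 0.0


def tables_needing_attention(
    rows: Iterable[TableStats],
    max_ratio: float = 0.2,   # assumed alert threshold, not from the post
    min_dead: int = 10_000,   # assumed floor so tiny tables are ignored
) -> List[str]:
    """Return names of tables whose dead tuples exceed both thresholds."""
    return [
        r.relname
        for r in rows
        if r.n_dead_tup >= min_dead and dead_ratio(r) > max_ratio
    ]


if __name__ == "__main__":
    sample = [
        TableStats("jobs", 1_000, 50_000),    # hypothetical queue table
        TableStats("users", 100_000, 500),
    ]
    print(tables_needing_attention(sample))
```

A check like this only reports that dead tuples are piling up. Finding out why vacuum cannot remove them still means looking at what is holding the xmin horizon back, which is what the second query is meant to help with.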