How we stopped xmin horizon blocking Postgres vacuuming: a deep dive
Blog post from Trigger.dev
On August 15, 2024, a significant spike in CPU usage on the primary database caused prolonged processing times and the failure of some database transactions. This created a backlog in queues and a "System failure" status for certain processes. The root cause was a problem with Postgres vacuuming: the xmin horizon prevented vacuum from removing dead tuples, and the growing number of dead tuples made the sequential scans performed by Graphile Worker increasingly expensive. Terminating transactions and adjusting configuration did not resolve the issue. It persisted until a failover to a read replica was executed, which restored normal operations. The incident highlighted the need for better monitoring of dead tuples and vacuuming, and showed that failover should be considered as a viable fix for similar issues in the future. Steps are being taken to improve transactional guarantees between Redis and Postgres and to enhance monitoring so the problem does not recur.
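As a rough illustration of the kind of dead-tuple monitoring described above, the sketch below pairs two queries against standard Postgres statistics views (`pg_stat_user_tables` and `pg_stat_activity`) with a small helper that flags tables carrying a large share of dead tuples. It is not the monitoring Trigger.dev built: the thresholds, the `TableStats` type, the helper names, and the sample table names are all assumptions made for illustration, and the queries would need to be run through whatever Postgres client is already in use. Note that sessions are only one thing that can hold back the xmin horizon; replication slots and prepared transactions can as well.

```python
from dataclasses import dataclass

# Dead vs. live tuple counts per table, from the standard statistics view.
DEAD_TUPLES_SQL = """
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
"""

# Sessions whose snapshot may be holding back the xmin horizon, oldest first.
XMIN_HOLDERS_SQL = """
SELECT pid, state, xact_start, age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
"""


@dataclass(frozen=True)
class TableStats:
    """One row of DEAD_TUPLES_SQL (hypothetical container type)."""
    relname: str
    n_live_tup: int
    n_dead_tup: int


def dead_ratio(stats: TableStats) -> float:
    """Fraction of a table's tuples that are dead; 0.0 for an empty table."""
    total = stats.n_live_tup + stats.n_dead_tup
    return stats.n_dead_tup / total if total else 0.0


def tables_needing_attention(rows, min_dead=10_000, max_ratio=0.2):
    """Return tables over both thresholds, worst first.

    The thresholds are illustrative assumptions, not recommended values.
    """
    flagged = [
        r for r in rows
        if r.n_dead_tup >= min_dead and dead_ratio(r) >= max_ratio
    ]
    return sorted(flagged, key=lambda r: r.n_dead_tup, reverse=True)
```

A check like this could run on a schedule and raise an alert when any table is flagged. If dead tuples keep growing while autovacuum is running, the rows returned by `XMIN_HOLDERS_SQL` are one place to look for what vacuum is waiting on.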