When upserts don't update but still write: Debugging Postgres performance at scale
Blog post from Datadog
At Datadog, a new upsert query designed to track the last time hosts were seen unexpectedly doubled disk writes and quadrupled Write-Ahead Logging (WAL) syncs due to row locks, despite most operations being no-ops. This occurred because even non-updating upserts lock rows, which is recorded in the WAL. With Postgres's single-writer limitation, this increased disk usage was unsustainable, especially at the scale of Datadog's operations. To address this, Datadog's engineers analyzed the WAL using the pg_walinspect extension and rewrote the query to avoid locking rows unnecessarily by employing a data-modifying Common Table Expression (CTE). This change reduced overhead by preventing unwanted WAL entries while maintaining the system's correctness. The exploration highlighted the importance of understanding WAL behavior and optimizing queries to manage database performance effectively, even for operations expected to be no-ops.