OTel incident post-mortem

Blog post from Trigger.dev

Post Details
Company: Trigger.dev
Author: Eric Allam
Word Count: 1,798
Language: English
Summary

Between November 28 and December 1, 2025, ingestion of OpenTelemetry trace data into a ClickHouse server failed intermittently. The cause was a partition key design flaw: the table was partitioned by the start time of tasks rather than by the time the data was received. Because ClickHouse merges parts only within a single partition, late-arriving events kept creating tiny parts in old partitions, and the growing number of parts overwhelmed the server’s merge capacity. Only observability data was affected; task execution continued normally. The resolution involved creating a new table with a corrected partition key, scaling the infrastructure, and implementing fixes to prevent late-arriving events from creating parts in old partitions. The lessons learned: partition by insertion time so that late-arriving data cannot land in old partitions, and run comprehensive tests that include long-running tasks and edge cases.
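
To make the partition key problem concrete, the following is a minimal sketch rather than code from the post. It assumes daily partitions and uses hypothetical field names (`start_time`, `inserted_at`); the actual table schema and partition granularity are not given in the summary. It counts how many partitions a single ingestion batch touches under each choice of key. ClickHouse writes at least one part per partition touched by an insert, so the count is a lower bound on the parts created.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    start_time: datetime   # when the task began (hypothetical field name)
    inserted_at: datetime  # when the server received the event (hypothetical)


def partitions_touched(events, key):
    """Count events per daily partition for the given partition-key function."""
    return Counter(key(e).strftime("%Y-%m-%d") for e in events)


now = datetime(2025, 12, 1, 12, 0)
# One ingestion batch: tasks that started 0 to 9 days ago all report events now.
batch = [
    Event(start_time=now - timedelta(days=d), inserted_at=now)
    for d in range(10)
]

by_start = partitions_touched(batch, lambda e: e.start_time)
by_insert = partitions_touched(batch, lambda e: e.inserted_at)

print(len(by_start))   # 10: one small part per old partition for this batch
print(len(by_insert))  # 1: the whole batch can land as a single part
```

Under the start-time key, every batch that contains events from long-running tasks scatters small parts across old partitions. Under the insertion-time key, new parts only ever land in the current partition, where the background merges can keep up with them.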