Content Deep Dive

The OOM that was not your agent

Blog post from Pydantic

Post Details
Company: Pydantic
Date Published:
Author: -
Word Count: 612
Company Posts That Month: 22
Language: English
Hacker News Points: -
Summary

A document classifier built on the Anthropic SDK consistently fails at 2am with out-of-memory (OOM) errors, even though the code, model, and traffic have not changed. Memory usage reaches 95% just before each failure, but the climb starts before the classifier job does, which points to another process on the same host. The culprit is a Postgres vacuum job that spikes disk I/O and adds to the memory load. The diagnosis comes from OpenTelemetry (OTel) host metrics for CPU, memory, and disk, linked to the classifier's trace data. Rescheduling the vacuum and capping memory usage resolves the problem. The host metrics are gathered by a standard, non-proprietary OTel collector, which also works on platforms such as Kubernetes. The post argues that keeping host metrics and traces in one place makes this kind of cross-process problem easier to find, notes that the setup is open to anyone already using OpenTelemetry, and invites readers to try the free tier of Logfire.
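The key inference in the summary, that memory began rising before the job started, can be sketched in a few lines of Python. This is only an illustration of the reasoning, not code from the post: the `MemorySample` shape, the function names, the 5-point margin, and the sample readings are all hypothetical, and real host metrics would come from an OTel collector rather than a hard-coded list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class MemorySample:
    """One memory reading (hypothetical shape, not an OTel type)."""
    timestamp: datetime
    used_percent: float


def rise_started_at(
    samples: Iterable[MemorySample],
    baseline_percent: float,
    margin: float = 5.0,
) -> Optional[datetime]:
    """Return the time of the first sample above baseline + margin, or None."""
    for sample in sorted(samples, key=lambda s: s.timestamp):
        if sample.used_percent > baseline_percent + margin:
            return sample.timestamp
    return None


def rise_precedes_job(
    samples: Iterable[MemorySample],
    job_start: datetime,
    baseline_percent: float,
    margin: float = 5.0,
) -> bool:
    """True if memory began climbing before the job started."""
    started = rise_started_at(samples, baseline_percent, margin)
    return started is not None and started < job_start


# Made-up readings around a 02:00 job start, for illustration only.
job_start = datetime(2024, 1, 1, 2, 0)
samples = [
    MemorySample(job_start - timedelta(minutes=20), 40.0),
    MemorySample(job_start - timedelta(minutes=10), 62.0),
    MemorySample(job_start - timedelta(minutes=5), 78.0),
    MemorySample(job_start + timedelta(minutes=3), 95.0),
]

print(rise_precedes_job(samples, job_start, baseline_percent=40.0))
```

If the check returns true, the job itself cannot be the only cause of the climb, which is the cue to look at what else is running on the host at that time.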

Trends Found in this Post
Trend          Post Mentions  Total Month Mentions  Posts  Companies  MoM
Kubernetes     4              1,993                 294    100        +1%
OpenTelemetry  4              701                   153    53         -26%
MCP            1              6,026                 689    188        -15%