The OOM that was not your agent
Blog post from Pydantic
A document classifier built on the Anthropic SDK consistently fails at 2am with out-of-memory (OOM) errors, even though the code, model, and traffic have not changed. Memory usage peaks at 95% just before each failure, but the climb begins before the classifier job starts, which indicates that another process is consuming resources. The cause is a Postgres vacuum job that spikes disk I/O and adds to the memory load. OpenTelemetry (OTel) host metrics for CPU, memory, and disk activity make the diagnosis possible, and two adjustments resolve the problem: rescheduling the vacuum and capping memory usage. The setup ties AI tooling and monitoring together by linking host metrics with trace data, so the classifier's behavior can be compared with what the machine was doing at the same moment. The host metrics are gathered and displayed through a non-proprietary OTel collector, which also works on platforms such as Kubernetes. The post argues that cohesive monitoring is what makes this kind of performance issue easy to identify, describes a setup that is accessible to anyone already using OpenTelemetry, and invites readers to try the free tier of Logfire.
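The key diagnostic step in the summary is noticing that memory starts climbing before the job begins. Here is a minimal Python sketch of that check. It is an illustration, not the post's code: `MemorySample`, `first_rise`, `rise_precedes_job`, the baseline, the margin, and every sample value are hypothetical, and the readings are assumed to have been exported already from whatever backend stores the host metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class MemorySample:
    """One host-level memory reading (hypothetical shape, not an OTel type)."""
    timestamp: datetime
    used_percent: float


def first_rise(
    samples: Sequence[MemorySample],
    baseline_percent: float,
    margin_percent: float = 5.0,
) -> Optional[datetime]:
    """Return the time of the earliest sample above baseline + margin, or None."""
    threshold = baseline_percent + margin_percent
    for sample in sorted(samples, key=lambda s: s.timestamp):
        if sample.used_percent > threshold:
            return sample.timestamp
    return None


def rise_precedes_job(
    samples: Sequence[MemorySample],
    job_start: datetime,
    baseline_percent: float,
    margin_percent: float = 5.0,
) -> bool:
    """True when memory left its baseline before the job started,
    which suggests a different process is responsible for the climb."""
    rise = first_rise(samples, baseline_percent, margin_percent)
    return rise is not None and rise < job_start


# Hypothetical readings around a 02:00 job start (values invented for illustration).
job_start = datetime(2024, 1, 1, 2, 0)
samples = [
    MemorySample(job_start - timedelta(minutes=20), 40.0),
    MemorySample(job_start - timedelta(minutes=10), 62.0),
    MemorySample(job_start, 78.0),
    MemorySample(job_start + timedelta(minutes=5), 95.0),
]

print(rise_precedes_job(samples, job_start, baseline_percent=40.0))
```

A fixed baseline plus margin keeps the sketch short. A real check would more likely derive the baseline from earlier readings and require several consecutive samples above the threshold, so that a single noisy reading does not count as the start of the climb.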
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| Kubernetes | 4 | 1,993 | 294 | 100 | +1% |
| OpenTelemetry | 4 | 701 | 153 | 53 | -26% |
| MCP | 1 | 6,026 | 689 | 188 | -15% |