Company
Date Published
Author
Engineering
Word count
593
Language
English
Hacker News points
None

Summary

The US LangSmith API experienced a 28-minute outage on May 1, 2025, caused by a combination of human error and gaps in observability. During a migration between certificate renewal automation technologies at the end of January, a conflicting DNS record was accidentally left in place, and it caused certificate renewals to fail in April. Once the root cause was identified, the record was deleted and a manual SSL certificate renewal was triggered, which restored connectivity. The incident exposed gaps in observability and led to preventive steps, including adding certificate expiry monitors and ensuring that all Kubernetes system component logs are ingested.
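
To make the idea of a certificate expiry monitor concrete, here is a minimal Python sketch using only the standard library. It is not the monitor described in the incident report, whose implementation the summary does not specify. The 14-day threshold, the function names, and the host passed to `check_host` are all assumptions chosen for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical alert threshold; the source does not state one.
WARN_DAYS = 14.0


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter field.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "May  1 00:00:00 2025 GMT". `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_alert(remaining_days: float, warn_days: float = WARN_DAYS) -> bool:
    """True when the certificate expires within the warning window."""
    return remaining_days <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_host(host: str) -> bool:
    """Return True if `host` should raise an alert.

    A certificate that has already expired fails verification during the
    handshake, so that error is treated as an alert as well.
    """
    try:
        not_after = fetch_not_after(host)
    except ssl.SSLCertVerificationError:
        return True
    return needs_alert(days_until_expiry(not_after, datetime.now(timezone.utc)))
```

Run on a schedule, a check like `check_host("api.example.com")` (a placeholder hostname) would flag a stalled renewal well before the certificate lapses, which is the gap this kind of monitor is meant to close.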