Company
Date Published
Author
-
Word count
549
Language
English
Hacker News points
None

Summary

On May 1, 2025, the US LangSmith API experienced a 28-minute outage caused by an expired SSL certificate, during which approximately 55% of requests failed. The incident was traced to a conflicting DNS record created during a January migration to new certificate renewal technologies; the record prevented renewals from succeeding throughout April. The expired certificate caused connection failures and user-reported issues before the root cause was identified. Contributing factors were human error and a lack of observability into certificate renewals. Once the cause was identified, the issue was resolved by deleting the DNS record and manually renewing the certificate. The incident highlighted gaps in monitoring and response processes, prompting LangSmith to add certificate expiry monitors, logs for Kubernetes components, and an internal dashboard for critical workflows to improve future reliability and incident response.
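As an illustration of what a certificate expiry monitor might look like, here is a minimal Python sketch using only the standard library. It is not LangSmith's actual monitor, which the summary does not describe. The function names, the 14-day threshold, and the host `api.example.com` are all hypothetical choices made for this example.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the served certificate's 'notAfter' string for host:port.

    With the default context, an already expired certificate fails the
    handshake with ssl.SSLCertVerificationError, which a monitor should
    treat as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, defaults to the current time)
    and the certificate's expiry. Negative if already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def needs_alert(days_left: float, threshold_days: float = 14.0) -> bool:
    """True when the certificate expires in fewer than threshold_days.
    The 14-day default is an assumption for this sketch."""
    return days_left < threshold_days


# Example usage (hypothetical host, requires network access):
#   days_left = days_until_expiry(fetch_not_after("api.example.com"))
#   if needs_alert(days_left):
#       print(f"certificate expires in {days_left:.1f} days")
```

A check like this, run on a schedule against the public endpoint, watches the certificate that clients actually see rather than the renewal pipeline. That means it would fire even when renewals fail silently, as they did here throughout April.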