The US LangSmith API experienced a 28-minute outage on May 1, 2025, caused by a combination of human error and a lack of observability. A conflicting DNS record, accidentally left over from a migration between certificate renewal automation technologies at the end of January, caused certificate renewals to fail in April. Once the root cause was identified, the record was deleted and a manual SSL certificate renewal was triggered, which restored connectivity. The incident exposed gaps in observability and prompted measures to prevent similar failures, including adding certificate expiry monitors and ensuring that logs from all Kubernetes system components are ingested.
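As a rough illustration of what a certificate expiry monitor might look like, the following is a minimal sketch using only the Python standard library. It is not the monitor described in the incident report: the function names, the 14-day threshold, and the `api.example.com` hostname are all hypothetical, chosen only for the example.

```python
import socket
import ssl
import time
from typing import Optional

WARN_DAYS = 14  # hypothetical alert threshold, not taken from the incident report


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days left before a certificate's notAfter time.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "May  1 00:00:00 2025 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter field.

    An already expired certificate makes the handshake raise
    ssl.SSLCertVerificationError, which a monitor should also treat as an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def should_alert(days_left: float, warn_days: int = WARN_DAYS) -> bool:
    """Alert when fewer than `warn_days` days remain before expiry."""
    return days_left < warn_days


# Example usage (performs a network call; the hostname is a placeholder):
#   remaining = days_until_expiry(fetch_not_after("api.example.com"))
#   if should_alert(remaining):
#       print(f"certificate expires in {remaining:.1f} days")
```

A check of this kind, run on a schedule, would flag a stalled renewal days or weeks before the certificate actually expires, rather than at the moment clients start failing to connect.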