How we would have managed a recent incident at Port with an incident agent
Blog post from Port
In a recent incident at Port, three teams investigated the same issue independently and duplicated each other's work. A customer created 1.7 million automation runs in 90 minutes, which caused Kafka offset lag and set off multiple PagerDuty alerts. Each team attempted to resolve the issue separately through service restarts. It took 43 minutes before one team recognized that the alerts were related and that they shared a root cause.

To handle incidents like this more effectively, Port has developed a triage agent equipped with a Context Lake and a triage skill file. The Context Lake centralizes information about services, deployments, and dependencies, while the triage skill file guides the agent's actions. The agent is meant to assess incidents, suggest remediation, and execute fixes with human approval, so that responses are faster and better coordinated.

Port's approach emphasizes autonomous incident resolution under controlled oversight: handle incidents efficiently, maintain a full audit trail, and quantify the agent's impact on incident management.
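
The correlation step that took 43 minutes by hand is the kind of check a triage agent could run as soon as alerts arrive. The following is a minimal sketch, not Port's implementation: the `Alert` and `Proposal` types, the service and team names, and the `DEPENDENCIES` map (standing in for dependency data a Context Lake might hold) are all hypothetical. It groups alerts by shared upstream dependency and refuses to execute any fix that a human has not approved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alert_id: str
    service: str
    team: str


@dataclass
class Proposal:
    dependency: str
    alert_ids: list
    action: str
    approved: bool = False  # stays False until a human signs off


# Hypothetical dependency data: service -> upstream dependencies.
DEPENDENCIES = {
    "automation-runner": {"kafka"},
    "webhook-dispatcher": {"kafka"},
    "audit-log-writer": {"kafka"},
    "billing-api": {"postgres"},
}


def group_by_shared_dependency(alerts, dependencies):
    """Return {dependency: [alerts]} for dependencies hit by alerts from 2+ teams."""
    by_dep = defaultdict(list)
    for alert in alerts:
        for dep in sorted(dependencies.get(alert.service, ())):
            by_dep[dep].append(alert)
    return {
        dep: group
        for dep, group in by_dep.items()
        if len({a.team for a in group}) > 1
    }


def propose(dep, group):
    """Build an unapproved proposal; the action text is a placeholder."""
    return Proposal(
        dependency=dep,
        alert_ids=[a.alert_id for a in group],
        action=f"investigate shared dependency '{dep}'",
    )


def execute(proposal):
    """Run a fix only after explicit human approval."""
    if not proposal.approved:
        raise PermissionError("human approval required before executing a fix")
    return f"executed: {proposal.action}"


alerts = [
    Alert("A1", "automation-runner", "team-a"),
    Alert("A2", "webhook-dispatcher", "team-b"),
    Alert("A3", "audit-log-writer", "team-c"),
    Alert("A4", "billing-api", "team-d"),
]
for dep, group in group_by_shared_dependency(alerts, DEPENDENCIES).items():
    print(dep, propose(dep, group).alert_ids)
```

Grouping on the dependency rather than on the alerting service is the design choice that matters here: three alerts owned by three teams collapse into one candidate root cause, while the unrelated `billing-api` alert stays out of the group.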