
How we would have managed a recent incident at Port with an incident agent

Blog post from Port

Post Details
Company
Date Published
Author: Zohar Einy
Word Count: 2,180
Language: English
Hacker News Points: -
Summary

In a recent incident at Port, three teams investigated the same issue independently, duplicating time and effort. The incident began when a customer created 1.7 million automation runs in 90 minutes, which caused Kafka offset lag and set off multiple PagerDuty alerts. Each team tried to resolve its own alerts by restarting services, and it took 43 minutes before one team recognized that the alerts were related and shared a root cause.

To prevent a repeat, Port has developed a triage agent equipped with a Context Lake and a triage skill file. The Context Lake centralizes information about services, deployments, and dependencies, while the triage skill file guides the agent's actions. The agent is meant to assess incidents, suggest remediation, and execute fixes with human approval, so that responses are faster and better coordinated. Port's approach emphasizes autonomous incident resolution under controlled oversight, while maintaining a full audit trail and quantifying the agent's impact on incident management.
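The correlation step that took the teams 43 minutes can be illustrated with a minimal sketch. It is not Port's implementation: the service names, team names, dependency map, and 15-minute window below are all hypothetical, and a plain dictionary stands in for the dependency information a Context Lake would hold. The idea is only that alerts from different teams are grouped when their services share an upstream dependency and the alerts fire close together.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str  # service that paged (hypothetical names below)
    team: str     # team that owns the service
    minute: int   # minutes elapsed since the first alert


# Hypothetical dependency map: service -> upstream components it relies on.
# This stands in for dependency data a context store would provide.
DEPENDENCIES = {
    "automation-runner": {"kafka"},
    "audit-log-writer": {"kafka"},
    "webhook-dispatcher": {"kafka"},
    "billing-api": {"postgres"},
}


def correlate(alerts, dependencies, window_minutes=15):
    """Map each shared dependency to the teams paged for it within the window."""
    by_dependency = defaultdict(list)
    for alert in alerts:
        for dep in dependencies.get(alert.service, ()):
            by_dependency[dep].append(alert)

    groups = {}
    for dep, related in by_dependency.items():
        teams = {a.team for a in related}
        span = max(a.minute for a in related) - min(a.minute for a in related)
        # Flag a likely shared cause only when several teams are paged
        # for the same dependency within the time window.
        if len(teams) > 1 and span <= window_minutes:
            groups[dep] = sorted(teams)
    return groups


alerts = [
    Alert("automation-runner", "team-a", 0),
    Alert("audit-log-writer", "team-b", 4),
    Alert("webhook-dispatcher", "team-c", 9),
    Alert("billing-api", "team-d", 5),
]
shared = correlate(alerts, DEPENDENCIES)
print(shared)  # {'kafka': ['team-a', 'team-b', 'team-c']}
```

In this toy data, the three Kafka-dependent alerts are grouped into one candidate incident, while the unrelated `billing-api` alert is left out because only one team was paged for its dependency.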