How AI would have handled a real incident at Port
Blog post from Port
During a recent incident at Port, three separate teams were alerted to the same underlying issue, which led to duplicated effort and a delayed resolution. A customer generated 1.7 million automation runs in 90 minutes (roughly 315 runs per second on average), causing Kafka offset lag and triggering multiple PagerDuty alerts. The teams worked in isolation, unaware of each other's investigations, and it took 77 minutes to identify the common root cause.

In response, Port is developing an autonomous incident resolution system, starting with a triage agent. The agent draws on a Context Lake for service and deployment data so that it can respond quickly and with full context. It is meant to streamline incident management by correlating alerts, suggesting fixes that require human approval, and executing those fixes with built-in safeguards. Port also intends for the system to compile post-mortems automatically from existing data, improving documentation and moving toward autonomous incident management under controlled human oversight.
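To make the idea of correlating alerts concrete, here is a minimal sketch of one way it could work: group alerts that share an upstream dependency and fire close together in time, then flag any dependency implicated by more than one team. This is an illustration only, not Port's implementation. The post does not describe the agent's internals, so the `Alert` type, the `DEPENDS_ON` map, the service and team names, and the 15-minute window are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    team: str
    service: str
    fired_at: datetime


# Hypothetical dependency map (service -> upstream components it relies on).
# In a real system this context would come from a service catalog.
DEPENDS_ON = {
    "automation-runner": {"kafka"},
    "webhook-dispatcher": {"kafka"},
    "audit-log-writer": {"kafka", "postgres"},
}


def correlate(alerts, window=timedelta(minutes=15)):
    """Group alerts by shared upstream dependency.

    An alert joins a dependency's group if it fires within `window` of the
    first alert in that group. Only dependencies that implicate more than
    one team are returned, since those are candidates for a shared cause.
    """
    by_dependency = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for dep in DEPENDS_ON.get(alert.service, ()):
            group = by_dependency[dep]
            if not group or alert.fired_at - group[0].fired_at <= window:
                group.append(alert)
    return {
        dep: group
        for dep, group in by_dependency.items()
        if len({a.team for a in group}) > 1
    }


# Made-up alerts from three teams, loosely mirroring the incident's shape.
start = datetime(2024, 1, 1, 12, 0)
alerts = [
    Alert("team-a", "automation-runner", start),
    Alert("team-b", "webhook-dispatcher", start + timedelta(minutes=4)),
    Alert("team-c", "audit-log-writer", start + timedelta(minutes=9)),
]
groups = correlate(alerts)
for dep, group in groups.items():
    print(dep, sorted(a.team for a in group))
# prints: kafka ['team-a', 'team-b', 'team-c']
```

In this sketch, all three teams' alerts trace back to `kafka`, so a triage step could surface one shared lead within minutes instead of leaving each team to investigate alone. A production version would likely weigh additional signals, such as recent deployments, before suggesting a cause.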