
Building for resilience: Runpod’s response to the AWS us-east-1 outage

Blog post from RunPod

Post Details
Company
Date Published
Author: Mo King
Word Count: 503
Language: English
Hacker News Points: -
Summary

A major AWS outage in the us-east-1 region last week disrupted numerous services, including Runpod. Because parts of the platform depend on AWS infrastructure, console availability was affected and Pod provisioning and access were delayed. Runpod's GPU compute resources remained operational throughout, no data was lost, and no configurations were changed. In response, Runpod's engineering team quickly deployed core services redundantly across multiple AWS regions so that the platform stays resilient in future incidents. The team also enhanced the Serverless platform to keep functioning from cached configurations when the control plane is disrupted. The incident highlighted the need for a more robust architecture, and Runpod now plans to transition to a partitioned multi-region deployment on its network, with automated load balancing and failover to mitigate the impact of similar outages.
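The two mitigations described in the summary, regional failover and serving from cached configuration when the control plane is unreachable, can be illustrated with a minimal Python sketch. This is not Runpod's implementation. Every name here (`ConfigResolver`, `ControlPlaneUnavailable`, the fetcher callables, the endpoint ID) is hypothetical and chosen only for illustration. The sketch assumes each region exposes a function that returns an endpoint's configuration or raises when that region is down.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


class ControlPlaneUnavailable(Exception):
    """Raised when a (hypothetical) regional control plane cannot be reached."""


@dataclass
class CachedConfig:
    value: dict
    fetched_at: float  # clock reading when the config was last fetched live


# A fetcher takes an endpoint ID and returns its config, or raises
# ControlPlaneUnavailable. One fetcher per region (assumption).
Fetcher = Callable[[str], dict]


class ConfigResolver:
    """Try each region in order; fall back to the last known config."""

    def __init__(self, fetchers: Iterable[Fetcher],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetchers = list(fetchers)  # ordered by region preference
        self._clock = clock
        self._cache: Dict[str, CachedConfig] = {}

    def get(self, endpoint_id: str) -> Tuple[dict, str]:
        """Return (config, source), where source is "live" or "cached"."""
        for fetch in self._fetchers:
            try:
                value = fetch(endpoint_id)
            except ControlPlaneUnavailable:
                continue  # this region is down, try the next one
            # Remember the latest good config for later outages.
            self._cache[endpoint_id] = CachedConfig(value, self._clock())
            return value, "live"

        # Every region failed: serve the cached copy if one exists.
        cached = self._cache.get(endpoint_id)
        if cached is None:
            raise ControlPlaneUnavailable(
                f"no region reachable and no cached config for {endpoint_id}"
            )
        return cached.value, "cached"

    def cache_age(self, endpoint_id: str) -> float:
        """Seconds since the cached config was last refreshed from a live region."""
        return self._clock() - self._cache[endpoint_id].fetched_at
```

A caller would construct the resolver with one fetcher per region, for example `ConfigResolver([fetch_primary, fetch_secondary])`, and call `get("my-endpoint")` before dispatching work. Returning the source alongside the config lets the caller log or alert when it is running on cached data, and `cache_age` gives a way to refuse configurations that are too stale.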