Alaska Airlines recently experienced a three-hour outage caused by the unexpected failure of a multi-redundant hardware component, a reminder that redundancy alone does not guarantee resilience. The incident shows why the trade-off between cost and resilience should be settled with data rather than assumptions. Resilience testing supplies that data: blackhole experiments, for example, simulate an outage by cutting off traffic to a component so teams can observe how the rest of the system behaves when it fails. Gremlin, a reliability management platform, offers tools for standardized testing and reports on redundancy and resilience, helping organizations make informed decisions about infrastructure investments and prevent outages. Regular testing and analysis of the results let teams find and fix vulnerabilities before they cause significant disruptions.
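To make the idea of a blackhole experiment concrete, here is a minimal Python sketch. It is an in-process simulation, not Gremlin's API and not a real network test: the names Replica, Service, and blackhole_experiment are hypothetical, and "blackholing" is modeled as a flag that makes a replica stop answering. Both services below have the same two replicas, but only one is configured to fail over, which is the gap between redundancy and resilience that such an experiment is meant to expose.

```python
from dataclasses import dataclass


class Unreachable(Exception):
    """Raised when a replica does not answer (its traffic is blackholed)."""


@dataclass
class Replica:
    # Hypothetical stand-in for one redundant component.
    name: str
    blackholed: bool = False

    def handle(self, request: str) -> str:
        if self.blackholed:
            raise Unreachable(self.name)
        return f"{self.name} served {request}"


@dataclass
class Service:
    # Hypothetical client-facing service backed by redundant replicas.
    replicas: list
    failover: bool = True

    def call(self, request: str) -> str:
        # Without failover, only the first replica is ever tried.
        candidates = self.replicas if self.failover else self.replicas[:1]
        for replica in candidates:
            try:
                return replica.handle(request)
            except Unreachable:
                continue
        raise Unreachable("all candidates failed")


def blackhole_experiment(service: Service, target: Replica, requests: int = 100) -> float:
    """Blackhole one replica, send requests, and return the success rate."""
    target.blackholed = True
    try:
        ok = 0
        for i in range(requests):
            try:
                service.call(f"req-{i}")
                ok += 1
            except Unreachable:
                pass
        return ok / requests
    finally:
        target.blackholed = False  # always restore the replica afterwards


if __name__ == "__main__":
    primary, standby = Replica("primary"), Replica("standby")
    with_failover = Service([primary, standby], failover=True)
    without_failover = Service([primary, standby], failover=False)
    print(blackhole_experiment(with_failover, primary))     # 1.0
    print(blackhole_experiment(without_failover, primary))  # 0.0
```

The success rate returned by each run is the kind of measurement that can feed a cost decision: if failover already keeps the rate at an acceptable level, adding more hardware may not be the investment that matters. A real experiment would drop actual network traffic and measure live requests.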