Four Considerations When Designing Systems For Graceful Degradation

Post Details

Company

New Relic

Date Published

April 15, 2020

Author

Dan Sullivan

Word Count

968

Language

English

Hacker News Points

-

Source URL

newrelic.com/blog/observability/design-software-for-graceful-degradation

Summary

Systems can behave unexpectedly when pushed beyond their limits, leading to service degradation and outages, often without a single root cause. To mitigate such issues, Site Reliability Engineering (SRE) and DevOps teams should plan for macro-level problems like saturation and excessive workloads by designing systems for resiliency and graceful degradation. Four strategies are available: shedding load, time-shifting work, reducing the quality of service, and adding capacity, each with its own advantages and trade-offs. Load shedding drops lower-priority requests to manage excessive demand, while time-shifting decouples request generation from processing so that work can be handled asynchronously through message queues. Reducing the quality of service relieves system stress by offering limited functionality, and adding capacity through autoscaling is ideal but requires that sufficient resources be available. Advanced capacity planning and proactive monitoring are crucial for building resilient systems that maintain uptime by managing workload patterns and complex failure modes effectively.
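
As a rough illustration of how load shedding and time-shifting can work together, the Python sketch below models a small admission controller. Everything in it is hypothetical and not taken from the original post: the names `AdmissionController`, `Priority`, `soft_limit`, and `queue_limit` are assumptions, and an in-memory deque stands in for the message queue a real system would use.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Deque, Optional


class Priority(IntEnum):
    LOW = 0
    HIGH = 1


@dataclass
class AdmissionController:
    """Hypothetical controller: sheds or defers work as load approaches capacity."""

    capacity: int          # hard limit on concurrently processed requests (assumed)
    soft_limit: int        # at or above this, LOW priority work is no longer run immediately
    queue_limit: int = 100  # bound on deferred work, so the queue itself cannot grow forever
    in_flight: int = 0
    deferred: Deque[str] = field(default_factory=deque)

    def submit(self, request_id: str, priority: Priority, deferrable: bool = False) -> str:
        has_room = self.in_flight < self.capacity
        below_soft_limit = self.in_flight < self.soft_limit

        # Run immediately if there is room and the request is important enough.
        if has_room and (priority == Priority.HIGH or below_soft_limit):
            self.in_flight += 1
            return "accepted"

        # Time-shifting: park work that can wait instead of processing it now.
        if deferrable and len(self.deferred) < self.queue_limit:
            self.deferred.append(request_id)
            return "deferred"

        # Load shedding: drop the request so the rest of the system stays healthy.
        return "shed"

    def complete(self) -> Optional[str]:
        """Mark one request finished; start a deferred one if load has eased."""
        self.in_flight = max(0, self.in_flight - 1)
        if self.deferred and self.in_flight < self.soft_limit:
            self.in_flight += 1
            return self.deferred.popleft()
        return None
```

With `capacity=5` and `soft_limit=4`, the first four low-priority requests run immediately; after that, low-priority requests are shed unless they are marked deferrable, in which case they wait in the queue until `complete()` brings the load back under the soft limit. High-priority requests can still use the remaining slot up to the hard capacity. The bounded queue is a deliberate choice in this sketch: an unbounded backlog would only move the saturation problem from the workers to the queue.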