
Engineering a fault tolerant distributed system

What's this blog post about?

The post discusses how to design and engineer fault-tolerant systems that detect and remediate failures at scale. It defines dependability as a combination of availability and reliability: availability means the service can be used whenever it is required, and reliability means the service behaves as expected. A key element of fault tolerance is redundancy, which means provisioning more capacity than is required to deliver the service. The post also covers stateless and stateful services, architectural approaches to achieving reliability, consensus formation in globally distributed systems, the observation that health is not binary, and the effects of resource availability and resource scalability on fault tolerance. It concludes that fault tolerance is an approach to building systems that can withstand and mitigate adverse events and operating conditions so that they dependably continue to deliver the level of service users expect.

Company
Ably

Date published
Feb. 15, 2021

Author(s)
Paddy Byers

Word count
3669

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.