Engineering a fault-tolerant distributed system

Company

Ably

Date Published

Feb. 15, 2021

Author

Paddy Byers

Word count

3669

Language

English

Hacker News points

None

URL

ably.com/blog/engineering-dependability-and-fault-tolerance-in-a-distributed-system

Summary

Designing fault-tolerant systems starts with understanding the nature of failures, especially in distributed systems, where failures are expected and can be non-binary. The design goal is dependability, which combines availability (the service is accessible) and reliability (the service functions as expected). Redundancy plays a crucial role: excess capacity lets the service continue when individual components fail. Stateless services, which operate independently of past interactions, are easier to make fault tolerant because any redundant resource can take over the work. Stateful services, which depend on the continuity of state across interactions, face more complex challenges and require mechanisms such as consensus formation and robust state persistence to remain reliable. The Ably platform applies these principles through multiple layers of fault tolerance mechanisms, including stateful role placement and channel persistence, to provide high levels of service availability and reliability. These mechanisms are supported by engineering practices that address real-world challenges such as resource availability, scalability, and dynamic consensus formation in globally distributed systems.
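
The role of redundancy for stateless services can be illustrated with a small calculation. The sketch below is not taken from the article: it assumes that replicas fail independently of one another, which real systems only approximate, and the function names `redundant_availability` and `replicas_needed` are hypothetical. Under that assumption, a service backed by n replicas is unavailable only when all n are down at once.

```python
def redundant_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` instances is up.

    Assumes every instance has the same availability `single` and that
    failures are independent (an idealisation, not a measured property).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    # The service is down only if every replica is down simultaneously.
    return 1.0 - (1.0 - single) ** replicas


def replicas_needed(single: float, target: float) -> int:
    """Smallest replica count whose combined availability meets `target`."""
    if single <= 0.0 or single > 1.0:
        raise ValueError("single must be in (0, 1]")
    if target >= 1.0 and single < 1.0:
        raise ValueError("a target of 1.0 is unreachable with imperfect replicas")
    n = 1
    while redundant_availability(single, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, redundant_availability(0.99, n))
    print(replicas_needed(0.99, 0.99999))
```

This arithmetic covers availability only. It says nothing about stateful services, where a surviving replica must also hold the correct state, which is why the summary singles out consensus formation and state persistence as separate concerns.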