Reliable systems are vital to meeting customer expectations, and downtime not only hurts a company's bottom line but can also damage its reputation. Gremlin's goal is to help enterprises build more reliable systems through Chaos Engineering: proactively testing how a system responds under stress so that failures can be identified and fixed before they cascade into customer-facing issues or downtime. In support of that goal, the team monitors its own systems with Datadog. It builds dynamic dashboards that use template variables to filter key health metrics across multiple environments and apps, and it uses synthetic monitoring to track outgoing changes and how they affect key user flows. A chaos experiment intentionally provokes a problem in a controlled manner, monitors the system's response, and uses the collected insights to learn how best to mitigate the problem and prevent it from affecting customers in the future. The Gremlin integration with Datadog gives users more context around their chaos experiments, so they can see how each experiment plays out in real time.
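The experiment loop described above (provoke a problem in a controlled way, watch the response, learn from it) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not call the Gremlin or Datadog APIs, the service latency is simulated rather than measured, and every name in it (`simulated_latency_ms`, `run_experiment`, `blast_radius`, the 100 ms threshold) is hypothetical and chosen for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_p95_ms: float
    chaos_p95_ms: float
    hypothesis_held: bool


def simulated_latency_ms(rng: random.Random, injected_ms: float = 0.0) -> float:
    """Hypothetical stand-in for a real request: 20-40 ms plus any injected delay."""
    return rng.uniform(20.0, 40.0) + injected_ms


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[index]


def run_experiment(
    rng: random.Random,
    injected_ms: float,
    threshold_ms: float,
    blast_radius: float = 1.0,
    requests: int = 200,
) -> ExperimentResult:
    """Compare latency with and without an injected fault.

    blast_radius is the fraction of requests that receive the fault,
    which keeps the experiment controlled (an assumption of this sketch).
    The hypothesis is that p95 latency stays at or below threshold_ms.
    """
    baseline = [simulated_latency_ms(rng) for _ in range(requests)]
    chaos = []
    for _ in range(requests):
        affected = rng.random() < blast_radius
        chaos.append(simulated_latency_ms(rng, injected_ms if affected else 0.0))
    chaos_p95 = p95(chaos)
    return ExperimentResult(p95(baseline), chaos_p95, chaos_p95 <= threshold_ms)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs are comparable
    result = run_experiment(rng, injected_ms=200.0, threshold_ms=100.0)
    print(result)
```

In a real setup the fault would be injected by a chaos tool and the latency read from a monitoring system; the value of the sketch is the shape of the loop: a baseline, a bounded fault, and an explicit pass/fail hypothesis.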