Remediate issues autonomously with Bits Infrastructure Operations
Blog post from Datadog
Bits Infrastructure Operations by Datadog autonomously detects, investigates, and remediates common infrastructure issues across hosts, Kubernetes, serverless functions, and network infrastructure. It aims to reduce the load on infrastructure teams by resolving issues such as disk saturation, CrashLoopBackOff errors, and expiring TLS certificates before they escalate into incidents.

Application engineers can safely address infrastructure issues affecting their services, while platform engineers keep control through guardrails they define. These guardrails set operational boundaries that determine which remediation actions are safe for a given environment and resource type, and a human-in-the-loop workflow lets teams approve high-priority fixes.

Bits Infrastructure Operations also helps teams prevent recurring issues: it learns from previously approved fixes and updates guardrails so that similar issues can be remediated autonomously in the future. It extends into the pull request workflow as well, flagging risky infrastructure-as-code changes before they reach production and using real-time telemetry data to assess their potential impact.

By reducing repetitive operational work, Bits Infrastructure Operations frees platform teams to focus on systemic improvements to the reliability and performance of their infrastructure.
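To make the guardrail idea concrete, here is a minimal Python sketch of how a policy keyed on environment and resource type might decide between autonomous remediation, human approval, and refusal, and how an approved fix might later become autonomous. This is an illustration only and not Datadog's API or implementation: `Guardrail`, `Decision`, `evaluate`, `promote_after_approval`, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    AUTO_REMEDIATE = "auto_remediate"    # safe to apply without a human
    NEEDS_APPROVAL = "needs_approval"    # route to a human-in-the-loop step
    BLOCKED = "blocked"                  # outside every guardrail


@dataclass
class Guardrail:
    """Hypothetical guardrail: one (environment, resource type) boundary."""
    environment: str
    resource_type: str
    auto_actions: set = field(default_factory=set)
    approval_actions: set = field(default_factory=set)


def evaluate(guardrails, environment, resource_type, action):
    """Return the decision for a proposed remediation action."""
    for g in guardrails:
        if g.environment == environment and g.resource_type == resource_type:
            if action in g.auto_actions:
                return Decision.AUTO_REMEDIATE
            if action in g.approval_actions:
                return Decision.NEEDS_APPROVAL
    # Anything not explicitly allowed is refused.
    return Decision.BLOCKED


def promote_after_approval(guardrail, action):
    """Assumed learning step: an approved fix becomes autonomous next time."""
    if action in guardrail.approval_actions:
        guardrail.approval_actions.discard(action)
        guardrail.auto_actions.add(action)


# Example policy (all names are illustrative assumptions).
prod_pods = Guardrail(
    environment="prod",
    resource_type="kubernetes_pod",
    auto_actions={"restart_pod"},
    approval_actions={"rollback_deployment"},
)
policy = [prod_pods]
```

With this policy, `evaluate(policy, "prod", "kubernetes_pod", "rollback_deployment")` returns `Decision.NEEDS_APPROVAL` until `promote_after_approval(prod_pods, "rollback_deployment")` is called, after which it returns `Decision.AUTO_REMEDIATE`. Defaulting to `BLOCKED` for anything unlisted reflects the deny-by-default stance that operational boundaries imply, though the actual product may behave differently.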