Improving Software Failure: Measure, Change, Learn

What's this blog post about?

This post explains how to use DORA metrics to measure software failures, learn from them, and improve. It stresses that addressing failure takes collaboration across the whole team: developers, designers, project managers, management, and ops. It lays out three steps for learning from software failure: measure it, analyze it, and make changes based on the findings. DORA metrics are introduced as a way to measure engineering efficiency through four key metrics: Deployment Frequency, Change Lead Time, Time to Restore Service, and Change Failure Rate. The post also covers different levels of failure measurement, from incident management systems down to tactical metrics such as error rates in Sentry or CPU usage in Datadog or Prometheus. It highlights the value of a blameless postmortem process for learning from failures and making improvements, and closes by encouraging automation as a way to reduce mistakes and increase efficiency in software development processes.
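To make the four DORA metrics concrete, here is a minimal Python sketch of how they might be computed from a list of deployment records. It is an illustration under stated assumptions, not the method used in the post: the `Deployment` record, its field names, the `dora_metrics` function, and the sample data are all hypothetical. It also assumes that time to restore is measured from the failing deployment to the moment service is restored, and that medians are a reasonable summary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when the change reached production
    failed: bool = False                     # whether it caused a production failure
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics for deployments within one reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    failures = [d for d in deployments if d.failed]
    # Assumption: restore time runs from the failing deployment to restoration.
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "change_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "time_to_restore_service": median(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deployments),
    }


# Made-up sample data: four deployments over a four-day window, one of which failed.
base = datetime(2022, 9, 1, 9, 0)
sample = [
    Deployment(base, base + timedelta(hours=2)),
    Deployment(
        base + timedelta(days=1),
        base + timedelta(days=1, hours=4),
        failed=True,
        restored_at=base + timedelta(days=1, hours=5),
    ),
    Deployment(base + timedelta(days=2), base + timedelta(days=2, hours=6)),
    Deployment(base + timedelta(days=3), base + timedelta(days=3, hours=8)),
]

metrics = dora_metrics(sample, window_days=4)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In practice the records would come from a CI/CD system and an incident management tool rather than a hand-written list, and a team may prefer a different definition of when a failure starts.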


Date published
Sept. 29, 2022


Hacker News points
None found.


By Matt Makai. 2021-2024.