Reducing flaky builds by 18x

Post Details

Company

GitHub

Date Published

Dec. 16, 2020

Author

Jordan Raine

Word Count

1,447

Language

English

Source URL

github.blog/engineering/engineering-principles/reducing-flaky-builds-by-18x

Summary

In this blog post, Jordan Raine describes a system GitHub built to manage flaky tests in its continuous integration (CI) pipeline. Before the system, nearly 9% of commits had a red build caused by a flaky test; afterward, the rate fell below 0.5%, roughly an 18x reduction. A flaky test is one that both passes and fails against the same code. The system detects such tests automatically and tracks each one's failure history and impact. To narrow down the cause of flakiness, it retries a failed test under different scenarios, which helps distinguish randomness, time-based issues, and order dependence. It then computes an impact score for each flaky test and assigns an issue to the relevant developers, so that only the most significant flaky tests are escalated for human investigation. As a result, CI is more reliable, red builds are more meaningful, and deployments face fewer unnecessary delays.
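
The retry-and-prioritize idea in the summary can be sketched roughly as follows. This is a minimal illustration, not GitHub's implementation: the three retry scenarios, the mapping from retry outcome to cause, the `impact_score` formula, and every name below are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryResults:
    """Outcome of rerunning one failed test (hypothetical scenarios)."""
    same_process: bool   # passed when rerun immediately in the same process
    shifted_clock: bool  # passed when rerun with the clock moved forward
    isolated: bool       # passed when rerun alone, away from earlier tests


def classify_failure(results: RetryResults) -> Optional[str]:
    """Guess the kind of flakiness from which retry passed first.

    Returns None when every retry failed, in which case the failure
    is treated as genuine rather than flaky.
    """
    if results.same_process:
        return "random"
    if results.shifted_clock:
        return "time-based"
    if results.isolated:
        return "order-dependent"
    return None


def impact_score(failed_builds: int, affected_developers: int) -> int:
    """Hypothetical score: more broken builds and more people hit rank higher."""
    return failed_builds * affected_developers


def needs_investigation(score: int, threshold: int = 50) -> bool:
    """Only flaky tests at or above the (assumed) threshold get an issue."""
    return score >= threshold
```

Under these assumptions, a test that passes on an immediate rerun is labelled "random", and a flaky test that broke 10 builds for 6 developers scores 60, which clears the assumed threshold of 50 and would be assigned for investigation.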