How we reduced core unit boot time from hours to minutes
Blog post from Cloudflare
Following a firmware update, Cloudflare's core servers took up to four hours to come back online after a reboot. The cause was a firmware quirk combined with inefficient network boot interface selection: each server searched linearly through every available network boot interface and sat through prolonged timeout periods before reaching the correct boot stage. To resolve this, Cloudflare restructured its boot automation workflow to declare the correct network boot interface order early in the process, cutting boot and upgrade times from hours to minutes. The work involved collaborating with OEM vendors to enable programmatic boot order control, overcoming challenges with legacy support and with vendor strings that differ between network cards, and optimizing the use of open-source tools such as iPXE. The changes also eliminated the need for manual BIOS interaction, enabling dynamic, scalable, automated server provisioning across Cloudflare's globally distributed fleet.
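To make the cost of the linear search concrete, here is a minimal Python sketch of the idea, not Cloudflare's actual tooling. It models a firmware that tries each network boot interface in turn and waits out a timeout on every one that fails, then shows how moving the right interface to the front removes the wait. The `BootInterface` type, the helper names `time_to_boot` and `prioritise`, the interface labels, and the 300-second timeout are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootInterface:
    name: str           # label as a firmware might report it (hypothetical)
    can_pxe_boot: bool  # True only for the interface that can actually network boot


def time_to_boot(order, timeout_s):
    """Seconds spent waiting before the first bootable interface is reached,
    assuming interfaces are tried one by one and each failure costs timeout_s."""
    waited = 0
    for iface in order:
        if iface.can_pxe_boot:
            return waited
        waited += timeout_s
    raise RuntimeError("no bootable interface in the list")


def prioritise(order, patterns):
    """Stable reorder: interfaces whose name contains any pattern
    (case-insensitive) move to the front; the rest keep their relative order."""
    def matches(iface):
        lowered = iface.name.lower()
        return any(p.lower() in lowered for p in patterns)

    # sorted() is stable and False sorts before True, so matches come first.
    return sorted(order, key=lambda iface: not matches(iface))


# Hypothetical default order: the usable interface is tried last.
default_order = [
    BootInterface("Onboard NIC 1 IPv4", False),
    BootInterface("Onboard NIC 2 IPv4", False),
    BootInterface("VendorA 25GbE Port 1 IPv4", True),
]

print(time_to_boot(default_order, timeout_s=300))  # 600

# Vendor strings differ between cards, so match on a list of known patterns.
fixed_order = prioritise(default_order, ["vendora 25gbe", "vendorb 25g"])
print(time_to_boot(fixed_order, timeout_s=300))    # 0
```

Matching on a list of substrings rather than one exact name is a deliberate choice in this sketch: it reflects the point above that network cards from different vendors report different strings, so a single hard-coded label would not cover a mixed fleet.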