How we reduced core unit boot time from hours to minutes

Blog post from Cloudflare

Post Details
Company: Cloudflare
Authors: Giovanni Pereira Zantedeschi, Nnamdi Ajah, and Omar Sheik-Omar
Word Count: 1,759
Language: English
Summary

Following a firmware update, Cloudflare's core servers took up to four hours to come back online after a reboot. The cause was a firmware quirk combined with inefficient network boot interface selection: each server searched linearly through every available network boot interface, which caused prolonged timeout periods before it reached the correct boot stage. To resolve this, Cloudflare restructured its boot automation workflow to declare the correct network boot interface order early in the process, which cut boot and upgrade times from hours to minutes. The work involved collaborating with OEM vendors to enable programmatic boot order control, overcoming challenges with legacy support and differing network card vendor strings, and optimizing the use of open-source tools like iPXE. The changes also eliminated the need for manual BIOS interactions, enabling dynamic, scalable, and automated server provisioning across Cloudflare's globally distributed fleet.
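To illustrate why the interface order matters, here is a minimal Python sketch of the behavior the summary describes. It is a toy model, not Cloudflare's tooling: the interface names, the 300-second timeout, and the helpers `time_to_boot` and `prioritize` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootInterface:
    name: str               # hypothetical label, as firmware might report it
    can_network_boot: bool  # True only for the port on the boot network


def time_to_boot(order, timeout_s):
    """Seconds spent waiting before the first bootable interface is reached."""
    waited = 0
    for iface in order:
        if iface.can_network_boot:
            return waited
        waited += timeout_s  # each dead interface costs one full timeout
    raise RuntimeError("no bootable interface in the boot order")


def prioritize(order, wanted_substring):
    """Stable reorder: interfaces whose name contains the substring go first.

    Matching is case-insensitive, a simple stand-in for coping with
    vendor strings that differ between network cards.
    """
    key = wanted_substring.lower()
    return sorted(order, key=lambda iface: key not in iface.name.lower())


# Hypothetical default order: the bootable port happens to be listed last.
default_order = [
    BootInterface("onboard-1", False),
    BootInterface("onboard-2", False),
    BootInterface("addon-nic-1", False),
    BootInterface("addon-nic-2", True),
]

ASSUMED_TIMEOUT_S = 300  # illustrative value, not taken from the post

slow = time_to_boot(default_order, ASSUMED_TIMEOUT_S)
fast = time_to_boot(prioritize(default_order, "ADDON-NIC-2"), ASSUMED_TIMEOUT_S)
print(f"default order waits {slow} s, declared order waits {fast} s")
```

In this model the waiting time grows with the number of non-bootable interfaces ahead of the correct one, so declaring the order up front removes the timeouts entirely rather than shortening them.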