Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare
Blog post from Cloudflare
Ecdysis is a Rust library developed by Cloudflare for zero-downtime upgrades of network services: it restarts a process gracefully without dropping live connections or refusing new ones. After five years of production use across Cloudflare's global network, ecdysis has been open-sourced. The library uses a fork-and-exec model similar to the one pioneered by NGINX: the parent process forks a child that runs the new code and inherits the parent's socket file descriptors, so connections continue to be handled throughout the upgrade. Updates, security patches, and new features can therefore be deployed without interrupting service, which matters for critical Cloudflare operations such as traffic routing and TLS management. Ecdysis supports asynchronous programming with Tokio and integrates with systemd for process lifecycle management, which makes it suitable for network services that require high uptime. At Cloudflare it has prevented millions of failed requests during upgrades.
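
The socket inheritance described above is easier to see in a small sketch. The code below is not the ecdysis API: it uses only the Rust standard library, and the names `release_listener`, `adopt_listener`, and `handoff_demo` are hypothetical. It also simulates the handoff inside a single process rather than performing a real fork and exec, on the assumption that the point of interest is what happens to the listening socket. A client connects while the "old" owner holds the listener, ownership of the raw file descriptor then moves to a "new" owner, and the new owner accepts the connection that was already queued in the kernel.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::os::unix::io::{FromRawFd, IntoRawFd, RawFd};

/// "Parent" side (hypothetical helper): give up ownership of the listener
/// without closing it, returning the raw descriptor number.
/// Assumption: in a real fork-and-exec restart the parent would also have to
/// clear FD_CLOEXEC on this descriptor and tell the child its number.
fn release_listener(listener: TcpListener) -> RawFd {
    listener.into_raw_fd()
}

/// "Child" side (hypothetical helper): rebuild a listener around an
/// inherited descriptor.
fn adopt_listener(fd: RawFd) -> TcpListener {
    // SAFETY: the caller guarantees `fd` is an open, listening TCP socket
    // that no other object in this process owns.
    unsafe { TcpListener::from_raw_fd(fd) }
}

/// Single-process simulation of the handoff. Returns the listening address
/// before and after the handoff, plus the bytes the new owner received.
fn handoff_demo() -> io::Result<(SocketAddr, SocketAddr, String)> {
    // The "old generation" binds to an ephemeral local port.
    let old = TcpListener::bind("127.0.0.1:0")?;
    let addr_before = old.local_addr()?;

    // A client connects before the handoff. The kernel completes the TCP
    // handshake and parks the connection in the socket's accept queue.
    let mut client = TcpStream::connect(addr_before)?;
    client.write_all(b"hello")?;

    // Hand the descriptor over: the socket itself is never closed.
    let fd = release_listener(old);
    let new = adopt_listener(fd);
    let addr_after = new.local_addr()?;

    // The "new generation" accepts the connection that was queued earlier.
    let (mut conn, _peer) = new.accept()?;
    let mut buf = [0u8; 5];
    conn.read_exact(&mut buf)?;

    Ok((addr_before, addr_after, String::from_utf8_lossy(&buf).into_owned()))
}
```

Calling `handoff_demo()` should return the same address twice along with the client's message, because the listening socket and its queue belong to the kernel object behind the descriptor, not to whichever owner currently holds it. This is the property that lets a restarted process keep serving without refusing new connections.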