How We Built a Distributed Work Scheduling System for Pulumi Cloud
Blog post from Pulumi
Pulumi Cloud runs a scheduling system that manages workflows such as deployments, insights discovery scans, and policy evaluations, across both Pulumi's own infrastructure and customer-managed environments. The system was initially designed for simple deployment tasks. As requirements grew to include retries, failure handling, and dynamic routing, it evolved into what Pulumi calls the "background activity system." It supports two execution modes: direct, for Pulumi-hosted environments, and remote, for customer-managed setups, so the same scheduler works under different network conditions. Rather than adopt an off-the-shelf queue, Pulumi built its own system to minimize external dependencies and stay compatible with self-hosted installations. A lease-based concurrency model prevents double execution, and failures are handled through lease expiration, so work is recovered without manual intervention. The architecture is extensible: new workflow types can be integrated with full support for scheduling, retries, and observability, which reduces the operational complexity of distributed execution.
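
To make the lease idea concrete, here is a minimal sketch of lease-based claiming in Python. It is not Pulumi's implementation: the names `Activity`, `LeaseStore`, `claim`, `renew`, and `complete`, the 30-second lease, and the in-memory dictionary are all assumptions made for illustration. A real system would keep this state in a durable store and perform each claim atomically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Activity:
    """One unit of background work (hypothetical shape)."""
    id: str
    status: str = "pending"              # pending -> running -> done
    lease_owner: Optional[str] = None    # worker currently holding the lease
    lease_expires_at: float = 0.0        # time after which the lease is void
    attempts: int = 0                    # how many times the work was claimed


class LeaseStore:
    """In-memory stand-in for a durable store (assumption, single process only)."""

    def __init__(self, lease_seconds: float = 30.0):
        self.lease_seconds = lease_seconds
        self.activities = {}

    def add(self, activity_id: str) -> None:
        self.activities[activity_id] = Activity(id=activity_id)

    def _holds_lease(self, act: Activity, worker: str, now: float) -> bool:
        return act.lease_owner == worker and act.lease_expires_at > now

    def claim(self, worker: str, now: float) -> Optional[Activity]:
        """Lease the first unfinished activity that is unleased or expired."""
        for act in self.activities.values():
            if act.status == "done":
                continue
            if act.lease_owner is None or act.lease_expires_at <= now:
                act.status = "running"
                act.lease_owner = worker
                act.lease_expires_at = now + self.lease_seconds
                act.attempts += 1
                return act
        return None

    def renew(self, activity_id: str, worker: str, now: float) -> bool:
        """Heartbeat: extend the lease, but only for the current holder."""
        act = self.activities[activity_id]
        if not self._holds_lease(act, worker, now):
            return False
        act.lease_expires_at = now + self.lease_seconds
        return True

    def complete(self, activity_id: str, worker: str, now: float) -> bool:
        """Mark the work done; rejected if the caller's lease has lapsed."""
        act = self.activities[activity_id]
        if not self._holds_lease(act, worker, now):
            return False
        act.status = "done"
        act.lease_owner = None
        return True
```

In this sketch, a worker that crashes simply stops calling `renew`. Once its lease expires, the next `claim` from any other worker picks the activity up again, which is the recovery-without-intervention behavior described above. A late `complete` from the original worker is rejected, so the work is not recorded as finished twice. Time is passed in explicitly as `now` to keep the example deterministic.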