How we made PlanetScale’s background jobs self-healing

Company

PlanetScale

Date Published

Feb. 17, 2022

Author

Mike Coutermarsh

Word count

856

Language

English

Hacker News points

URL

planetscale.com/blog/how-we-made-planetscale-background-jobs-self-healing-with-sidekiq

Summary

When building PlanetScale, the team set two hard requirements for its background job system: losing queued jobs must not affect functionality, and any failed job must be re-run automatically. They use Sidekiq as the background queueing system. The core design decision was to add a second job whose sole responsibility is to schedule the original job, so the system self-heals when jobs are lost or fail. State is stored in the database, so even if the job triggered by a user action fails, it is re-run automatically. The team also added middleware that can disable scheduled jobs at any time, and used bulk scheduling to improve performance when enqueueing large numbers of jobs. Uniqueness is handled by storing state in the database, using database locks, and using Sidekiq's unique job feature.
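
The scheduler-job pattern described in the summary can be illustrated with a minimal sketch in plain Ruby. It is not PlanetScale's code and does not call Sidekiq: `FakeQueue` is a hypothetical in-memory stand-in for the Sidekiq queue, a Hash stands in for the database, and `Backup`, `BackupJob`, and `ScheduleBackupsJob` are invented names. The sketch only shows why keeping state in the database lets a periodic scheduler recover from a lost queue.

```ruby
require "set"

# Hypothetical record; in a real app this would be a database row.
Backup = Struct.new(:id, :state)

# Hypothetical in-memory stand-in for a job queue such as Sidekiq.
class FakeQueue
  attr_reader :jobs, :disabled

  def initialize
    @jobs = []
    @disabled = Set.new # job classes switched off (the "kill switch" idea)
  end

  # Enqueue many jobs in one call, analogous to bulk scheduling.
  # Returns the number of jobs actually enqueued.
  def push_bulk(job_class, args_list)
    return 0 if @disabled.include?(job_class)
    args_list.each { |args| @jobs << [job_class, args] }
    args_list.size
  end

  # Simulate losing everything that was waiting in the queue.
  def clear
    @jobs.clear
  end

  # Run every queued job once, in order.
  def drain(db)
    until @jobs.empty?
      job_class, args = @jobs.shift
      job_class.new.perform(db, *args)
    end
  end
end

# The "original" job. It checks database state first, so running it
# twice for the same record does no extra work.
class BackupJob
  def perform(db, backup_id)
    backup = db.fetch(backup_id)
    return unless backup.state == :pending
    backup.state = :complete
  end
end

# The scheduler job. It derives the work list from database state,
# not from what happens to be in the queue.
class ScheduleBackupsJob
  def perform(db, queue)
    pending_ids = db.values.select { |b| b.state == :pending }.map(&:id)
    queue.push_bulk(BackupJob, pending_ids.map { |id| [id] })
  end
end

db = {
  1 => Backup.new(1, :pending),
  2 => Backup.new(2, :pending),
  3 => Backup.new(3, :complete)
}
queue = FakeQueue.new
scheduler = ScheduleBackupsJob.new

scheduler.perform(db, queue) # enqueues jobs for backups 1 and 2
queue.clear                  # the queue is lost before any job runs
scheduler.perform(db, queue) # the next scheduler run re-enqueues them
queue.drain(db)              # both pending backups are now completed
```

Because the pending state lives in the database, the second scheduler run finds the same two records and enqueues them again. If the scheduler happened to enqueue a record twice, the state check in `BackupJob#perform` would turn the duplicate into a no-op, which is one simple way to approximate the uniqueness guarantees the summary mentions.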