Handling 300k requests per day: an adventure in scaling
Blog post from Firecrawl
When the author joined the Firecrawl team, the work initially centered on new features and minor bug fixes, until an influx of users strained the existing architecture and forced significant changes. The team's API service, hosted on Fly.io, suffered from job lock problems and inefficient handling of scrape requests, both made worse by frequent redeployments caused by memory leaks. The team then found that the queue's lock options were misconfigured; correcting them significantly improved performance by shortening crawl lock times and allowing jobs to be processed concurrently. Migrating from Bull to BullMQ brought better APIs and an actively maintained library, although Redis connectivity problems and egress fees along the way resulted in a costly bill. Integrating Sentry for monitoring surfaced previously unknown bugs and improved overall service reliability, which let the team go into a major launch week with confidence. Looking ahead, the team is exploring Kubernetes for finer control over scaling, as part of a continuing push for better reliability and performance.
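To make the lock misconfiguration more concrete, here is a minimal sketch of a sanity check for queue lock settings. The summary does not say which values were wrong, so everything here is an assumption: the option names `lockDuration`, `lockRenewTime`, and `concurrency` are borrowed from Bull/BullMQ, while the `LockOptions` interface, the `checkLockOptions` helper, and all numbers are hypothetical and not taken from the Firecrawl codebase.

```typescript
// Hypothetical sketch: field names mirror Bull/BullMQ worker options,
// but this interface and the checks are illustrative assumptions only.
interface LockOptions {
  lockDuration: number;  // ms a worker holds a job lock before it expires
  lockRenewTime: number; // ms between lock renewals while the job is running
  concurrency: number;   // number of jobs one worker may process in parallel
}

// Returns a list of human-readable problems; an empty list means no issue found.
function checkLockOptions(opts: LockOptions): string[] {
  const problems: string[] = [];
  if (opts.lockDuration <= 0) {
    problems.push("lockDuration must be positive");
  }
  if (opts.lockRenewTime <= 0) {
    problems.push("lockRenewTime must be positive");
  }
  if (opts.lockRenewTime >= opts.lockDuration) {
    // The lock would expire before the worker gets a chance to renew it.
    problems.push("lockRenewTime must be shorter than lockDuration");
  }
  if (!Number.isInteger(opts.concurrency) || opts.concurrency < 1) {
    problems.push("concurrency must be an integer >= 1");
  }
  return problems;
}

// Illustrative values only (not the ones from the post).
const suspicious: LockOptions = { lockDuration: 30000, lockRenewTime: 60000, concurrency: 1 };
const sane: LockOptions = { lockDuration: 30000, lockRenewTime: 15000, concurrency: 8 };

console.log(checkLockOptions(suspicious));
console.log(checkLockOptions(sane));
```

A check like this could run once at startup, before the options are handed to the queue library, so that an inconsistent lock configuration fails fast instead of surfacing later as stuck or repeatedly retried jobs.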