Handling Billions of LLM Logs with Upstash Kafka and Cloudflare Workers
Blog post from Upstash
Helicone, an open-source LLM observability platform, faced significant challenges scaling its logging infrastructure to keep pace with a growing user base. Its original serverless architecture, built on Cloudflare Workers, processed events inefficiently, lost data during downtime, and ran up against the Cloudflare Workers runtime constraints.

To address these issues, Helicone adopted Upstash Kafka as a persistent queue that handles high-volume data streaming and enables batch processing. Kafka decouples log ingestion from log processing: Cloudflare Workers publish events to a Kafka topic, and consumers running on ECS pull them in batches. Helicone chose Upstash Kafka for its managed-service features, notably an HTTP endpoint that integrates cleanly with serverless architectures where a traditional TCP Kafka client is impractical.

This overhaul enables Helicone to manage billions of logs with reliable ingestion and processing, while retaining flexibility for both real-time and historical analysis. The platform now delivers enhanced observability for LLM applications, providing real-time insights and performance optimization for startups and enterprises alike.
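The decoupled producer/consumer flow described above can be sketched in TypeScript. This is an illustrative sketch, not Helicone's actual code: the topic name (`request-response-logs`), the event shape, the environment variable names, and the exact REST request format are all assumptions layered on the general pattern of publishing over Upstash Kafka's HTTP endpoint from a Worker and batching on the consumer side.

```typescript
// Hypothetical event shape for a single LLM request/response log.
type LogEvent = {
  requestId: string;
  model: string;
  latencyMs: number;
  body: string;
};

// Producer side (Cloudflare Worker): publish each event over HTTP.
// Upstash Kafka exposes a REST produce endpoint, which is what makes it
// usable from the Worker sandbox where raw TCP Kafka clients are not.
// URL path and payload shape here are an assumption for illustration.
async function publishLog(
  event: LogEvent,
  env: { KAFKA_REST_URL: string; KAFKA_BASIC_AUTH: string }
): Promise<void> {
  await fetch(`${env.KAFKA_REST_URL}/produce/request-response-logs`, {
    method: "POST",
    headers: { Authorization: `Basic ${env.KAFKA_BASIC_AUTH}` },
    body: JSON.stringify({ value: JSON.stringify(event) }),
  });
}

// Consumer side (ECS worker): group drained messages into fixed-size
// batches before writing to the datastore, amortizing per-insert overhead.
function toBatches<T>(events: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < events.length; i += size) {
    batches.push(events.slice(i, i + size));
  }
  return batches;
}
```

Batching on the consumer is what turns a flood of individual Worker invocations into a small number of bulk writes; if the consumer or datastore is down, messages simply accumulate in the Kafka topic instead of being lost.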