Lessons Learned from Running Apache Kafka at Scale at Pinterest
Blog post from Confluent
Apache Kafka plays a critical role in Pinterest's data transportation layer, and the data flow it handles has grown rapidly as the platform's user base and content have expanded. This growth created operational challenges, including performance problems on magnetic disks, which Pinterest resolved by switching to SSDs for higher IOPS and lower latency. To optimize Kafka's performance, Pinterest implemented dynamic rebalancing strategies and message format conversions; to reduce AWS costs, it adopted compression and rack-aware data transfer. The introduction of brokersets improved topic scaling and partition placement, which limits the operational impact of traffic spikes. To manage its Kafka clusters more efficiently, Pinterest developed Orion, a unified management tool whose features increase automation and give operators better visibility into and control over cluster operations. Recent upgrades have focused on keeping broker versions and log message formats consistent, with careful preparation and client compatibility checks to minimize disruption. Looking ahead, Pinterest is working on better client interoperability, more efficient scaling, and a more reliable Kafka infrastructure.
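As a rough illustration of the cost-control measures mentioned above, the sketch below builds client settings using standard Kafka configuration keys: `compression.type` on the producer and `client.rack` on the consumer. The helper names, broker address, group id, rack label, and the choice of `lz4` are assumptions made for this example; the post does not state which settings Pinterest actually uses.

```python
# Illustrative only: the values below are assumptions, not Pinterest's settings.

# Codecs accepted by Kafka's producer-side `compression.type` setting.
VALID_COMPRESSION_TYPES = {"none", "gzip", "snappy", "lz4", "zstd"}


def producer_config(bootstrap_servers: str, compression: str = "lz4") -> dict:
    """Return producer settings with batch compression enabled.

    Compressing batches on the producer reduces the bytes sent over the
    network and stored on broker disks.
    """
    if compression not in VALID_COMPRESSION_TYPES:
        raise ValueError(f"unsupported compression.type: {compression}")
    return {
        "bootstrap.servers": bootstrap_servers,
        "compression.type": compression,
    }


def consumer_config(bootstrap_servers: str, group_id: str, rack: str) -> dict:
    """Return consumer settings that advertise the client's rack.

    `client.rack` lets a consumer fetch from a replica in its own rack
    (on AWS, typically an availability zone). It only takes effect when
    the brokers set `broker.rack` and use a rack-aware replica selector.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "client.rack": rack,
    }


if __name__ == "__main__":
    # Hypothetical broker address, group id, and rack label.
    print(producer_config("broker-1:9092"))
    print(consumer_config("broker-1:9092", "example-group", "us-east-1a"))
```

These dictionaries can be passed to a Kafka client library that accepts property-style configuration. Keeping consumer reads within one rack avoids cross-zone transfer, which is typically billed separately on AWS.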