
Scaling Kafka at Honeycomb

Blog post from Honeycomb

Post Details
Author: Liz Fong-Jones
Word Count: 2,924
Language: English
Summary

Honeycomb uses Apache Kafka to buffer telemetry data in its observability pipeline, with a focus on durability, reliability, and operability. Over the years, the company has repeatedly tuned its Kafka infrastructure, moving from c5.xlarge instances to AWS Graviton2-powered instances and adopting Confluent's tiered storage for cost efficiency and scalability. These changes were driven by the need to preserve a 24- to 48-hour data buffer and to recover quickly from system failures. While experimenting with various configurations, including AWS's gp3 EBS storage and Graviton2 instances, Honeycomb struggled to achieve stability because of unforeseen saturation and reliability issues. Eventually, it settled on im4gn.4xlarge instances for its Kafka clusters, which offer a balanced ratio of compute, storage, and network resources and support Honeycomb's rapid growth while reducing total cost of ownership. Honeycomb emphasizes the importance of leveraging existing expertise and infrastructure to manage Kafka clusters effectively. Its cost per megabyte of data throughput fell significantly even as data volume increased substantially.