Distributed, Real-time Joins and Aggregations on User Activity Events using Kafka Streams

Company

Confluent

Date Published

June 23, 2016

Author

Michael Noll, Victoria Xia, Wade Waldron

Word count

2349

Language

English

Hacker News points

None

URL

www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events-using-kafka-streams

Summary

This blog post uses Kafka Streams to build a distributed, real-time streaming application that enriches an incoming stream of user click events with the latest geo-region information for each user and then computes aggregations on the enriched stream. The authors argue that the traditional approach to this use case, querying an external database for each incoming event, is problematic at scale: every lookup incurs a network round-trip, and the availability of the application becomes coupled to that of the database. Instead, they introduce the concept of stream-table duality and use Kafka Streams' built-in KTable abstraction, which is backed by local state stores. This allows fast local table lookups without network round-trips and decouples the availability of the stream processing application from that of an external database. The authors demonstrate the implementation with a KStream-KTable join that enriches the user click events with geo-region side data, followed by an aggregation over the enriched stream.
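
The enrich-then-aggregate pattern described above can be illustrated with a minimal plain-Java sketch. It deliberately does not use the Kafka Streams API: two in-memory maps stand in for the KTable of user regions and for the aggregation result, and the class name, method names, sample users, and the "UNKNOWN" fallback region are all assumptions made for illustration, not code from the blog post.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a stream-table join followed by an aggregation.
// Plain Java only; the maps stand in for Kafka Streams state stores.
public class ClicksPerRegionSketch {

    // "Table" side: the latest known region per user, updated by a changelog stream.
    private final Map<String, String> regionByUser = new HashMap<>();

    // Aggregation result: total clicks per region (sorted by region name).
    private final Map<String, Long> clicksByRegion = new TreeMap<>();

    // A region update overwrites the previous value for that user,
    // which is how a changelog stream materializes into a table.
    public void onRegionUpdate(String user, String region) {
        regionByUser.put(user, region);
    }

    // A click event is enriched by a local lookup (no network round-trip)
    // and then added to the running total of the resulting region.
    public void onClick(String user, long clicks) {
        // Assumption: users without a known region are counted under "UNKNOWN".
        String region = regionByUser.getOrDefault(user, "UNKNOWN");
        clicksByRegion.merge(region, clicks, Long::sum);
    }

    public Map<String, Long> clicksByRegion() {
        return clicksByRegion;
    }

    public static void main(String[] args) {
        ClicksPerRegionSketch app = new ClicksPerRegionSketch();
        app.onRegionUpdate("alice", "asia");
        app.onClick("alice", 13L);            // counted under "asia"
        app.onRegionUpdate("alice", "europe");
        app.onClick("alice", 40L);            // counted under "europe"
        app.onClick("chao", 25L);             // no region known yet
        System.out.println(app.clicksByRegion());
    }
}
```

In this sketch each click is attributed to the region that is current when the click arrives, so the second batch of clicks from the sample user "alice" lands in a different region than the first. In Kafka Streams the same roles would be played by a KStream of clicks, a KTable of regions, and a grouped aggregation, with the table state kept in fault-tolerant local state stores rather than plain maps.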