Building a Machine Learning Logging Pipeline with Kafka Streams at Twitter

Post Details

Company

Confluent

Date Published

Sept. 25, 2020

Author

Victoria Xia, Peilin Yang, Wade Waldron

Word Count

1,630

Language

English

Hacker News Points

-

Source URL

www.confluent.io/blog/how-twitter-built-a-machine-learning-pipeline-with-kafka

Summary

Twitter has revamped its recommendation systems by building a new streaming data logging pipeline for its home timeline prediction system on Apache Kafka® and Kafka Streams, replacing an older offline batch system. The new pipeline handles billions of tweets daily and cuts pipeline latency from seven days to one day, which improves both model quality and engineering efficiency. Central to the system is a customized left join in Kafka Streams that matches features with labels for machine learning models, so that Twitter can keep its models up to date as user behavior and trends change. The blog post walks through this customization and the challenges it raised, such as handling consumer lag and ensuring data quality. It also acknowledges the team members who contributed and names cooperative rebalancing as a potential future enhancement to the pipeline's performance.
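
To make the feature-label left join concrete, here is a minimal in-memory sketch in plain Python. It is not Twitter's implementation and does not use the Kafka Streams API. The `Event` type, the `windowed_left_join` function, the join key, and the window length are all hypothetical. The sketch also assumes one particular semantics: every feature record is emitted exactly once, either paired with a label that arrives inside the window or paired with `None` once the window has passed.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple


@dataclass
class Event:
    key: str        # hypothetical join key, e.g. a user/tweet identifier
    timestamp: int  # event time in seconds
    side: str       # "feature" (left stream) or "label" (right stream)
    value: Any


def windowed_left_join(
    events: Iterable[Event], window_seconds: int
) -> List[Tuple[str, Any, Optional[Any]]]:
    """Join each feature with the first label for the same key that arrives
    within window_seconds; emit (key, features, None) if no label arrives.

    Assumption for this sketch: at most one feature record per key is
    pending at any time.
    """
    pending: Dict[str, Event] = {}  # features still waiting for a label
    output: List[Tuple[str, Any, Optional[Any]]] = []

    for event in sorted(events, key=lambda e: e.timestamp):
        # Emit features whose join window closed before this event.
        expired = [
            key for key, feature in pending.items()
            if event.timestamp - feature.timestamp > window_seconds
        ]
        for key in expired:
            output.append((key, pending.pop(key).value, None))

        if event.side == "feature":
            pending[event.key] = event
        elif event.key in pending:
            # Label matched a waiting feature: emit the joined record.
            output.append((event.key, pending.pop(event.key).value, event.value))
        # A label with no waiting feature is dropped (left-join semantics).

    # End of input: flush features that never received a label.
    for key, feature in pending.items():
        output.append((key, feature.value, None))
    return output


if __name__ == "__main__":
    sample = [
        Event("a", 0, "feature", {"x": 1}),
        Event("b", 1, "feature", {"x": 2}),
        Event("a", 5, "label", "click"),
        Event("c", 6, "label", "click"),   # no feature for "c": dropped
        Event("d", 100, "feature", {"x": 3}),
    ]
    for row in windowed_left_join(sample, window_seconds=10):
        print(row)
```

In this sketch, unmatched features are held back until their window closes, so each feature produces a single training example. A production stream processor would also have to handle out-of-order events, state persistence, and repartitioning, none of which is modeled here.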