Company
Date Published
Author
Michael Noll, Eno Thereska, Matthias J. Sax, Victoria Xia, Wade Waldron
Word count
1812
Language
English
Hacker News points
2

Summary

The Google Dataflow team has developed a model for handling time in stream processing that recognizes the challenge of globally ordering data arrival, leading to the integration of a system for coping with this issue into the Beam API. This approach is widely adopted by other stream processing systems. A key observation is that most computations can be handled using tables, which represent immutable entities, but many organizations have mutable databases at their core. To address this, a table representation for data stores is introduced, allowing streaming apps to combine pure events with data from these sources. This enables the generalization of the Dataflow model for handling windowed computation, where computations are represented as tables with aggregation IDs and window times. The key benefit of this approach is that it allows for semantic processing of output streams, enabling the modeling of windowed computation without relying on complex watermark systems. Additionally, the frequency at which output is emitted can be tuned to balance update lag and output volume, offering a practical trade-off between these non-functional parameters.