
Kafka Is Not a Database

What's this blog post about?

Apache Kafka is an open-source message broker that has gained popularity for its ability to scale to large numbers of messages. It is often used to decouple producers and consumers of data; Fivetran, for example, uses it to buffer customer-generated webhooks before loading them into a data warehouse. However, some proponents argue that Kafka can replace traditional relational databases as the definitive record of events. This architecture, known as "turning the database inside out," involves appending events to Kafka and reading from downstream views that represent the present state.

In theory this architecture can support both reads and writes, but doing so means confronting every hard problem that traditional database management systems (DBMSs) have already faced. The result is dirty reads, phantom reads, write skew, and the other symptoms of a hastily implemented DBMS. The fundamental issue with using Kafka as the primary data store is that it provides no isolation. Isolation means that all transactions appear to occur along a single consistent history. Without it, anomalies arise in scenarios such as inventory management for an online store, where two users can both purchase the last unit of an item because each checked availability against an outdated inventory reading.

To avoid these problems, Kafka can be used alongside a traditional database. OLTP databases are good at admission control: deciding which events to accept and ensuring that only consistent streams of events are emitted. According to the post, they can also handle millions of transactions per second. Using change-data-capture (CDC) on an OLTP database generates the event stream while keeping a clear operational story for recovery scenarios. In conclusion, real-time streaming message brokers like Kafka are useful tools for managing high-velocity data, but traditional DBMSs are still needed for proper isolation and transaction management. The best approach is to use OLTP databases for admission control, CDC for event generation, and downstream copies of the data modeled as materialized views.
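The inventory anomaly and the admission-control remedy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the original post: the in-memory list stands in for a Kafka topic, SQLite stands in for an OLTP database, and the `outbox` table is a hypothetical stand-in for the change stream a CDC tool would read. The names `buy_via_log` and `buy_via_db` are invented for this example.

```python
import sqlite3


def buy_via_log(log, observed_stock, user):
    """Append a purchase event if the (possibly stale) view shows stock."""
    if observed_stock > 0:
        log.append({"type": "purchase", "user": user})
        return True
    return False


# Without isolation: both buyers read the downstream view before either
# purchase event has been applied to it, so both see one unit in stock.
log = []            # stand-in for an append-only Kafka topic
view_stock = 1      # stand-in for the downstream "present state" view
snapshot_alice = view_stock
snapshot_bob = view_stock
buy_via_log(log, snapshot_alice, "alice")
buy_via_log(log, snapshot_bob, "bob")
oversold_count = len(log)  # two purchases recorded for a single unit


def buy_via_db(conn, user):
    """Check and decrement stock atomically, then record the admitted event."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE inventory SET stock = stock - 1 "
            "WHERE item = 'widget' AND stock > 0"
        )
        if cur.rowcount == 0:
            return False  # rejected: nothing left to sell
        # Hypothetical outbox table; a CDC tool would turn these rows
        # into the event stream that downstream consumers read.
        conn.execute(
            "INSERT INTO outbox (event, user) VALUES ('purchase', ?)", (user,)
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER NOT NULL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT, user TEXT)")
conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
conn.commit()

results = [buy_via_db(conn, "alice"), buy_via_db(conn, "bob")]
admitted = conn.execute("SELECT user FROM outbox ORDER BY id").fetchall()
```

In the log-only half, nothing prevents both events from being appended. In the database half, the conditional `UPDATE` performs the stock check and the decrement as one atomic step, so only the first purchase is admitted and only that event reaches the outbox.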

Company
Fivetran

Date published
Dec. 8, 2020

Author(s)
Arjun Narayan

Word count
969

Hacker News points
9

Language
English


By Matt Makai. 2021-2024.