Unlocking Idempotency with Retroactive Tombstones

Company

WarpStream

Date Published

Nov. 18, 2023

Author

Richard Artoul

Word count

2390

Language

English

Hacker News points

URL

www.warpstream.com/blog/unlocking-idempotency-with-retroactive-tombstones

Summary

WarpStream is an Apache Kafka protocol-compatible data streaming system built on top of object storage, with zero local disks and no inter-zone bandwidth costs. It separates data from metadata, which allows a massively parallel write engine that avoids synchronization and serialization issues. A metadata store tracks batch sequence IDs, enabling idempotent producer functionality that ensures duplicate batches are dropped before being written to immutable segment files in object storage. The same separation of data and metadata also enables "retroactive tombstoning", which identifies and drops duplicate batches after they have been written. Implementing idempotency introduced a performance bottleneck, because compaction is needed to merge smaller batches into larger ones; this was addressed by modifying the file cache interface to support reading batches in a single RPC.
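The retroactive tombstoning idea in the summary can be illustrated with a minimal sketch. It assumes that batch bytes are already in an immutable file by the time the metadata store sees them, so a duplicate can only be marked as dead in metadata and skipped by readers. All names here (`BatchRef`, `MetadataStore`, `commit`, `visible_batches`) are hypothetical and are not taken from WarpStream's code, and the sequence check is a simplified form of Kafka-style idempotency (one expected sequence number per producer and partition, no producer epochs).

```python
from dataclasses import dataclass


@dataclass
class BatchRef:
    """Metadata pointing at a batch inside an already-written, immutable file."""
    file_id: str
    producer_id: int
    partition: int
    base_sequence: int
    record_count: int
    tombstoned: bool = False


class MetadataStore:
    """Hypothetical in-memory metadata store (illustration only)."""

    def __init__(self):
        # (producer_id, partition) -> next sequence number we expect to see.
        self._next_seq = {}
        self.batches = []

    def commit(self, batch: BatchRef) -> bool:
        """Record a batch; return True if it is live, False if tombstoned."""
        key = (batch.producer_id, batch.partition)
        expected = self._next_seq.get(key, 0)
        if batch.base_sequence > expected:
            # A gap means an earlier batch was lost; reject rather than guess.
            raise ValueError("out-of-order sequence number")
        if batch.base_sequence < expected:
            # Duplicate (e.g. a producer retry). The bytes are already in the
            # file and cannot be removed, so mark the batch dead in metadata.
            batch.tombstoned = True
        else:
            self._next_seq[key] = expected + batch.record_count
        self.batches.append(batch)
        return not batch.tombstoned

    def visible_batches(self, partition: int):
        """Batches a reader (or compaction) should keep for this partition."""
        return [b for b in self.batches
                if b.partition == partition and not b.tombstoned]
```

In this sketch, committing the same batch twice (for example, once from `file-a` and again from a retry that landed in `file-b`) leaves both copies on storage, but only the first is returned by `visible_batches`; a later compaction pass could then physically drop the tombstoned copy.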