Company:
Date Published:
Author: Rafer Hazen
Word count: 2012
Language: English
Hacker News points: 17

Summary

We built a scalable telemetry pipeline using Apache Kafka and a sharded PlanetScale database, which stores time series data for Insights, our built-in query performance tool. To handle both query-pattern-level statistics and individual slow query events, we used Prometheus for database-level aggregations and MySQL for the pattern-level statistics and slow query events, leveraging Vitess to proxy query traffic and instrumenting our internal build. Kafka consumers read data from Kafka topics and write it to MySQL, where it is stored in sharded tables keyed on a database ID. To collect and store percentile data efficiently, we used DDSketches, a probabilistic data structure that produces fast and accurate quantile estimates. We chose MySQL because our primary dimension has high cardinality, the set of dimensions is well known, we need to filter datasets, and a natural shard key exists, avoiding the need for an additional storage system. The pipeline has scaled linearly with the number of shards, and we've successfully run it on fairly small machines, leaving room to scale up as needed.
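The DDSketch approach mentioned above can be illustrated with a minimal sketch in Python. This is not the implementation the pipeline uses; it is a simplified DDSketch-style structure (positive values only, no bucket-count limit) showing the core idea: map each value into a logarithmic bucket so that any quantile estimate carries at most a fixed relative error `alpha`, and merge sketches by summing bucket counts.

```python
import math
from collections import defaultdict

class DDSketch:
    """Simplified DDSketch-style quantile sketch for positive values.

    Values land in logarithmic buckets, so quantile estimates are
    within a relative error of `alpha` of the true value.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = defaultdict(int)  # bucket index -> count
        self.count = 0

    def add(self, value):
        # Bucket i covers the interval (gamma^(i-1), gamma^i]
        i = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[i] += 1
        self.count += 1

    def quantile(self, q):
        # Walk buckets in index order until the cumulative count
        # passes the target rank, then report that bucket's value.
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                # Representative value with bounded relative error
                return 2 * self.gamma ** i / (self.gamma + 1)
        raise ValueError("empty sketch")

    def merge(self, other):
        # Sketches built with the same alpha merge losslessly
        # by summing per-bucket counts -- this is what makes the
        # structure convenient for a Kafka-fed pipeline, where
        # partial sketches can be combined downstream.
        for i, c in other.buckets.items():
            self.buckets[i] += c
        self.count += other.count
```

For example, after adding query latencies of 1 through 1000 ms, `quantile(0.5)` returns a value within about 1% of the true median, while storing only a few hundred integer counters rather than every observation.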