Why Database Sizing is Hard
Blog post from ScyllaDB
Database sizing, particularly for ScyllaDB, means estimating how large a cluster must be to serve a given dataset and workload. It is inherently hard: design is iterative, and every choice trades off simplicity, cost, and accuracy. Sizing requires understanding the performance implications of configuration decisions such as replication factor and consistency level, and balancing them against economic and operational constraints.

The post walks through the main inputs to a performance model: how the data model, query types, and maintenance operations shape the workload estimate; why disk operations are hard to predict given ScyllaDB's use of Log-Structured Merge (LSM) trees; how consistency levels change the work done by each read and write; and the strategic choice between scaling out with more nodes and scaling up with larger ones. It emphasizes treating initial estimates as a baseline to revisit once real usage data is available, and notes that understanding these dynamics is key to getting the most performance and scalability out of ScyllaDB.
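To give a concrete flavor of the first-pass estimate the post argues you should iterate on, here is a minimal Python sketch of storage-driven node counting and of how consistency level changes the number of replicas touched per request. The function names, the 2 TB-per-node figure, and the 50% utilization target are illustrative assumptions, not values from the post; the QUORUM arithmetic (floor(RF/2) + 1 replicas) is standard ScyllaDB/Cassandra semantics.

```python
# Back-of-the-envelope sizing sketch. All constants (2 TB usable storage per
# node, 50% disk utilization target) are illustrative assumptions.
import math


def replicas_per_request(replication_factor: int, consistency_level: str) -> int:
    """Replicas that must acknowledge a single read or write."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # e.g. RF=3 -> 2 replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def estimate_node_count(raw_data_tb: float,
                        replication_factor: int = 3,
                        node_storage_tb: float = 2.0,
                        utilization_target: float = 0.5) -> int:
    """Naive minimum node count for storage alone.

    utilization_target is the fraction of each node's disk the dataset may
    occupy; the remainder is headroom for compaction and growth.
    """
    total_on_disk_tb = raw_data_tb * replication_factor
    usable_per_node_tb = node_storage_tb * utilization_target
    return math.ceil(total_on_disk_tb / usable_per_node_tb)


if __name__ == "__main__":
    # 10 TB of raw data at RF=3 -> 30 TB on disk; 1 TB usable per node -> 30 nodes.
    print("nodes (storage only):", estimate_node_count(10))
    # At QUORUM with RF=3, every read and write touches 2 replicas,
    # roughly doubling per-request work compared with CL=ONE.
    print("replicas per request:", replicas_per_request(3, "QUORUM"))
```

The storage-only floor is the easy part; throughput, latency targets, and maintenance traffic usually dominate, which is why the post treats any such estimate as a starting point for iteration rather than an answer.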