Company
Date Published
Author
Dale McDiarmid & Tom Schreiber
Word count
3614
Language
English
Hacker News points
None

Summary

Loading a real-world dataset into ClickHouse involves sampling, preparing, and enriching the data, then optimizing the schema for specific queries. The walkthrough uses the NOAA Global Historical Climatology Network dataset, which contains 1 billion rows of climate data from 1900 to 2022. The dataset was downloaded in compressed format, filtered for relevant measurements, and loaded into a ClickHouse instance. Using the `clickhouse-local` tool, the data was enriched with additional information such as country names, latitudes, and longitudes. A dictionary-based lookup was then implemented to efficiently search for weather events within specific geographical regions, reducing query execution time by orders of magnitude.
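
The filtering step can be illustrated with a minimal Python sketch. This is not the article's own method, which the summary does not spell out. It assumes the comma-separated per-year layout in which each row carries a station ID, date, element code, value, measurement flag, and quality flag, in that order. The set of "relevant" elements and the function name `filter_measurements` are hypothetical choices made for illustration.

```python
import csv
import io

# Assumption: the measurement types considered "relevant". The summary does
# not list them, so this set is purely illustrative.
RELEVANT_ELEMENTS = {"PRCP", "SNOW", "SNWD", "TMAX", "TMIN", "TAVG"}


def filter_measurements(lines, elements=RELEVANT_ELEMENTS):
    """Yield (station_id, date, element, value) for rows worth keeping.

    Assumed column order: station ID, date, element, value,
    measurement flag, quality flag, followed by any further columns.
    """
    for row in csv.reader(lines):
        if len(row) < 6:
            continue  # skip malformed or truncated rows
        station_id, date, element, value, _m_flag, q_flag = row[:6]
        if element not in elements:
            continue  # not one of the measurements we care about
        if q_flag:
            continue  # a non-empty quality flag marks a failed quality check
        yield station_id, date, element, int(value)


# Three made-up rows in the assumed layout.
sample = io.StringIO(
    "AE000041196,20220101,TMAX,264,,,S,\n"
    "AE000041196,20220101,WESD,0,,,S,\n"
    "AE000041196,20220102,TMIN,150,,I,S,\n"
)
rows = list(filter_measurements(sample))
print(rows)
```

Only the first sample row survives: the second has an element outside the chosen set, and the third carries a quality flag. Because the function is a generator, it can process a large file one row at a time instead of reading it all into memory.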