
How to Extract the First Significant Subdomain from URLs in ClickHouse®

Blog post from Tinybird

Post Details
Company: Tinybird
Author: Cameron Archer
Word Count: 2,844
Language: English
Summary

The cutToFirstSignificantSubdomain() function in ClickHouse® normalizes domains for web traffic analysis, consolidating metrics for URLs that share the same core domain but differ in their subdomains. It extracts the registrable domain while respecting multi-part top-level domains and drops leading subdomains such as 'www' or 'api', so that news.bbc.co.uk becomes bbc.co.uk. According to the post, it returns an empty string for edge cases such as IPv4 and IPv6 addresses, and it processes large datasets faster and with less memory than regex-based alternatives.

The function is aimed at developers building real-time APIs and dashboards: grouping metrics by organization rather than by individual subdomain simplifies domain-based analytics, content categorization, and deduplication. The implementation described relies on materialized views that normalize data automatically as it is inserted, giving analytics queries a consistent foundation without complex regex patterns or public suffix list maintenance.

In e-commerce analytics, the function helps track conversion funnels across domains and subdomains, keeping domain normalization consistent and supporting conversion-rate, revenue, and time-series analysis. Managed ClickHouse® services such as Tinybird abstract away infrastructure and make it easier to build and deploy real-time analytics APIs that use this function.
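
The materialized-view approach summarized above might look roughly like the following. This is a minimal sketch, not the post's actual code: the table names (page_views, page_views_by_domain), the view name, and the columns are hypothetical, chosen only to illustrate normalizing URLs at insert time.

```sql
-- Hypothetical raw events table; names and columns are assumptions.
CREATE TABLE page_views
(
    ts  DateTime,
    url String
)
ENGINE = MergeTree
ORDER BY ts;

-- Hypothetical target table holding daily view counts per normalized domain.
CREATE TABLE page_views_by_domain
(
    day    Date,
    domain String,
    views  UInt64
)
ENGINE = SummingMergeTree
ORDER BY (day, domain);

-- The materialized view runs on every insert into page_views and
-- stores the normalized domain instead of the full URL.
CREATE MATERIALIZED VIEW page_views_by_domain_mv
TO page_views_by_domain
AS
SELECT
    toDate(ts) AS day,
    cutToFirstSignificantSubdomain(url) AS domain,
    count() AS views
FROM page_views
GROUP BY day, domain;

-- Ad hoc check of the normalization on a single URL
-- (the summary's example maps news.bbc.co.uk to bbc.co.uk).
SELECT cutToFirstSignificantSubdomain('https://news.bbc.co.uk/sport') AS domain;

-- Query the rollup. sum() is still needed because SummingMergeTree
-- collapses rows only during background merges. The filter skips rows
-- for which no domain was extracted.
SELECT domain, sum(views) AS views
FROM page_views_by_domain
WHERE domain != ''
GROUP BY domain
ORDER BY views DESC;
```

Writing the normalized domain into a rollup table keeps the function call out of the read path, which is one plausible reason to prefer a materialized view over calling the function in every dashboard query.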