How to extract top-level domains from URLs in ClickHouse®
Blog post from Tinybird
ClickHouse provides the topLevelDomain() function to extract the top-level domain (TLD) from URL strings. This is useful for analyzing web traffic, building domain classification systems, and flagging suspicious domain extensions as a security measure. The function returns only the domain extension, such as com, org, or uk, which is the rightmost part of the domain name. It accepts various URL formats, including those with protocol prefixes and subdomains, and returns an empty string for malformed URLs rather than throwing an error. The guide explains how to combine topLevelDomain() with other ClickHouse functions such as domain() and cutWWW() for more comprehensive domain analysis, and it offers performance optimization tips for processing large URL datasets. It also introduces Tinybird, a managed ClickHouse service for building APIs over analytics workloads, which abstracts infrastructure management so that developers can deploy web analytics APIs more easily.
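
As a rough illustration of how topLevelDomain() can sit alongside domain() and cutWWW(), the query below is a minimal sketch rather than the guide's own example. The sample URLs are made up for illustration, and the inline arrayJoin() subquery stands in for a real table of URLs, so the query can be run without any setup.

```sql
-- Sketch only: the URLs below are invented sample data, not from the guide.
SELECT
    url,
    domain(url)          AS host,             -- host name part of the URL
    domain(cutWWW(url))  AS host_without_www, -- host after removing a leading "www."
    topLevelDomain(url)  AS tld               -- rightmost part of the domain
FROM
(
    -- arrayJoin() expands the array into one row per URL
    SELECT arrayJoin([
        'https://www.example.com/pricing',
        'http://blog.example.org/post?id=1',
        'https://shop.example.co.uk/cart',
        'not a url'
    ]) AS url
);
```

The last input is deliberately not a URL, to show the behavior described above for malformed input. On real data, the same tld expression could be used in a GROUP BY to count traffic per domain extension.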