How to extract URL paths without query strings in ClickHouse ®
Blog post from Tinybird
Analyzing web traffic and API logs in ClickHouse can be complicated by URLs with query parameters, which fragment identical page views into separate entries, skewing analytics and making trend analysis challenging. To address this, ClickHouse offers built-in functions such as cutQueryString(), path(), and pathFull() to clean URLs by removing query parameters, preserving the path, or retaining the full URL with query strings. The cutQueryString() function, recommended for modern ClickHouse versions, quickly strips query strings, while older versions may rely on regex alternatives. These functions improve URL grouping and analysis, particularly when used with normalization techniques like canonicalizing trailing slashes or converting paths to lowercase. Performance considerations favor using built-in functions over regex for large datasets, and URL cleaning can be applied during data ingestion or on-demand in queries, depending on specific requirements. Tinybird further enhances this process by enabling the deployment of web analytics queries as APIs, allowing for efficient and real-time data analysis without exposing the database.