Categorize your logs with Elasticsearch categorize_text aggregation
Blog post from Elastic
Elasticsearch's categorize_text aggregation helps with log exploration by identifying the most common log patterns at query time, which shortens the time needed to extract information from large volumes of data. The capability is aimed in particular at system administrators and Site Reliability Engineers (SREs). It reads text from the document source and tokenizes it with a custom tokenizer, ml_standard, which is tailored for machine-generated text. The tokens are then clustered with a modified version of the DRAIN algorithm, which builds each category definition from the tokens that stay consistent across messages and drops the highly variable ones. Because the feature is part of Elasticsearch's aggregation framework, it can be combined with other aggregations and visualized in Kibana, so users can identify and compare error categories over time, visualize category trends, and explore term prevalence within categories. It was released as a technical preview in version 7.16, and Elastic invites user feedback through its community forums and Slack channels.
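As a rough illustration of how the aggregation fits into an ordinary search request, the sketch below builds a request body in Python. It is a minimal example under stated assumptions, not the query used in the blog post: the field names `message` and `@timestamp`, the aggregation names `log_categories` and `over_time`, and the helper `build_categorize_query` are all hypothetical. The `date_histogram` sub-aggregation is one possible way to follow category counts over time.

```python
import json


def build_categorize_query(text_field="message", time_field="@timestamp",
                           num_categories=10, interval="1h"):
    """Build a search body that groups log lines into categories.

    All field names and defaults here are illustrative assumptions;
    adjust them to match your own index mapping.
    """
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "log_categories": {  # hypothetical aggregation name
                "categorize_text": {
                    "field": text_field,     # text field to categorize
                    "size": num_categories,  # number of categories to return
                },
                "aggs": {
                    # Per-category counts in fixed time buckets, so that
                    # category trends can be compared over time.
                    "over_time": {
                        "date_histogram": {
                            "field": time_field,
                            "fixed_interval": interval,
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_categorize_query(), indent=2))
```

The printed JSON could be sent as the body of a `_search` request against an index of your choice, using Kibana Dev Tools or any HTTP client. Since the feature was released as a technical preview, parameter names and behavior may differ between versions, so check the reference documentation for the release you run.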