Categorizing Non-English Log Messages in Machine Learning for Elasticsearch

Company

Elastic

Date Published

Feb. 21, 2018

Author

David Roberts

Word count

1274

Language

Hacker News points

None

URL

www.elastic.co/blog/categorizing-non-english-log-messages-in-machine-learning-for-elasticsearch

Summary

Starting with version 6.2, machine learning in the Elastic Stack categorizes non-English log messages more accurately. Earlier versions assumed that log messages were in English and ignored non-English characters, which led to inaccurate categorization. The updated tokenizer recognizes words from all alphabets, so logs written in other languages are grouped more reliably. The categorization analyzer can also be customized, including its token filters and tokenizer, to suit languages whose scripts are structured differently, such as Chinese or Japanese. One limitation remains: the dictionary used for token weighting cannot be adjusted for non-English languages, which can reduce categorization accuracy. Future updates aim to address this.
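
To make the tokenizer change concrete, here is a minimal Python sketch of why ignoring non-English characters hurts categorization. It is an illustration only, not Elastic's tokenizer: the two regular expressions, the function names, and the sample Russian messages are all assumptions chosen for the example.

```python
import re

# Hypothetical stand-in for a tokenizer that assumes English text:
# only ASCII letters can start or continue a token.
ASCII_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Hypothetical stand-in for a tokenizer that accepts every alphabet:
# in Python 3, \w on str patterns matches letters from any script.
# [^\W\d] means "a word character that is not a digit".
UNICODE_TOKEN = re.compile(r"[^\W\d]\w*")


def tokenize_ascii(message: str) -> list[str]:
    """Return tokens, silently dropping non-ASCII words."""
    return ASCII_TOKEN.findall(message)


def tokenize_unicode(message: str) -> list[str]:
    """Return tokens, keeping words from any alphabet."""
    return UNICODE_TOKEN.findall(message)


# Two unrelated messages (illustrative data, not real logs):
# "Error connecting to server db01" and "User logged in to system db01".
connection_error = "Ошибка подключения к серверу db01"
user_login = "Пользователь вошёл в систему db01"

# With the ASCII-only tokenizer both messages reduce to the same tokens,
# so a token-based categorizer could not tell them apart.
same_under_ascii = tokenize_ascii(connection_error) == tokenize_ascii(user_login)

# With the Unicode-aware tokenizer the token lists differ.
same_under_unicode = tokenize_unicode(connection_error) == tokenize_unicode(user_login)

print(tokenize_ascii(connection_error))
print(tokenize_unicode(connection_error))
print(same_under_ascii, same_under_unicode)
```

This pattern-based approach still depends on spaces or punctuation between words, so it would not segment Chinese or Japanese text. That is the case where the summary's point about swapping in a different tokenizer applies.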