Company: Elastic
Date Published: -
Author: -
Word count: 1948
Language: -
Hacker News points: None

Summary

Elastic's machine learning capabilities are increasingly used in security and operational analytics projects, with a focus on anomaly detection in time series data. Sizing hardware and clusters for machine learning in Elasticsearch builds on general Elasticsearch sizing knowledge and depends on variables such as data volume, the number of jobs, and job complexity. Sizing is typically an iterative exercise, best started in a lab environment with real data rather than synthetic data.

Single-metric jobs generally consume fewer resources than multivariate jobs, which need more memory because of the number of entities and variables they model. In production, dedicated machine learning nodes are recommended for optimal performance, with guidelines suggesting at least 4 cores and 64GB RAM per node. Machine learning jobs affect data nodes mainly through data retrieval and anomaly score re-normalization, and production environments benefit from multiple dedicated ML nodes for high availability and job distribution. Key settings in the elasticsearch.yml file control which nodes run ML and how many jobs they accept, so the cluster can keep up with the demands of real-time and historical analysis.
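To make the elasticsearch.yml settings mentioned above concrete, here is a minimal sketch of how a dedicated ML node might be configured. The setting names are real Elasticsearch settings, but the values shown and their defaults vary by version, and the overall layout is an illustrative assumption rather than the article's own configuration:

    # elasticsearch.yml (illustrative; legacy pre-7.9 role syntax,
    # newer releases express the same intent with node.roles: [ ml ])
    node.ml: true        # allow ML jobs to run on this node
    node.data: false     # keep data off the node so analysis does not compete with indexing
    node.master: false   # not master-eligible
    node.ingest: false   # no ingest pipelines

    xpack.ml.enabled: true                       # enable the ML feature
    xpack.ml.max_open_jobs: 20                   # cap on concurrently open jobs per node (default differs by version)
    xpack.ml.max_machine_memory_percent: 30      # share of machine memory ML may use for job models
    xpack.ml.node_concurrent_job_allocations: 2  # jobs allowed to be in the "opening" state at once

Conversely, setting node.ml: false (or omitting ml from node.roles) on data nodes keeps jobs from being scheduled there, which lines up with the recommendation to dedicate nodes to ML in production.

The resource gap between single-metric and multivariate jobs is also visible in the job definition itself. The request below is a hypothetical example (the job name, field names, and the 1gb memory limit are made up for illustration, and on older releases the API path is prefixed with _xpack):

    PUT _ml/anomaly_detectors/example_response_times
    {
      "description": "Hypothetical job: mean response time, split per host",
      "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
          {
            "function": "mean",
            "field_name": "response_time",
            "partition_field_name": "host"
          }
        ]
      },
      "data_description": { "time_field": "@timestamp" },
      "analysis_limits": { "model_memory_limit": "1gb" }
    }

Partitioning by host keeps one model per host in memory, which is why a partitioned or multi-metric job generally needs a larger model_memory_limit than a single-metric job that models only one series.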