How to monitor Hadoop metrics

Company

Datadog

Date Published

July 21, 2016

Author

Evan Mouzakitis

Word count

4830

Language

English

Hacker News points

URL

www.datadoghq.com/blog/monitor-hadoop-metrics

Summary

An earlier guide to monitoring Hadoop health and performance introduced Hadoop's architecture and subcomponents. This post goes deeper into each technology and covers the key metrics Hadoop exposes that are worth watching to keep a cluster running well. It stresses treating DataNodes and NodeManagers like cattle, and notes that monitoring HDFS calls for a different mindset than monitoring other systems. Key HDFS metrics include CapacityRemaining, MissingBlocks, VolumeFailuresTotal, NumDeadDataNodes, FilesTotal, TotalLoad, BlockCapacity/BlocksTotal, UnderReplicatedBlocks, and NumStaleDataNodes. Watching them helps maintain cluster availability, prevent data loss, and optimize resource utilization. MapReduce counters such as REDUCE_INPUT_RECORDS, SPILLED_RECORDS, and GC_TIME_MILLIS help identify performance issues and tune application execution. YARN metrics (unhealthyNodes, activeNodes, lostNodes, appsFailed, totalMB/allocatedMB, progress, and containersFailed) and ZooKeeper metrics (zk_followers, zk_avg_latency, and zk_num_alive_connections) give further insight into cluster health and performance.
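As a rough illustration of how the HDFS metrics named above might be checked, the sketch below scans a parsed JSON payload of the kind returned by the NameNode's HTTP `/jmx` endpoint. The function name `check_hdfs_health`, the 10% free-space threshold, and the sample values are assumptions made for this example and do not come from the post. The bean names and attribute placement follow common Hadoop 2.x layouts and may differ by version.

```python
import json

# Illustrative threshold (an assumption, not a Hadoop default).
MIN_FREE_FRACTION = 0.10

# NameNode JMX beans that commonly carry the HDFS metrics listed in the summary.
NAMENODE_BEANS = (
    "Hadoop:service=NameNode,name=FSNamesystem",
    "Hadoop:service=NameNode,name=FSNamesystemState",
)

# Metrics where any value above zero deserves attention.
NONZERO_IS_BAD = (
    "MissingBlocks",
    "VolumeFailuresTotal",
    "NumDeadDataNodes",
    "NumStaleDataNodes",
    "UnderReplicatedBlocks",
)


def check_hdfs_health(jmx_payload):
    """Return a list of warning strings for a parsed /jmx payload (hypothetical helper)."""
    # Merge the attributes of the beans we care about into one dict.
    metrics = {}
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") in NAMENODE_BEANS:
            metrics.update(bean)

    warnings = []
    for key in NONZERO_IS_BAD:
        if metrics.get(key, 0) > 0:
            warnings.append(f"{key}={metrics[key]}")

    # Flag low remaining capacity as a fraction of total capacity.
    total = metrics.get("CapacityTotal", 0)
    remaining = metrics.get("CapacityRemaining", 0)
    if total > 0 and remaining / total < MIN_FREE_FRACTION:
        warnings.append(f"CapacityRemaining below {MIN_FREE_FRACTION:.0%}")
    return warnings


# Made-up sample payload for illustration only; real values come from a live NameNode.
SAMPLE = json.loads("""
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystem",
   "CapacityTotal": 1000, "CapacityRemaining": 50,
   "MissingBlocks": 0, "UnderReplicatedBlocks": 3},
  {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
   "NumDeadDataNodes": 1, "NumStaleDataNodes": 0, "VolumeFailuresTotal": 0}
]}
""")

if __name__ == "__main__":
    for warning in check_hdfs_health(SAMPLE):
        print(warning)
```

In practice the payload would be fetched over HTTP from the NameNode web port (often 50070 on Hadoop 2.x or 9870 on Hadoop 3.x, depending on configuration) and the thresholds tuned to the cluster.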