Key metrics for monitoring Karpenter
Blog post from Datadog
Karpenter provisions nodes just in time and actively consolidates them, so understanding its behavior and performance depends on monitoring a specific set of metrics. Each metric carries a stability level (STABLE, BETA, ALPHA, or DEPRECATED), and together they cover Karpenter's scheduling, disruption, cloud provider interactions, controller internals, and cost optimization. For instance, karpenter_pods_startup_duration_seconds and karpenter_scheduler_scheduling_duration_seconds show how efficiently scaling happens and help locate the source of latency. Similarly, karpenter_voluntary_disruption_eligible_nodes and karpenter_nodeclaims_termination_duration_seconds reveal opportunities for cost savings and highlight challenges in node management. Cloud provider metrics such as karpenter_cloudprovider_errors_total and karpenter_cloudprovider_duration_seconds help identify issues stemming from API failures or latency. Finally, controller metrics, including controller_runtime_reconcile_time_seconds and workqueue_depth, indicate whether Karpenter is keeping up with cluster changes. Correlating these metrics helps users confirm that Karpenter is performing well and staying cost-efficient.
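As a rough illustration of correlating a few of these metrics, the sketch below parses a snippet of Prometheus text-format output and derives a mean pod startup time, a cloud provider error total, and a work queue depth. The sample values, the label names (method, error, name), and the helper functions parse_totals and summarize are all hypothetical. It also assumes the startup duration metric exposes _sum and _count series, as Prometheus summaries and histograms do. A real setup would more likely query a monitoring backend than parse scrape output by hand.

```python
import re
from collections import defaultdict

# Hypothetical scrape output: values and label names are invented for illustration.
SAMPLE = """\
# TYPE karpenter_pods_startup_duration_seconds summary
karpenter_pods_startup_duration_seconds_sum 420.0
karpenter_pods_startup_duration_seconds_count 12
karpenter_cloudprovider_errors_total{method="Create",error="InsufficientCapacity"} 3
karpenter_cloudprovider_errors_total{method="Delete",error="Unknown"} 1
workqueue_depth{name="provisioner"} 7
"""

# Matches `name{labels} value` or `name value`; samples with timestamps are not handled.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')


def parse_totals(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        totals[match.group(1)] += float(match.group(3))
    return dict(totals)


def summarize(totals):
    """Derive a few headline numbers from the per-metric totals."""
    count = totals.get("karpenter_pods_startup_duration_seconds_count", 0.0)
    total = totals.get("karpenter_pods_startup_duration_seconds_sum", 0.0)
    return {
        "mean_pod_startup_seconds": total / count if count else None,
        "cloudprovider_errors": totals.get("karpenter_cloudprovider_errors_total", 0.0),
        "workqueue_depth": totals.get("workqueue_depth", 0.0),
    }


if __name__ == "__main__":
    print(summarize(parse_totals(SAMPLE)))
```

Reading the three numbers side by side is the point: a high mean startup time together with a rising cloud provider error count points toward provider API problems, whereas a high startup time with a deep work queue and few errors suggests the controller itself is falling behind.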