
Monitor and debug Ray workloads with fully persisted Cluster and Actor dashboards on Anyscale

Blog post from Anyscale

Post Details
Company: Anyscale
Author: Carolyn Wang
Word Count: 2,500
Language: English
Summary

Anyscale has introduced fully persisted Cluster and Actor dashboards that extend the Ray Dashboard for monitoring, debugging, and optimizing Ray workloads. The release addresses a limitation of the traditional Ray Dashboard by persisting data beyond cluster shutdown, so post-mortem analysis is possible without the need to maintain infrastructure. The dashboards use the Ray Event Export Framework to stream and store cluster events, giving developers detailed, long-term data for debugging failures, analyzing performance, and comparing workloads. The post walks through a practical example: diagnosing a bottleneck in an audio embedding pipeline. CPU-intensive actors were scheduled concurrently on a node with a limited number of CPU slots, where they competed with the GPU tasks for those slots and made the pipeline inefficient. By exposing actor scheduling and resource allocation, the dashboards made the issue possible to identify and resolve, which underlines the value of observability tools for managing distributed workloads.
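
The summary gives no numbers or code for the audio embedding bottleneck, so the following is only a toy sketch of the kind of contention it describes. It does not use Ray and does not reproduce Ray's scheduler. The `Node` class, the actor names, the node size (8 CPUs, 4 GPUs), and the assumption that each GPU actor also reserves one CPU slot are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one node's free resource slots (not Ray's scheduler)."""
    cpus: int
    gpus: int
    placed: list = field(default_factory=list)

    def try_place(self, name, num_cpus=0, num_gpus=0):
        # An actor is placed only if every resource it requests is free.
        if num_cpus <= self.cpus and num_gpus <= self.gpus:
            self.cpus -= num_cpus
            self.gpus -= num_gpus
            self.placed.append(name)
            return True
        return False


def schedule(node, requests):
    """Place requests in order; return the names that could not be placed."""
    pending = []
    for name, num_cpus, num_gpus in requests:
        if not node.try_place(name, num_cpus, num_gpus):
            pending.append(name)
    return pending


# Hypothetical workload: each GPU actor needs 1 GPU plus 1 CPU slot.
gpu_actors = [(f"embed-{i}", 1, 1) for i in range(4)]

# Case A: eight CPU-only actors arrive first and take every CPU slot.
node_a = Node(cpus=8, gpus=4)
pending_a = schedule(node_a, [(f"preprocess-{i}", 1, 0) for i in range(8)] + gpu_actors)

# Case B: CPU-only actors are capped at four, leaving slots for GPU actors.
node_b = Node(cpus=8, gpus=4)
pending_b = schedule(node_b, [(f"preprocess-{i}", 1, 0) for i in range(4)] + gpu_actors)

print("Case A pending:", pending_a, "| idle GPUs:", node_a.gpus)
print("Case B pending:", pending_b, "| idle GPUs:", node_b.gpus)
```

In case A every GPU actor stays pending while all four GPUs sit idle, which is the sort of pattern a view of actor scheduling and per-node resource allocation would surface. In case B, capping the CPU-only actors leaves enough CPU slots for the GPU actors to be placed.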