Monitor and debug Ray workloads with fully persisted Cluster and Actor dashboards on Anyscale
Blog post from Anyscale
Anyscale has introduced fully persisted Cluster and Actor dashboards that extend the Ray Dashboard for monitoring, debugging, and optimizing Ray workloads. The release addresses a limitation of the traditional Ray Dashboard by persisting data beyond cluster shutdown, so developers can run post-mortem analysis without having to maintain infrastructure. The dashboards use the Ray Event Export Framework to stream and store cluster events, giving developers long-term data for debugging failures, analyzing performance, and comparing workloads. In a practical example, the dashboards helped diagnose a bottleneck in an audio embedding pipeline: CPU-intensive actors were scheduled concurrently on a node that had only limited CPU slots available for GPU tasks, which led to inefficiencies. Visibility into actor scheduling and resource allocation made it possible to identify and resolve the issue, which illustrates the value of observability tools for managing distributed workloads.
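The scheduling bottleneck described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Ray's scheduler and not the pipeline from the blog post: the `Node` class, the `schedule` helper, the actor names, and all resource counts (8 CPUs, 4 GPUs, 2 CPUs per preprocessing actor, 1 CPU plus 1 GPU per embedding actor) are hypothetical. It assumes actors reserve logical resources in the way Ray's `num_cpus` and `num_gpus` options declare them, and shows one way CPU-heavy actors can occupy every CPU slot on a GPU node and leave the GPUs idle.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one node's logical resources (hypothetical numbers)."""
    cpus: float
    gpus: float
    placed: list = field(default_factory=list)

    def try_place(self, name: str, num_cpus: float, num_gpus: float = 0) -> bool:
        """Reserve resources for an actor if the node still has room."""
        if num_cpus > self.cpus or num_gpus > self.gpus:
            return False
        self.cpus -= num_cpus
        self.gpus -= num_gpus
        self.placed.append(name)
        return True


def schedule(requests, cpus=8, gpus=4):
    """Place actors in arrival order; return the node and the actors left pending."""
    node = Node(cpus=cpus, gpus=gpus)
    pending = [name for name, c, g in requests if not node.try_place(name, c, g)]
    return node, pending


# Hypothetical workload: 4 CPU-heavy preprocessing actors (2 CPUs each)
# and 4 GPU embedding actors (1 CPU + 1 GPU each).
cpu_actors = [(f"preprocess-{i}", 2, 0) for i in range(4)]
gpu_actors = [(f"embed-{i}", 1, 1) for i in range(4)]

# CPU actors arrive first and take every CPU slot on the GPU node,
# so no GPU actor can be placed even though all GPUs are free.
node, pending = schedule(cpu_actors + gpu_actors)
print("pending:", pending, "idle GPUs:", node.gpus)

# GPU actors arrive first: every GPU is used, and the CPU actors that
# no longer fit stay pending (in practice they would run on another node).
node2, pending2 = schedule(gpu_actors + cpu_actors)
print("pending:", pending2, "idle GPUs:", node2.gpus)
```

In this model the symptom is pending GPU actors alongside idle GPUs on the same node, which is the kind of mismatch that a view of per-actor placement and resource allocation makes visible.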