Company
Date Published
Author
Alan Guo, Cuong Nguyen, Justin Yu, Matthew Deng, Matthew Owen and Richard Liaw
Word count
1951
Language
English
Hacker News points
None

Summary

The Ray Train Dashboard and Ray Data Dashboard are two new purpose-built observability dashboards that give ML engineers a unified interface for logs and metrics, so they can focus on model training and data processing logic. The Ray Train Dashboard offers four core observability features: training progress, error attribution, logs and metrics, and profiling, which let users inspect distributed training jobs at different levels of detail. It provides a unified experience, rich error context, one-click profiling, and workload-level abstractions that help identify and address performance bottlenecks. The Ray Data Dashboard combines Tree and DAG views for pipeline drilldowns with operation-level metrics and dataset-aware log aggregation, making it easier to find bottlenecks in data pipelines. Both dashboards are actively evolving, with planned enhancements that include automated issue detection, integration with the Ray Train Dashboard, and support for experiment tracking.