Ray on Alibaba Cloud: Building an ML Platform

Company

Anyscale

Date Published

June 12, 2025

Author

Kun Wu (Alibaba Cloud)

Word count

1545

Language

English

Hacker News points

None

URL

www.anyscale.com/blog/ray-on-alibaba-cloud-building-an-ml-platform

Summary

Ray is an open-source distributed computing engine that orchestrates infrastructure for distributed workloads on any accelerator at any scale. It has three layers: Ray Core, the Ray AI Libraries, and Ray Deployment. Ray Core provides a small set of essential primitives (tasks, actors, and objects) for building and scaling distributed applications, so that programming against the Ray Core API feels much like programming on a laptop. The AI libraries (Ray Data, Ray Train, Ray Tune, Ray Serve, and RLlib) cover the ML lifecycle, from data processing through training and tuning to serving. KubeRay is a Kubernetes operator for Ray that manages the lifecycle of Ray clusters and their applications on Kubernetes, letting data scientists and ML scientists focus on machine learning logic while infrastructure engineers focus on Kubernetes. ACK (Alibaba Cloud Container Service for Kubernetes) supports KubeRay as a managed component, offering elastic compute, observability, security, zero operations and maintenance, high-availability deployment, and a resource policy API that orchestrates compute resource types by defining priorities among node preferences. The native Ray Dashboard is available only while a Ray cluster is running; ACK's Ray History Server provides access to dashboards for both active and terminated RayCluster custom resources. With the kube-queue integration and the Ray History Server, users can run Ray workloads in production while using resources efficiently.
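
The idea of orchestrating compute resource types by priority can be illustrated with a small sketch. This is not ACK's resource policy API; it is a plain-Python model written under stated assumptions: the `NodePool` class, the `place_pods` function, the pool names, and the "higher number wins" priority convention are all hypothetical and chosen only for illustration. It shows pods filling the most-preferred pool first and spilling over to lower-priority pools when capacity runs out.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    """A hypothetical pool of nodes of one compute type."""
    name: str
    priority: int  # assumption of this sketch: higher value = more preferred
    capacity: int  # total number of pods the pool can hold
    used: int = 0  # pods already placed in the pool


def place_pods(pools, pod_count):
    """Place pods on pools in descending priority order.

    Returns a dict mapping pool name to the number of pods placed there.
    Raises RuntimeError, without changing any pool, if the combined free
    capacity is too small for the request.
    """
    total_free = sum(p.capacity - p.used for p in pools)
    if pod_count > total_free:
        raise RuntimeError(f"need {pod_count} slots, only {total_free} free")

    placement = {}
    remaining = pod_count
    for pool in sorted(pools, key=lambda p: p.priority, reverse=True):
        if remaining == 0:
            break
        take = min(pool.capacity - pool.used, remaining)
        if take > 0:
            pool.used += take
            placement[pool.name] = take
            remaining -= take
    return placement


# Hypothetical pools; the names do not correspond to real ACK node types.
pools = [
    NodePool("spot", priority=1, capacity=10),
    NodePool("reserved", priority=3, capacity=4),
    NodePool("serverless", priority=2, capacity=3),
]
result = place_pods(pools, 9)
print(result)
```

In this sketch the nine pods land on "reserved" first, then "serverless", and only the overflow reaches "spot". A real policy would be declared to the cluster rather than computed in application code, so treat this only as a model of the ordering behavior.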