Company: Clarifai
Date Published:
Author: Rudi Chiarito
Word count: 1752
Language: English
Hacker News points: None

Summary

Kubernetes 1.3 introduced preliminary support for GPU scheduling, a feature that Clarifai has contributed to and eagerly adopted for its machine learning workloads. Integrating GPUs into Kubernetes clusters makes it possible to offload computation to highly parallel graphics hardware, despite challenges such as differing vendor requirements and potential security issues.

Clarifai has been migrating from virtual machines to a container-based infrastructure built on Kubernetes, with a focus on making the best use of its GPUs. Initial efforts concentrated on NVIDIA hardware, given its prevalence in machine learning and its support by the major cloud providers. The company labels nodes to track GPU resources and uses the experimental alpha.kubernetes.io/nvidia-gpu resource to schedule GPU workloads. While this model has limitations (in version 1.3 it does not support more than one device per machine), it already allows GPU workloads to be deployed more efficiently.

To simplify future deployments, Clarifai advocates for Docker volume plugin support in Kubernetes, which would streamline the use of nvidia-docker for managing GPU resources. The company continues to work on improving GPU discovery and utilization, with the goal of letting multiple jobs share a GPU effectively while maintaining performance standards.
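As a sketch of the scheduling model described above, a pod can request the experimental GPU resource and be steered onto labeled GPU nodes. Only the alpha.kubernetes.io/nvidia-gpu resource name comes from the article; the label key, pod name, and image below are hypothetical:

```yaml
# Hypothetical node label, applied out of band, e.g.:
#   kubectl label nodes <gpu-node> accelerator=nvidia
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job          # hypothetical name
spec:
  nodeSelector:
    accelerator: nvidia           # schedule only onto labeled GPU nodes
  containers:
  - name: trainer
    image: example.com/trainer:latest   # hypothetical image
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1   # 1.3 supports a single device per machine
```

The nodeSelector handles placement, while the resource limit lets the scheduler account for the device itself.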
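To see why Docker volume plugin support matters here: without it, NVIDIA driver libraries have to be exposed to GPU containers by hand, which is the bookkeeping nvidia-docker's volume plugin automates. A minimal sketch of the manual approach, assuming hypothetical host and container paths (actual locations vary by host and driver version):

```yaml
# Hypothetical manual driver mount; nvidia-docker would manage this volume instead.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-manual-driver     # hypothetical name
spec:
  containers:
  - name: cuda-app
    image: example.com/cuda-app:latest   # hypothetical image
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-libraries
      mountPath: /usr/local/nvidia       # where the container looks for the libs
      readOnly: true
  volumes:
  - name: nvidia-libraries
    hostPath:
      path: /var/lib/nvidia              # hypothetical host location of driver libs
```

Plumbing a host path into every GPU pod is exactly the kind of per-deployment detail that volume plugin support in Kubernetes would eliminate.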