Company: Clarifai
Date Published:
Author: Rudi Chiarito
Word count: 1752
Language: English
Hacker News points: None

Summary

Kubernetes 1.3 introduced preliminary support for GPU scheduling, a feature that Clarifai has contributed to and eagerly adopted for its machine learning workloads. Integrating GPUs into Kubernetes clusters makes it possible to offload computation to highly parallel graphics hardware, despite challenges such as differing vendor requirements and potential security issues.

Clarifai has been migrating from virtual machines to a container-based infrastructure built on Kubernetes, with a focus on making the best use of its GPUs. Initial efforts concentrated on NVIDIA hardware, given its prevalence in machine learning and its support by the major cloud providers. The company labels nodes to track GPU resources and uses the experimental alpha.kubernetes.io/nvidia-gpu resource to schedule GPU workloads. While this model has limitations (in version 1.3 it does not support more than one device per machine), it already allows GPU workloads to be deployed more efficiently.

To simplify future deployments, Clarifai advocates for Docker volume plugin support in Kubernetes, which would streamline the use of nvidia-docker for managing GPU resources. The company continues to work on improving GPU discovery and utilization, with the goal of letting multiple jobs share a GPU effectively while maintaining performance standards.
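As a sketch of the scheduling model described above, a pod can request the experimental GPU resource and be steered onto labeled GPU nodes. Only the alpha.kubernetes.io/nvidia-gpu resource name comes from the article; the label key, pod name, and image below are hypothetical:

```yaml
# Hypothetical node label, applied out of band, e.g.:
#   kubectl label nodes <gpu-node> accelerator=nvidia
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job          # hypothetical name
spec:
  nodeSelector:
    accelerator: nvidia           # schedule only onto labeled GPU nodes
  containers:
  - name: trainer
    image: example.com/trainer:latest   # hypothetical image
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1   # 1.3 supports a single device per machine
```

The nodeSelector handles placement, while the resource limit lets the scheduler account for the device itself.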
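To see why Docker volume plugin support matters here: without it, NVIDIA driver libraries have to be exposed to GPU containers by hand, which is the bookkeeping nvidia-docker's volume plugin automates. A minimal sketch of the manual approach, assuming hypothetical host and container paths (actual locations vary by host and driver version):

```yaml
# Hypothetical manual driver mount; nvidia-docker would manage this volume instead.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-manual-driver     # hypothetical name
spec:
  containers:
  - name: cuda-app
    image: example.com/cuda-app:latest   # hypothetical image
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-libraries
      mountPath: /usr/local/nvidia       # where the container looks for the libs
      readOnly: true
  volumes:
  - name: nvidia-libraries
    hostPath:
      path: /var/lib/nvidia              # hypothetical host location of driver libs
```

Plumbing a host path into every GPU pod is exactly the kind of per-deployment detail that volume plugin support in Kubernetes would eliminate.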