Company:
Date Published:
Author: Anket Sah
Word count: 636
Language: English
Hacker News points: None

Summary

Managed Slurm on Lambda is a fully supported Slurm offering built for fast, seamless deployment on Lambda's 1-Click Clusters. It optimizes cluster utilization for AI/ML workloads, is pre-validated on 1-Click Clusters, and is available exclusively on them. Core Slurm capabilities include the latest Lambda-tuned Slurm configuration, LDAP-backed user and group management, cgroups-based resource policies, container support, Slurm roles, high availability, and pre-installed ML software modules. Managed-only extras include automated Slurm patches and security updates, job history tracking, escalated issue resolution through a partnership with SchedMD, proactive health monitoring, node-failure detection, alerting, and root-cause analysis. The offering comes in two tiers: Managed, which provides full HPC support SLAs with SchedMD as backup, and Unmanaged, which provides general infrastructure support only.