ML Ops Platform at Cloudflare

What's this blog post about?

Cloudflare, an internet security company, describes its Machine Learning Operations (MLOps) approach, which supports its work securing applications and APIs built with AI. The post outlines the company's strategy for building robust ML models: data collection, model training, validation, deployment, and monitoring. The framework is designed to provide a consistent pipeline from data to model, and then from model to inference.

The company curates a set of model templates: production-ready data science repositories with example models. These templates are themselves deployed through production so that they remain stable foundations for future projects. Starting a new project takes one Makefile command, which builds a new CI/CD project in the user's chosen git project.

For orchestration, Cloudflare uses Directed Acyclic Graphs (DAGs), a flow-chart-style orchestration paradigm that weaves together the steps from data to model and from model to inference. The team has experimented with Apache Airflow, Argo Workflows, Kubeflow Pipelines, and Temporal. On the hardware side, the company uses GPUs for core datacenter workloads and edge inference, and relies on observability metrics consumed by Prometheus to track orchestration performance, maximize hardware utilization, and operate within a Kubernetes-native experience.

Adoption is an important part of MLOps, and Cloudflare has found success when it can help get projects started and shape their pipelines. The platform offers shared components, including notebooks, orchestration, data versioning (DVC), feature engineering (Feast), and model versioning (MLflow), to enable collaboration across teams. Overall, the approach leverages Cloudflare's network to help secure applications and APIs built with AI.
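To make the DAG idea concrete, here is a minimal sketch of a data-to-model-to-inference pipeline expressed as a dependency graph, using only Python's standard library. It is not Cloudflare's implementation and does not use any of the orchestrators named above. The step names simply mirror the stages listed in the summary, and `PIPELINE`, `make_steps`, `run_pipeline`, and `is_acyclic` are hypothetical names chosen for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
# The stage names follow the summary; they are illustrative only.
PIPELINE = {
    "data_collection": set(),
    "model_training": {"data_collection"},
    "validation": {"model_training"},
    "deployment": {"validation"},
    "monitoring": {"deployment"},
}


def is_acyclic(dag):
    """Return True if the dependency graph contains no cycles."""
    try:
        TopologicalSorter(dag).prepare()
        return True
    except CycleError:
        return False


def make_steps(dag, log):
    """Build placeholder callables that record their own name when run."""
    return {name: (lambda n=name: log.append(n)) for name in dag}


def run_pipeline(dag, steps):
    """Run every step after its dependencies and return the execution order."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        steps[name]()  # a real step would collect data, train a model, etc.
        order.append(name)
    return order


if __name__ == "__main__":
    log = []
    print(run_pipeline(PIPELINE, make_steps(PIPELINE, log)))
```

A production orchestrator adds scheduling, retries, and distributed execution on top of this ordering guarantee. The sketch only shows why the graph must be acyclic: a cycle would leave no valid order in which to run the steps.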

Company
Cloudflare

Date published
Dec. 7, 2023

Author(s)
Keith Adler, Rio Harapan Pangihutan

Word count
1833

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.