Company
CrowdStrike
Date Published
-
Author
-
Word count
3332
Language
English
Hacker News points
None

Summary

CrowdStrike makes extensive use of Apache Airflow to automate and manage its machine learning workflows, strengthening its ability to process the large volumes of data its cybersecurity operations depend on. Originally developed at Airbnb and open-sourced in 2015, Airflow facilitates the creation, scheduling, and monitoring of data pipelines expressed as Directed Acyclic Graphs (DAGs). At CrowdStrike, Airflow runs in a multi-node setup using Celery executors and Docker containers, with Terraform provisioning the infrastructure and Chef handling instance provisioning and configuration. This setup supports daily updates to the data corpus, ensuring data scientists always have the most current data available. The team has faced challenges around logging and job management, which it has addressed with shared file systems and remote logging. CrowdStrike's approach underscores the importance of planning for scalability, security, and a smooth transition from development to production environments. The company ultimately aims to move its machine learning workloads onto a Kubernetes cluster, further improving its deployment capabilities.
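
To make the DAG model concrete, below is a minimal sketch of a daily pipeline in the style the summary describes. It is an illustration only: the DAG id, task names, and Python callables are hypothetical and not taken from CrowdStrike's codebase.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_daily_samples(**context):
    # Hypothetical task: ingest the day's new samples.
    print(f"Ingesting samples for {context['ds']}")


def update_corpus(**context):
    # Hypothetical task: merge the new samples into the training corpus.
    print(f"Updating corpus as of {context['ds']}")


default_args = {
    "owner": "data-science",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_corpus_update",
    description="Illustrative daily refresh of an ML training corpus",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="pull_daily_samples",
        python_callable=pull_daily_samples,
    )
    update = PythonOperator(
        task_id="update_corpus",
        python_callable=update_corpus,
    )

    # The >> operator defines the DAG edge: ingest must finish before update runs.
    ingest >> update

In a multi-node deployment like the one described, the scheduler picks up files such as this from the DAGs folder and the Celery executor distributes the individual tasks to worker nodes; remote logging (or a shared file system) then keeps each task's logs retrievable from the webserver regardless of which worker executed it.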