
Improving the accuracy of our machine learning WAF using data augmentation and sampling

What's this blog post about?

Cloudflare developed a machine-learning-based Web Application Firewall (WAF) to enhance security and performance for its customers. The WAF analyzes HTTP requests for malicious content or intent, protecting against common attack vectors such as cross-site scripting (XSS), file inclusion, and SQL injection. To overcome the difficulty of obtaining high-quality labeled data, Cloudflare used data augmentation and generation techniques. These methods produced a diverse dataset covering malicious samples for all attack categories, benign samples, and obfuscation techniques.

Data augmentation involved generating artificial but realistic data by studying the statistical distributions of existing real-world data. Pseudo-random noise samples were particularly effective in improving the model's performance. By creating a series of token sampling distributions whose output was increasingly difficult for the model to distinguish from a real payload, the team significantly reduced false positives and improved overall robustness.

After augmentation, the machine learning WAF performed comparably to Cloudflare's signature-based WAF, with particular strengths in handling highly obfuscated or irregularly fuzzed content. The model's F1 score rose to 0.99 after augmentation, compared with 0.61 before. Overall, data augmentation and generation techniques played a crucial role in improving the machine learning WAF's performance and in inducing the right set of properties in the model. Cloudflare plans to investigate autoregressive language models further as a way to generate synthetic pseudo-valid payloads.
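The noise-sample idea can be illustrated with a short sketch. The summary does not say which token sampling distributions Cloudflare used, so the three levels here (uniform characters, uniform vocabulary tokens, frequency-weighted tokens) are assumptions chosen only to show a progression from easy to hard noise. The toy corpus, the function names, and the "benign" label are likewise hypothetical. The label is inferred from the reported drop in false positives.

```python
import random
import string
from collections import Counter

# Hypothetical toy corpus standing in for tokenized real payloads.
CORPUS = [
    ["select", "*", "from", "users", "where", "id", "=", "1"],
    ["<", "script", ">", "alert", "(", "1", ")", "<", "/", "script", ">"],
    ["..", "/", "..", "/", "etc", "/", "passwd"],
]


def uniform_chars(rng, length):
    """Level 0 (easiest): characters drawn uniformly at random."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(rng.choice(alphabet) for _ in range(length))


def uniform_tokens(rng, length, vocab):
    """Level 1: real vocabulary tokens, each equally likely."""
    return " ".join(rng.choice(vocab) for _ in range(length))


def unigram_tokens(rng, length, counts):
    """Level 2 (hardest here): tokens drawn by their corpus frequency."""
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return " ".join(rng.choices(tokens, weights=weights, k=length))


def make_noise_samples(corpus, n_per_level=3, length=8, seed=0):
    """Return (text, difficulty_level, label) tuples of synthetic noise.

    Assumption: noise is labeled "benign" so the model learns that
    payload-like tokens alone do not make a request malicious.
    """
    rng = random.Random(seed)
    counts = Counter(tok for payload in corpus for tok in payload)
    vocab = sorted(counts)
    samples = []
    for _ in range(n_per_level):
        samples.append((uniform_chars(rng, length), 0, "benign"))
        samples.append((uniform_tokens(rng, length, vocab), 1, "benign"))
        samples.append((unigram_tokens(rng, length, counts), 2, "benign"))
    return samples


if __name__ == "__main__":
    for text, level, label in make_noise_samples(CORPUS):
        print(level, label, repr(text))
```

Each level matches the real token statistics more closely than the one before, so the samples become progressively harder to tell apart from genuine payloads. In a real pipeline they would be mixed into the training set alongside labeled malicious and benign requests.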

Company
Cloudflare

Date published
Sept. 5, 2022

Author(s)
Vikram Grover

Word count
2660

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.