Improving the accuracy of our machine learning WAF using data augmentation and sampling

Company

Cloudflare

Date Published

Sept. 5, 2022

Author

Vikram Grover

Word count

2660

Language

English

Hacker News points

None

URL

blog.cloudflare.com/data-generation-and-sampling-strategies

Summary

Cloudflare developed a machine-learning-based Web Application Firewall (WAF) to enhance security and performance for its customers. The WAF analyzes HTTP requests for malicious content or intent, protecting against common attack vectors such as cross-site scripting (XSS), file inclusion, and SQL injection. Because high-quality labeled data is hard to obtain, Cloudflare turned to data augmentation and generation techniques. These let the team build a diverse dataset covering malicious samples for every attack category, benign samples, and obfuscation techniques. Data augmentation here means generating artificial but realistic data by studying the statistical distributions of existing real-world data. Pseudo-random noise samples proved particularly effective: the team created a series of token sampling distributions, each making the noise harder for the model to distinguish from a real payload, which significantly reduced false positives and improved overall robustness. After augmentation, the machine learning WAF performed comparably to Cloudflare's signature-based WAF and was particularly strong on highly obfuscated or irregularly fuzzed content. The model's F1 score rose from 0.61 before augmentation to 0.99 after. In conclusion, data augmentation and generation played a crucial role in improving the machine learning WAF's performance and in giving the model the desired properties, such as robustness and a low false positive rate. Cloudflare plans to investigate autoregressive language models for generating synthetic pseudo-valid payloads in the future.
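The idea of a series of token sampling distributions that grow harder to tell apart from real payloads can be illustrated with a minimal sketch. This is not Cloudflare's implementation: the whitespace tokenizer, the blend-with-uniform scheme, the `alpha` parameter, and every function name below are assumptions made for illustration, as is the choice to label the noise as benign (label `0`).

```python
import random
from collections import Counter


def token_distribution(payloads):
    """Empirical token frequencies, using a naive whitespace tokenizer (assumption)."""
    counts = Counter(tok for p in payloads for tok in p.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def mix_with_uniform(dist, alpha):
    """Blend the empirical distribution with a uniform one over the same vocabulary.

    alpha=0.0 gives uniform noise (easy to tell from real payloads);
    alpha=1.0 gives the empirical token frequencies (harder to tell apart).
    """
    n = len(dist)
    return {tok: alpha * p + (1.0 - alpha) / n for tok, p in dist.items()}


def sample_noise(dist, length, rng):
    """Draw `length` tokens independently from `dist` and join them with spaces."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return " ".join(rng.choices(tokens, weights=weights, k=length))


def noise_curriculum(payloads, alphas=(0.0, 0.5, 1.0), per_level=3, length=6, seed=0):
    """Generate (sample, label) pairs of noise at increasing difficulty levels.

    The label 0 (benign) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    base = token_distribution(payloads)
    samples = []
    for alpha in alphas:
        dist = mix_with_uniform(base, alpha)
        for _ in range(per_level):
            samples.append((sample_noise(dist, length, rng), 0))
    return samples
```

Calling `noise_curriculum(real_payloads)` on a list of observed payload strings returns noise samples that could be appended to a training set as extra negatives. A real pipeline would likely use a purpose-built tokenizer and richer distributions than independent token draws.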