
How Cloudflare runs machine learning inference in microseconds

What's this blog post about?

Bot management is a critical component of our platform that helps detect and mitigate bots performing malicious activities on the web. By optimizing memory allocation in the Rust implementation of bot management, we improved performance significantly. Memory allocations are expensive because they require accessing the heap, which can cause cache misses and increased latency.

We used several techniques to reduce the number of memory allocations:

1. Avoid unnecessary buffer copies by passing references instead of values.
2. Rewrite algorithms to operate on stack-allocated data whenever possible.
3. Test for zero allocation using dhat, an automated testing tool that counts memory allocations.
4. Optimize decision trees for single-document evaluation rather than evaluating multiple documents at once, which removes the need for additional vector allocations.
5. Reuse buffers by passing references instead of owned data structures like Vecs or Strings.

With these optimizations, we reduced P50 latency from 388µs to 309µs (20%) and P99 latency from 940µs to 813µs (14%), making our bot management module faster and more efficient and enhancing the overall performance of our platform.

In conclusion, optimizing memory allocation can lead to significant performance gains in real-world applications like our bot management module. By focusing on reducing unnecessary allocations and reusing buffers where possible, we achieved substantial latency improvements without compromising functionality or readability.
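The reference-passing and buffer-reuse techniques above can be sketched in Rust. This is a minimal illustration, not Cloudflare's actual code: the function names and the header-normalization task are hypothetical, chosen only to contrast a per-call allocation with a caller-supplied reusable buffer.

```rust
/// Allocating version: builds and returns a fresh String on every call,
/// so each call hits the heap at least once.
fn normalize_owned(header: &str) -> String {
    header.trim().to_ascii_lowercase()
}

/// Allocation-free version: the caller passes a reusable buffer by
/// mutable reference. `clear()` keeps the existing capacity, so repeated
/// calls only touch the heap if the buffer ever needs to grow.
fn normalize_into(header: &str, out: &mut String) {
    out.clear();
    for c in header.trim().chars() {
        out.push(c.to_ascii_lowercase());
    }
}

fn main() {
    let headers = ["  User-Agent ", " ACCEPT "];
    let mut buf = String::with_capacity(64); // one up-front allocation
    for h in &headers {
        normalize_into(h, &mut buf);
        println!("{buf}");
    }
    // The owned variant allocates a new String per header instead.
    let _one_off = normalize_owned(headers[0]);
}
```

In a hot path that processes every request, moving the allocation out of the loop this way is exactly the kind of change a heap profiler such as dhat can confirm: the per-call allocation count for `normalize_into` drops to zero once the buffer has warmed up.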
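Point 4 above, single-document decision-tree evaluation, can also be sketched. The node layout and scoring function here are illustrative assumptions, not the module's real data structures: the idea is simply that scoring one feature slice walks the tree iteratively with a stack-held index, with no per-evaluation vector allocation.

```rust
/// Hypothetical flattened tree node; real layouts will differ.
struct Node {
    feature: usize,    // index into the feature slice
    threshold: f32,    // split point
    left: usize,       // child index when feature value < threshold
    right: usize,      // child index otherwise
    leaf: Option<f32>, // Some(score) marks a leaf
}

/// Scores a single document's features against one tree.
/// The walk uses only a stack-allocated index: zero heap allocations.
fn score(tree: &[Node], features: &[f32]) -> f32 {
    let mut i = 0;
    loop {
        let n = &tree[i];
        if let Some(s) = n.leaf {
            return s;
        }
        i = if features[n.feature] < n.threshold {
            n.left
        } else {
            n.right
        };
    }
}

fn main() {
    // Tiny example tree: split on feature 0 at 0.5.
    let tree = [
        Node { feature: 0, threshold: 0.5, left: 1, right: 2, leaf: None },
        Node { feature: 0, threshold: 0.0, left: 0, right: 0, leaf: Some(0.1) },
        Node { feature: 0, threshold: 0.0, left: 0, right: 0, leaf: Some(0.9) },
    ];
    println!("{}", score(&tree, &[0.2]));
    println!("{}", score(&tree, &[0.8]));
}
```

A batch-oriented API would instead collect results into a `Vec<f32>` per call; specializing for the one-request-at-a-time case removes that allocation from the hot path.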

Company
Cloudflare

Date published
June 19, 2023

Author(s)
Austin Hartzheim

Word count
1441

Hacker News points
6

Language
English


By Matt Makai. 2021-2024.