Streamlining LLM Inference at the Edge with TFLite
Blog post from Google
XNNPack, the default TensorFlow Lite CPU inference engine, has gained a smarter weight-caching system that reduces startup latency and peak memory usage across platforms. XNNPack repacks static model weights into an internal layout optimized for inference computation, but until now each session repacked the weights from scratch and held an extra copy of them, inflating peak memory usage.

The new XNNPack cache-provider interface addresses this by saving packed weights to a file and loading them back with mmap. Startup latency drops because weights are packed once and reused across sessions, and peak memory usage drops because the operating system's virtual memory manager pages the mapped weights in and out on demand rather than the application holding a private heap copy. File-backed mappings also let multiple processes running the same model share one set of packed weights, and they simplify the user-facing API: instead of creating and managing cache objects, users just specify a cache file path (see the first sketch below).

The cache must stay consistent with what produced it: when the model changes or XNNPack is upgraded, the cached packed layout may no longer match, so the cache has to be invalidated and rebuilt.

In benchmarks, session initialization is faster with the cache. The gains are largest for large language models, whose repeated layer structure makes weight deduplication effective; standard models such as Stable Diffusion see no deduplication benefit. Future work aims to apply the deduplication mechanism independently of file-backed mappings to improve performance further.
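The API simplification is visible at the call site. A minimal C++ sketch, assuming a recent TensorFlow Lite build whose XNNPack delegate options expose a `weight_cache_file_path` field (verify the field name against your version's `xnnpack_delegate.h`; the model and cache paths here are placeholders):

```cpp
#include <memory>

#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  // Point the delegate at a cache file: packed weights are written there on
  // the first session and mmap'ed straight back in on later sessions.
  TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
  options.weight_cache_file_path = "/data/local/tmp/model.xnnpack_cache";

  TfLiteDelegate* delegate = TfLiteXNNPackDelegateCreate(&options);
  interpreter->ModifyGraphWithDelegate(delegate);
  interpreter->AllocateTensors();
  // ... fill inputs, interpreter->Invoke(), read outputs ...
  TfLiteXNNPackDelegateDelete(delegate);
  return 0;
}
```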
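The memory behavior described above follows from how file-backed mappings work. This is not XNNPack's actual code, just a sketch of the POSIX mechanism the post relies on: a read-only shared mapping lives in the kernel page cache, so pages are loaded lazily, can be evicted under pressure, and are physically shared by every process that maps the same file.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>

// Maps a packed-weight file read-only. Returns nullptr on failure.
const void* MapPackedWeights(const char* path, size_t* out_size) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) {
    close(fd);
    return nullptr;
  }
  // MAP_SHARED + PROT_READ: pages come from the shared page cache, so
  // several processes mapping the same file reuse the same physical memory.
  void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);  // The mapping remains valid after the descriptor is closed.
  if (data == MAP_FAILED) return nullptr;
  *out_size = static_cast<size_t>(st.st_size);
  return data;
}
```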
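Invalidation can be pictured as a header check. The struct and names below are hypothetical, assuming the cache records which model and which XNNPack build produced it; the post does not describe the actual file format.

```cpp
#include <cstdint>

// Hypothetical cache header binding the file to one model and one XNNPack
// build; any mismatch means the packed layout may be stale.
struct CacheHeader {
  uint64_t magic;              // Marks the file as an XNNPack weight cache.
  uint64_t builder_version;    // Identifies the XNNPack build that wrote it.
  uint64_t model_fingerprint;  // Hash of the model's graph and weights.
};

constexpr uint64_t kCacheMagic = 0x584E4E5041434B00;  // "XNNPACK\0"

bool CacheIsValid(const CacheHeader& header, uint64_t current_version,
                  uint64_t current_fingerprint) {
  // On any mismatch the caller discards the cache and repacks from scratch.
  return header.magic == kCacheMagic &&
         header.builder_version == current_version &&
         header.model_fingerprint == current_fingerprint;
}
```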
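The LLM-specific speedup comes from deduplication: a transformer stack's repeated structure can yield byte-identical packed buffers, which the cache needs to store only once. A hypothetical sketch of the interning idea, not XNNPack's implementation:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Stores each unique packed buffer once; duplicates resolve to the same
// offset into one contiguous blob.
struct PackedStore {
  std::vector<uint8_t> blob;                      // All unique packed weights.
  std::unordered_map<std::string, size_t> index;  // Buffer content -> offset.

  // Returns the offset of `packed` in the blob, appending it only if unseen.
  size_t Intern(const std::vector<uint8_t>& packed) {
    std::string key(packed.begin(), packed.end());
    auto [it, inserted] = index.try_emplace(key, blob.size());
    if (inserted) blob.insert(blob.end(), packed.begin(), packed.end());
    return it->second;
  }
};
```

The future-work item in the post, applying deduplication independently of file-backed mappings, would amount to reusing an interning step like this without requiring the blob to live in a cache file.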