
How to Achieve a 9ms Inference Time for Transformer Models

Blog post from Stream

Post Details

Company: Stream
Date Published: -
Author: Bhaskar
Word Count: 1,166
Language: English
Hacker News Points: -
Summary

Stream's AI Moderation Platform detects and acts on harmful content in real time, which is crucial for services like live streaming and instant messaging. To achieve this, Stream optimizes its machine learning models for low latency and high throughput, ensuring messages are processed before they become publicly visible. Techniques such as knowledge distillation and model quantization significantly reduce the memory footprint and inference time of models like BERT, making them suitable for real-time prediction. Stream also applies hardware-specific optimizations, particularly on cost-effective CPU instances, achieving inference times comparable to GPUs. A further real-time optimization, skipping padding for shorter messages, improves throughput. Together, these strategies allow Stream's infrastructure to sustain high message throughput with consistently low latency, enabling proactive content moderation and encouraging positive user behavior on the platform. Future improvements may include further model pruning and exploring additional hardware optimizations.
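To make the quantization idea concrete, here is a minimal sketch of post-training int8 quantization, the general technique behind shrinking models like BERT. This is an illustrative toy, not Stream's actual pipeline: the function names and the per-tensor scaling scheme are assumptions for demonstration, and a real deployment would use a framework's quantization tooling rather than hand-rolled NumPy.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a single per-tensor scale.

    Illustrative only: real quantizers typically use per-channel scales
    and calibration data for activations."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float32 weights.
    return q.astype(np.float32) * scale

# A random 768x768 matrix stands in for one BERT-sized weight tensor.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(768, 768)).astype(np.float32)

q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32...
print(w.nbytes // q.nbytes)  # → 4
# ...while the worst-case rounding error stays below one quantization step.
print(float(np.abs(w - dequantize(q, scale)).max()) < scale)  # → True
```

The 4x memory reduction is what makes quantized models fit comfortably in CPU caches, which is one reason quantization pairs well with the CPU-based serving the post describes; the accuracy cost is bounded by the quantization step size.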