
How to Achieve a 9ms Inference Time for Transformer Models

Blog post from Stream

Post Details

Company: Stream
Date Published: -
Author: Bhaskar
Word Count: 1,166
Language: English
Hacker News Points: -
Summary

Stream's AI Moderation Platform detects and acts on harmful content in real time, which is crucial for services like live streaming and instant messaging. To achieve this, Stream optimizes its machine learning models for low latency and high throughput, ensuring messages are processed before they become publicly visible. Techniques such as knowledge distillation and model quantization significantly reduce the memory footprint and inference time of models like BERT, making them suitable for real-time prediction. Stream also applies hardware-specific optimizations, particularly on cost-effective CPU instances, achieving inference times comparable to GPUs. A further real-time optimization, skipping padding for shorter messages, improves throughput. Together, these strategies allow Stream's infrastructure to sustain high message throughput with consistently low latency, enabling proactive content moderation and encouraging positive user behavior on the platform. Future improvements may include further model pruning and exploring additional hardware optimizations.
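To make the quantization idea concrete, here is a minimal sketch of post-training int8 quantization, the general technique behind shrinking models like BERT. This is an illustrative toy, not Stream's actual pipeline: the function names and the per-tensor scaling scheme are assumptions for demonstration, and a real deployment would use a framework's quantization tooling rather than hand-rolled NumPy.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a single per-tensor scale.

    Illustrative only: real quantizers typically use per-channel scales
    and calibration data for activations."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float32 weights.
    return q.astype(np.float32) * scale

# A random 768x768 matrix stands in for one BERT-sized weight tensor.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(768, 768)).astype(np.float32)

q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32...
print(w.nbytes // q.nbytes)  # → 4
# ...while the worst-case rounding error stays below one quantization step.
print(float(np.abs(w - dequantize(q, scale)).max()) < scale)  # → True
```

The 4x memory reduction is what makes quantized models fit comfortably in CPU caches, which is one reason quantization pairs well with the CPU-based serving the post describes; the accuracy cost is bounded by the quantization step size.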