Fireworks Streaming Transcription: 300ms with Whisper-v3-large-quality

Post Details

Company

Fireworks AI

Date Published

Oct. 6, 2025

Author

-

Word Count

1,119

Language

English

Hacker News Points

-

Source URL

fireworks.ai/blog/streaming-audio-launch

Summary

Fireworks has introduced a streaming speech-to-text API for real-time applications such as voice agents and live captioning. It reports 300ms end-to-end latency on 16kHz mono PCM audio and accuracy within 3% WER of Whisper v3-large. The API is priced at $0.0032 per audio minute, which the post describes as significantly cheaper than competitors. The service targets scenarios that need immediate transcription, such as call centers and live broadcasts: clients stream audio over a WebSocket connection in chunks of 50 to 500ms and receive incremental text segments in return. Fireworks' custom audio serving stack, built over several years on PyTorch, uses optimizations such as voice activity detection to handle audio with sparse speech efficiently. Beyond speed and accuracy, the post emphasizes production readiness: Fireworks serves companies such as Cursor, Uber, and DoorDash, and serverless customers get a quota of 50 concurrent streams. Fireworks also positions the service as a building block for compound AI systems that combine speech with text, image, and specialized models. Users can start with the serverless streaming endpoint either through code or through a UI playground.
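The chunking and pricing figures above can be made concrete with a little arithmetic. The following is a minimal sketch, not Fireworks client code: it assumes 16-bit samples (the summary gives only the sample rate and channel count), and the helper names `chunk_size_bytes`, `iter_chunks`, and `estimate_cost_usd` are hypothetical. The endpoint URL and message format are not given in the summary, so the WebSocket call itself is left out.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000          # from the summary: 16kHz mono PCM
BYTES_PER_SAMPLE = 2             # assumption: 16-bit samples (not stated in the summary)
PRICE_PER_AUDIO_MINUTE = 0.0032  # USD, from the summary


def chunk_size_bytes(chunk_ms: int) -> int:
    """Bytes of mono PCM covering chunk_ms milliseconds of audio."""
    if not 50 <= chunk_ms <= 500:
        raise ValueError("chunk_ms must be between 50 and 500")
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive slices of pcm; the last slice may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def estimate_cost_usd(pcm: bytes) -> float:
    """Estimated price for the audio at the per-minute rate in the summary."""
    seconds = len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return seconds / 60 * PRICE_PER_AUDIO_MINUTE
```

In a real client, each chunk yielded by `iter_chunks` would typically be sent as one binary WebSocket message, paced at roughly real time. Under the 16-bit assumption, a 100ms chunk is 3,200 bytes and an hour of audio costs about $0.19.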