Company
Date Published
Author
Abu Qader, Pankaj Gupta, Justin Yi, Philip Kiely
Word count
1571
Language
English
Hacker News points
None

Summary

Baseten reports industry-leading latency and throughput for Mistral 7B: a time to first token under 130 milliseconds, 170 tokens per second, and a total response time of 700 milliseconds. Its dedicated model deployments offer substantial privacy, security, and reliability benefits, and they let developers adjust deployment settings to optimize for latency, throughput, or cost. By experimenting with different batch sizes and sequence lengths, users can find the best configuration for their production workloads, while accounting for factors such as infrastructure overhead, tokenization accuracy, and the value of the model's output. Baseten's optimized inference engines expose levers for making tradeoffs among these metrics, which lets users reach a lower cost at scale than they would with shared endpoint providers.
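The three headline numbers are related by simple arithmetic, which a rough sketch can make concrete. The model below is an assumption for illustration, not Baseten's benchmarking method: it treats total response time as time to first token plus output tokens divided by a constant generation rate. The function names are hypothetical.

```python
def estimate_total_response_ms(ttft_ms: float, tokens_per_second: float,
                               output_tokens: float) -> float:
    """Approximate total response time in milliseconds.

    Assumes (for illustration only) that generation proceeds at a constant
    rate after the first token arrives.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms + (output_tokens / tokens_per_second) * 1000.0


def implied_output_tokens(total_ms: float, ttft_ms: float,
                          tokens_per_second: float) -> float:
    """Invert the same simplified model to get the implied output length."""
    if total_ms < ttft_ms:
        raise ValueError("total_ms must be at least ttft_ms")
    return (total_ms - ttft_ms) / 1000.0 * tokens_per_second


if __name__ == "__main__":
    # Figures quoted in the summary: 130 ms TTFT, 170 tokens/s, 700 ms total.
    tokens = implied_output_tokens(total_ms=700, ttft_ms=130, tokens_per_second=170)
    print(f"Implied output length: about {tokens:.0f} tokens")
```

Under this simplified model, the quoted figures are consistent with a response of roughly 97 output tokens. Real deployments will deviate from it, since batch size, sequence length, and infrastructure overhead all shift both terms.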