Benchmarking fast Mistral 7B inference

Post Details

Company

Baseten

Date Published

March 14, 2024

Authors

Abu Qader, Pankaj Gupta, Justin Yi, Philip Kiely

Word Count

1,571

Language

English

Hacker News Points

-

Source URL

www.baseten.co/blog/benchmarking-fast-mistral-7b-inference

Summary

Baseten reports industry-leading latency and throughput for Mistral 7B: a time to first token under 130 milliseconds, 170 tokens per second, and a total response time of 700 milliseconds. Its dedicated model deployments offer privacy, security, and reliability benefits, and let developers tune settings to optimize for latency, throughput, or cost. By experimenting with batch sizes and sequence lengths, users can find the best configuration for their production workloads, while accounting for factors such as infrastructure overhead, tokenization accuracy, and the value of the model's output. Baseten's optimized inference engines expose levers for making tradeoffs among these metrics, which can yield a lower cost at scale than shared endpoint providers.
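The three headline numbers are related by simple arithmetic. The sketch below assumes a simplified model that is not taken from the post: total response time is the time to first token plus a steady decoding phase at the reported tokens-per-second rate. The function names are hypothetical, and real measurements also include network and queueing overhead.

```python
def total_response_time_ms(ttft_ms: float, tokens_per_second: float, output_tokens: int) -> float:
    """Estimate end-to-end latency as time to first token plus decode time.

    Assumption: decoding runs at a constant rate after the first token.
    """
    decode_ms = output_tokens / tokens_per_second * 1000.0
    return ttft_ms + decode_ms


def implied_output_tokens(ttft_ms: float, tokens_per_second: float, total_ms: float) -> float:
    """Invert the same model: how many tokens fit in the time left after the first token."""
    return (total_ms - ttft_ms) / 1000.0 * tokens_per_second


if __name__ == "__main__":
    # Figures quoted in the summary: 130 ms TTFT, 170 tokens/s, 700 ms total.
    print(implied_output_tokens(130, 170, 700))
    # Forward direction for a hypothetical 100-token completion.
    print(total_response_time_ms(130, 170, 100))
```

Under this model, the quoted figures are consistent with a response of a little under 100 output tokens. Longer completions would push total response time up roughly linearly at the same decode rate.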