How to run LLM performance benchmarks (and why you should)
Blog post from Baseten
Large Language Models (LLMs) are complex systems whose performance is challenging to evaluate because many factors interact: the model itself, the hardware it runs on, and the shape of the workload. SemiAnalysis has developed InferenceMAX, a benchmark that measures inference speed across common hardware configurations and gives the community a shared reference point. These benchmarks assess generic workloads, however, so for precise insights users should run their own benchmarks tailored to their own data.

This article details replicating InferenceMAX on Baseten using TensorRT-LLM, and explores how the Baseten Inference Stack (BIS) can improve performance through techniques such as speculative decoding. It lays out key patterns for effective model evaluation, emphasizing server-side benchmarking, which removes network variability from measurements, and an iterative benchmarking process for refining configurations. Dataset selection matters as well: production, public, and synthetic data each contribute to a comprehensive picture of performance.

Benchmarking can be complex, but it is crucial for optimizing models and ensuring user satisfaction. A well-structured benchmarking approach serves as an early-warning system, guiding decisions about models, configurations, and providers. The article closes by previewing future exploration of realistic datasets and advanced techniques for improving inference performance.
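To make the core metrics concrete, here is a minimal sketch of a streaming benchmark harness that records time-to-first-token (TTFT) and output tokens per second. The `benchmark_stream` helper and the `fake_generate` stub are hypothetical names for illustration; in practice the generator would wrap your inference server's streaming API, and running the client on the same host as the server (or measuring server-side) removes network variability, as the article recommends.

```python
import time
import statistics
from typing import Callable, Iterable, Iterator

def benchmark_stream(
    generate: Callable[[str], Iterator[str]],
    prompts: Iterable[str],
) -> dict:
    """Measure median TTFT and output tokens/sec over a set of prompts.

    `generate` is any callable that yields output tokens for a prompt.
    Here it is a stub; a real harness would stream from the model server.
    """
    ttfts, rates = [], []
    for prompt in prompts:
        start = time.perf_counter()
        first_token_at = None
        token_count = 0
        for _ in generate(prompt):
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now - start  # time to first token
            token_count += 1
        total = time.perf_counter() - start
        ttfts.append(first_token_at)
        rates.append(token_count / total)
    return {
        "ttft_p50_s": statistics.median(ttfts),
        "tokens_per_s_p50": statistics.median(rates),
    }

# Stub generator standing in for a real streaming endpoint.
def fake_generate(prompt: str):
    for token in prompt.split():
        time.sleep(0.001)  # simulate per-token decode latency
        yield token

metrics = benchmark_stream(fake_generate, ["the quick brown fox jumps"] * 5)
print(metrics)
```

Running the same harness repeatedly while varying one knob at a time (batch size, draft model for speculative decoding, hardware) is the iterative loop the article describes: each run either confirms the current configuration or points at the next one to try.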