Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
Blog post from Hugging Face
SPEED-Bench is a benchmark for evaluating speculative decoding (SD) across diverse semantic domains and realistic serving regimes, using production-grade inference engines. In SD, a lightweight draft model speculates several future tokens, which the target model then verifies; this improves throughput significantly while preserving the target model's output distribution. Existing benchmarks often lack semantic diversity and real-world relevance. SPEED-Bench addresses these shortcomings by combining two purpose-built dataset splits: a Qualitative split, optimized for semantic diversity, that measures drafter accuracy, and a Throughput split, built to evaluate system-level speedups across a range of input sequence lengths and at high concurrency. A unified measurement framework keeps evaluation consistent across systems by handling tokenization externally and integrating with production engines such as TensorRT-LLM and vLLM. SPEED-Bench reveals domain-dependent accuracy and speedups, highlights the effects of optimizations such as vocabulary pruning, and corrects the inaccurate throughput measurements that result from benchmarking with random tokens. The overall aim is to establish a unified standard for evaluating SD in both research and production settings.
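To make the draft-then-verify idea concrete, here is a minimal sketch of one speculative decoding step using the standard rejection-sampling acceptance rule. It is not taken from SPEED-Bench or from any inference engine: the two "models" (`target_probs`, `draft_probs`), the vocabulary size, and the function `speculative_step` are hypothetical toy stand-ins chosen so the example runs with the standard library alone. A real engine would score all drafted positions in a single target forward pass rather than in a Python loop.

```python
import random

VOCAB = 4  # toy vocabulary size (assumption for this sketch)


def target_probs(context):
    """Hypothetical 'target model': next-token distribution from the last token."""
    last = context[-1] if context else 0
    weights = [1.0 + ((last + t) % VOCAB) for t in range(VOCAB)]
    total = sum(weights)
    return [w / total for w in weights]


def draft_probs(context):
    """Hypothetical 'draft model': the target distribution blurred toward uniform."""
    return [0.7 * p + 0.3 / VOCAB for p in target_probs(context)]


def sample(weights, rng):
    """Draw one token id in proportion to the (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(context, k, rng):
    """Run one draft-then-verify step.

    Returns (emitted_tokens, n_accepted_draft_tokens). Every step emits at
    least one token: a corrected token on rejection, or a bonus token when
    all k drafts are accepted.
    """
    # 1. The draft model proposes k tokens autoregressively.
    drafted, draft_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. The target model verifies the proposals from left to right.
    emitted = []
    ctx = list(context)
    for tok, q in zip(drafted, draft_dists):
        p = target_probs(ctx)
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q); this is
            # what keeps the output distributed according to the target model.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            emitted.append(sample(residual, rng))
            return emitted, len(emitted) - 1

    # 3. All k drafts accepted: take one bonus token from the target model.
    emitted.append(sample(target_probs(ctx), rng))
    return emitted, k


if __name__ == "__main__":
    rng = random.Random(0)
    context, steps = [1], 0
    while len(context) < 50:
        tokens, _ = speculative_step(context, k=3, rng=rng)
        context.extend(tokens)
        steps += 1
    print(f"generated {len(context) - 1} tokens in {steps} verification steps")
```

In this toy setting, the number of tokens emitted per verification step is one simple way to think about drafter accuracy: the closer the draft distribution is to the target, the more drafted tokens survive verification and the fewer target passes are needed per generated token.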