Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Blog post from HuggingFace

Post Details
Authors: Talor Abramovich, Maor Ashkenazi, Izzy Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Rouhani, Ran Zilberstein, and Yonatan Geifman
Word Count: 2,333
Summary

SPEED-Bench is a benchmark for evaluating speculative decoding (SD) across diverse semantic domains and realistic serving regimes, using production-grade inference engines. In SD, a lightweight draft model speculates several future tokens, which the target model then verifies; this improves throughput while preserving the target model's output distribution. Existing benchmarks often lack semantic diversity and real-world relevance. SPEED-Bench addresses this with two purpose-built dataset splits: a Qualitative split, optimized for semantic diversity, that measures drafter accuracy, and a Throughput split that evaluates system-level speedups across a range of input sequence lengths and at high concurrency. A unified measurement framework keeps evaluation consistent across systems by handling tokenization externally and integrating with production engines such as TensorRT-LLM and vLLM. SPEED-Bench reveals that accuracy and speedups depend on the domain, highlights the effects of optimizations such as vocabulary pruning, and corrects the inaccurate throughput measurements that result from benchmarking with random tokens. Its stated aim is to establish a unified standard for evaluating SD in research and production settings.
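
The draft-then-verify loop described in the summary can be illustrated with a minimal sketch. It is not SPEED-Bench code and does not use any real inference engine: the two "models" are hypothetical toy functions that map a token context to the next token, decoding is greedy only, and the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are assumptions made for illustration. A production engine would verify all drafted positions in one batched forward pass of the target model, and sampling-based decoding needs a rejection-sampling acceptance rule rather than the exact-match check used here.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to the next token (greedy).
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: the target model generates one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], max_new_tokens: int, k: int = 4
) -> Tuple[List[int], List[int]]:
    """Greedy speculative decoding; returns (tokens, accepted draft tokens per step)."""
    tokens = list(prompt)
    accepted_per_step: List[int] = []
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each drafted position and stops at the first mismatch.
        #    (Done position by position here; real engines batch this step.)
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(tokens + proposal[:i]) != tok:
                break
            accepted += 1
        tokens.extend(proposal[:accepted])
        # 3. The target always contributes one token: the correction after a
        #    mismatch, or a bonus token when every drafted token was accepted.
        tokens.append(target(tokens))
        accepted_per_step.append(accepted)
    return tokens[:goal], accepted_per_step


def toy_target(ctx: List[int]) -> int:
    # Hypothetical stand-in for the target model.
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical drafter: agrees with the target except on some contexts.
    guess = toy_target(ctx)
    return (guess + 1) % 11 if ctx[-1] % 4 == 0 else guess


if __name__ == "__main__":
    out, accepted = speculative_decode(toy_target, toy_draft, [1], max_new_tokens=20)
    print("matches greedy baseline:", out == greedy_decode(toy_target, [1], 20))
    print("mean accepted draft tokens per step:", sum(accepted) / len(accepted))
```

In this toy setting the speculative output equals the baseline greedy output by construction, which mirrors the claim that SD preserves the target model's output. The mean number of accepted draft tokens per step is one simple way to express drafter accuracy; how SPEED-Bench itself defines and reports its metrics is not specified in this summary.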