Stop benchmarking inference providers
Blog post from Hugging Face
Nathan Habib argues that benchmarking models through inference providers is unreliable: you end up measuring the provider rather than the model, because providers may serve altered versions of it, for example quantized weights or different prompting. Emphasizing that the Transformers implementation should be the reference for both the model and its evaluation, he proposes benchmarking directly from the Hugging Face (HF) Hub with open-source libraries, an approach that scales to the millions of models hosted there. The workflow he outlines uses HF Jobs for on-demand compute and a uv script (a single Python file with inline dependency metadata, executed by the uv package manager) that starts an inference server and benchmarks the model against it. The script declares dependencies such as inspect-ai and the openai client, exposes the evaluation parameters, and publishes the results back to the Hub. With this setup, users can efficiently evaluate models on standard benchmarks such as GPQA Diamond and contribute their results to community leaderboards on the HF Hub.
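The post's exact script is not reproduced here, but a minimal sketch of that kind of uv script might look like the following. Only inspect-ai, the openai client, and GPQA Diamond are named in the summary; the choice of vLLM as the server, the model id, the port, and the `inspect_evals/gpqa_diamond` task name are illustrative assumptions.

```python
# /// script
# dependencies = [
#     "inspect-ai",
#     "openai",
#     "vllm",  # assumption: the post only says "a server"; any OpenAI-compatible server works
# ]
# ///
"""Illustrative sketch of the kind of uv script the post describes:
start an OpenAI-compatible server for the model, point inspect-ai at it,
and run GPQA Diamond. Model id, port, and task source are assumptions."""
import os
import subprocess
import time
import urllib.request

from inspect_ai import eval

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model under test
PORT = 8000

# 1. Serve the model locally from its Hub weights (vLLM shown here as an assumption).
server = subprocess.Popen(["vllm", "serve", MODEL_ID, "--port", str(PORT)])

# 2. Wait for the server's health endpoint before starting the evaluation.
for _ in range(180):
    try:
        urllib.request.urlopen(f"http://localhost:{PORT}/health")
        break
    except OSError:
        time.sleep(5)

# 3. Route inspect-ai's OpenAI-compatible provider to the local server.
os.environ["OPENAI_BASE_URL"] = f"http://localhost:{PORT}/v1"
os.environ["OPENAI_API_KEY"] = "local"  # any non-empty value; the server is local

try:
    # 4. Run the benchmark. The task name assumes a GPQA Diamond implementation
    #    (e.g. the inspect_evals package) is installed and registered.
    logs = eval(
        "inspect_evals/gpqa_diamond",
        model=f"openai/{MODEL_ID}",
        max_connections=32,
    )
    print(logs[0].results)
finally:
    server.terminate()
```

In the workflow the post describes, a script along these lines is what HF Jobs would execute on the requested hardware; the scores it produces are what could then be published to the Hub and surfaced on community leaderboards.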