Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM?
Blog post from Surge AI
Hugging Face's BLOOM, a multilingual large language model with 176 billion parameters, was evaluated on how well it performs in real-world applications, and the results revealed several challenges and limitations. Although BLOOM was trained openly with contributions from over 1,000 researchers across 70 countries, its performance in human evaluations on tasks such as categorizing toxic speech, creative writing, question answering, and marketing copywriting did not consistently live up to the expectations set by traditional academic benchmarks.

The study highlights the shortcomings of existing benchmarks, which often fail to capture the nuanced, creative, and practical abilities of language models, particularly qualities like humor and serendipity. In the human evaluations, BLOOM showed promise on programming tasks but struggled with consistency and accuracy elsewhere, suggesting that its development could benefit from benchmarks and evaluation criteria grounded in real-world applicability.

The authors propose open-sourcing a dataset designed for human evaluation and establishing guidelines for assessing language model outputs, encouraging a collaborative approach to refining both the models and the methods used to judge them.
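To make the evaluation setup concrete, the sketch below shows one way model outputs for such a study could be generated with the Hugging Face transformers pipeline. This is a minimal illustration, not the post's actual pipeline: the prompts are invented, and the smaller "bigscience/bloom-560m" checkpoint is used only so the example runs on modest hardware, whereas the post evaluates the full 176B-parameter model.

```python
# Minimal sketch: generate BLOOM completions for prompts that human raters
# would then score (e.g., on correctness, fluency, and usefulness).
# Assumptions: the prompts below are illustrative, and "bigscience/bloom-560m"
# stands in for the full 176B model evaluated in the post.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompts = [
    "Classify the following comment as toxic or not toxic: 'You are a genius.'",
    "Write a short, humorous tagline for a coffee shop that only serves decaf.",
]

for prompt in prompts:
    result = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
    # Each generated continuation would be shown to human annotators alongside
    # the prompt; their ratings form the human-evaluation dataset.
    print(result[0]["generated_text"])
```

In a study like the one described, the collected ratings would then be aggregated per task category (toxicity classification, creative writing, question answering, copywriting) to compare human judgments against scores from standard academic benchmarks.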