Improving data quality with confidence
Blog post from Refuel
Leveraging large language models (LLMs) for data labeling requires accurately estimating the model's confidence in its own outputs, so that low-confidence labels can be rejected and ensemble strategies optimized. Comparing several confidence-estimation techniques, the study found that token-level generation probabilities ("logprobs") are the most accurate method, while prompting the LLM to produce its own confidence score is notably unreliable. The experiments, run with the open-source Autolabel library across a range of NLP labeling tasks, showed that token probabilities achieved the highest AUROC scores on every dataset tested. The study underscores the importance of confidence estimation for improving labeling accuracy and points to future gains from fine-tuning verifier LLMs. For models that do not natively expose logprobs, the library can compute confidence scores by integrating with Refuel's Verifier LLM.
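To make the idea concrete, here is a minimal sketch (not Autolabel's actual implementation) of how per-token logprobs can be collapsed into a single label-level confidence score and how AUROC can measure whether that score separates correct from incorrect labels. The example data, the geometric-mean aggregation, and the 0.8 rejection threshold are all illustrative assumptions.

```python
# Sketch: label-level confidence from token logprobs, evaluated with AUROC.
# Names, data, and the aggregation choice are illustrative, not Autolabel's API.
import math
from sklearn.metrics import roc_auc_score

def label_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of token probabilities, i.e. exp(mean of logprobs)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical labeled examples: logprobs of the generated label's tokens,
# plus whether the label matched ground truth.
examples = [
    {"token_logprobs": [-0.02, -0.05], "correct": True},
    {"token_logprobs": [-0.90, -1.40, -0.30], "correct": False},
    {"token_logprobs": [-0.10, -0.04, -0.07], "correct": True},
    {"token_logprobs": [-2.10, -0.80], "correct": False},
]

confidences = [label_confidence(e["token_logprobs"]) for e in examples]
is_correct = [int(e["correct"]) for e in examples]

# AUROC measures how well confidence ranks correct labels above incorrect ones
# (1.0 = perfect separation, 0.5 = no better than chance).
print("AUROC:", roc_auc_score(is_correct, confidences))

# Rejecting labels below a confidence threshold trades coverage for accuracy.
threshold = 0.8  # illustrative cutoff
kept = [e for e, c in zip(examples, confidences) if c >= threshold]
print(f"Kept {len(kept)}/{len(examples)} labels above threshold {threshold}")
```

The geometric mean is one common way to aggregate token probabilities; other choices (minimum token probability, average probability) would plug into the same evaluation loop.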