Company:
Date Published:
Author: Chris Mauck, Jonas Mueller
Word count: 1898
Language: English
Hacker News points: None

Summary

This article demonstrates how data-centric AI tools like cleanlab can improve a fine-tuned Large Language Model (LLM) by optimizing the dataset itself rather than altering the model architecture or hyperparameters. Using OpenAI's Davinci model, the authors achieve a 37% boost in test-set performance on a politeness classification task by removing data with automatically flagged label issues and then fine-tuning the LLM on the filtered dataset. Similar gains are achieved with other state-of-the-art OpenAI models, Ada and Curie. The authors also present a no-code workflow for efficiently fixing label errors in the dataset with Cleanlab Studio, which reduces the model's error rate by 37%. The article highlights how data-centric AI tools like cleanlab help systematically engineer better data via automation, freeing up domain experts to focus on their unique knowledge.
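
The filtering step described above can be sketched roughly as follows: cleanlab's find_label_issues flags likely mislabeled examples from out-of-sample predicted probabilities, and those examples are dropped before fine-tuning. The synthetic dataset, the cross-validated baseline classifier, and the variable names below are illustrative placeholders, not the article's exact code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Synthetic stand-in for the politeness dataset: feature vectors (e.g. text
# embeddings) plus noisy integer class labels. Replace with your real data.
features, labels = make_classification(
    n_samples=500, n_classes=3, n_informative=5, random_state=0
)

# cleanlab needs out-of-sample predicted probabilities; a cross-validated
# baseline classifier is used here as a stand-in for the fine-tuned LLM.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    features, labels, cv=5, method="predict_proba",
)

# Boolean mask marking examples cleanlab flags as likely mislabeled.
issue_mask = find_label_issues(labels=labels, pred_probs=pred_probs)

# Keep only the cleanly labeled examples; this filtered set would then be
# exported (e.g. to JSONL) for fine-tuning the LLM.
clean_features, clean_labels = features[~issue_mask], labels[~issue_mask]
print(f"Removed {issue_mask.sum()} of {len(labels)} examples flagged by cleanlab")
```

The same mask-and-filter pattern applies whether the predicted probabilities come from a simple baseline, as here, or from the fine-tuned LLM itself evaluated via cross-validation.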