
Improving any OpenAI Language Model by Systematically Improving its Data

Blog post from Cleanlab

Post Details
Authors: Chris Mauck, Jonas Mueller
Word Count: 1,898
Language: English
Summary

This article demonstrates how data-centric AI tools such as cleanlab can improve a fine-tuned Large Language Model (LLM) by improving the dataset itself rather than changing the model architecture or hyperparameters. Using OpenAI's Davinci model on a politeness classification task, the authors report a 37% boost in test-set performance after removing examples with automatically flagged label issues and then fine-tuning the LLM on the filtered dataset. They report similar gains for other state-of-the-art LLMs, Ada and Curie. The authors also introduce Cleanlab Studio, a no-code solution for efficiently fixing label errors in the dataset, and report that it reduces the model's error rate by 37%. The article highlights the benefit of data-centric AI tools like cleanlab: they help engineer better data systematically through automation, freeing domain experts to focus on applying their unique knowledge.
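The filter-then-fine-tune step described in the summary can be sketched roughly as follows. This is a minimal illustration, not the authors' code and not cleanlab's actual algorithm: it uses a simplified per-class confidence threshold as a stand-in for automated label-issue detection. The function names `flag_label_issues` and `drop_flagged`, the two-class encoding (0 = impolite, 1 = polite), the example texts, and the predicted probabilities are all hypothetical. In practice the cleanlab library offers `cleanlab.filter.find_label_issues`, which takes the given labels and out-of-sample predicted probabilities.

```python
from collections import defaultdict


def flag_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with confident model predictions.

    Simplified heuristic (an assumption for illustration, not cleanlab's method):
    an example is flagged when its predicted probability for the given label is
    below that class's average self-confidence AND some other class reaches its
    own average self-confidence.
    """
    # Per-class threshold: mean predicted probability of class k
    # over the examples that are labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Other classes the model is confident about for this example.
        confident_others = [
            k for k in thresholds if k != y and p[k] >= thresholds[k]
        ]
        if confident_others and p[y] < thresholds[y]:
            flagged.append(i)
    return flagged


def drop_flagged(examples, flagged):
    """Keep only the examples whose index was not flagged."""
    skip = set(flagged)
    return [ex for i, ex in enumerate(examples) if i not in skip]


# Hypothetical toy data: 0 = impolite, 1 = polite.
texts = [
    "Thanks so much for your help!",
    "Fix this now.",
    "Could you please take a look?",   # labeled impolite by mistake
    "This is useless, redo it.",
    "I really appreciate the review.",
]
labels = [1, 0, 0, 0, 1]
# Hypothetical out-of-sample predicted probabilities [P(impolite), P(polite)].
pred_probs = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.9, 0.1],
    [0.2, 0.8],
]

flagged = flag_label_issues(labels, pred_probs)
cleaned = drop_flagged(list(zip(texts, labels)), flagged)
print(flagged)       # [2]
print(len(cleaned))  # 4
```

The filtered `cleaned` list would then be the training set passed to fine-tuning. The predicted probabilities should come from a model that did not see each example during training (for instance via cross-validation), otherwise the model tends to be overconfident in the given labels and fewer issues are flagged.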