Perplexity is a reliable metric for determining how important the data points in a cluster are for training an LLM. It measures a model's performance as the inverse probability the model assigns to the true sequence, normalized by the number of tokens; equivalently, it is the exponential of the average negative log-likelihood per token. A lower perplexity score indicates better prediction accuracy and higher confidence, while a higher score suggests the model finds the text less fluent or coherent.

By clustering data points with agglomerative clustering, assigning each point to a cluster based on its embedding similarity, we can identify the most important data points for training. To eliminate redundant training data, we compute the perplexity of a small sample from each cluster, filter out the clusters with low perplexity scores, and retain only those with high scores. This pruning reduces the dataset size by about 40%, while the model trained on the thinned dataset achieves slightly better performance.
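The pipeline above can be sketched in pure Python. Everything here is a hypothetical stand-in: the 2-D "embeddings", the synthetic token log-probabilities, the sample size, and the threshold are illustrative values, and the toy single-linkage routine stands in for a production agglomerative-clustering implementation (e.g. a library one). Perplexity of real documents would of course come from an actual LLM's log-probabilities.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the negative mean token log-probability: the
    inverse probability of the sequence, normalized by its length."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def agglomerative_clusters(embeddings, n_clusters):
    """Toy single-linkage agglomerative clustering over embedding
    vectors (illustration only; a real pipeline would use a library
    implementation)."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        return min(math.dist(embeddings[i], embeddings[j])
                   for i in a for j in b)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

def prune_clusters(clusters, doc_ppl, threshold):
    """Score each cluster by the mean perplexity of a small sample of
    its members; keep only high-perplexity clusters and drop the
    low-perplexity (redundant) ones."""
    kept = []
    for members in clusters:
        sample = members[:2]  # hypothetical sample size
        score = sum(doc_ppl[i] for i in sample) / len(sample)
        if score >= threshold:
            kept.append(members)
    return kept

# Hypothetical 2-D "embeddings" for nine documents in three groups.
embeddings = [(0, 0), (0, 1), (1, 0),
              (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1), (21, 0)]

# Synthetic per-token log-probabilities: the first group is "easy"
# (high probability, ppl ~1.1), the other two are harder (ppl ~7.4).
doc_ppl = [perplexity([-0.1] * 20)] * 3 + [perplexity([-2.0] * 20)] * 6

clusters = agglomerative_clusters(embeddings, n_clusters=3)
kept = prune_clusters(clusters, doc_ppl, threshold=2.0)
# The low-perplexity cluster (documents 0-2) is dropped; the two
# high-perplexity clusters are retained for training.
```

The threshold here is a fixed constant for clarity; in practice it would be chosen from the perplexity distribution across clusters (e.g. a percentile) to hit a target reduction such as the ~40% mentioned above.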