Perplexity is a reliable metric for determining how important the data points in a cluster are for training an LLM. It measures a model's performance as the inverse probability the model assigns to the true sequence, normalized by the number of tokens; equivalently, it is the exponential of the average negative log-likelihood per token. A lower perplexity score indicates better prediction accuracy and higher confidence, while a higher score suggests the model finds the text less fluent or coherent.

By clustering data points with agglomerative clustering, assigning each point to a cluster based on its embedding similarity, we can identify the most important data points for training. To eliminate redundant training data, we compute the perplexity of a small sample from each cluster, filter out the clusters with low perplexity scores, and retain only those with high scores. This pruning reduces the dataset size by about 40%, while the model trained on the thinned dataset achieves slightly better performance.
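The pipeline above can be sketched in pure Python. Everything here is a hypothetical stand-in: the 2-D "embeddings", the synthetic token log-probabilities, the sample size, and the threshold are illustrative values, and the toy single-linkage routine stands in for a production agglomerative-clustering implementation (e.g. a library one). Perplexity of real documents would of course come from an actual LLM's log-probabilities.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the negative mean token log-probability: the
    inverse probability of the sequence, normalized by its length."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def agglomerative_clusters(embeddings, n_clusters):
    """Toy single-linkage agglomerative clustering over embedding
    vectors (illustration only; a real pipeline would use a library
    implementation)."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        return min(math.dist(embeddings[i], embeddings[j])
                   for i in a for j in b)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

def prune_clusters(clusters, doc_ppl, threshold):
    """Score each cluster by the mean perplexity of a small sample of
    its members; keep only high-perplexity clusters and drop the
    low-perplexity (redundant) ones."""
    kept = []
    for members in clusters:
        sample = members[:2]  # hypothetical sample size
        score = sum(doc_ppl[i] for i in sample) / len(sample)
        if score >= threshold:
            kept.append(members)
    return kept

# Hypothetical 2-D "embeddings" for nine documents in three groups.
embeddings = [(0, 0), (0, 1), (1, 0),
              (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1), (21, 0)]

# Synthetic per-token log-probabilities: the first group is "easy"
# (high probability, ppl ~1.1), the other two are harder (ppl ~7.4).
doc_ppl = [perplexity([-0.1] * 20)] * 3 + [perplexity([-2.0] * 20)] * 6

clusters = agglomerative_clusters(embeddings, n_clusters=3)
kept = prune_clusters(clusters, doc_ppl, threshold=2.0)
# The low-perplexity cluster (documents 0-2) is dropped; the two
# high-perplexity clusters are retained for training.
```

The threshold here is a fixed constant for clarity; in practice it would be chosen from the perplexity distribution across clusters (e.g. a percentile) to hit a target reduction such as the ~40% mentioned above.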