30% of Google's Emotions Dataset is Mislabeled

Post Details

Company: Surge AI

Date Published: July 11, 2022

Author: Edwin Chen

Word Count: 1,996

Language: English

Hacker News Points: -

Source URL: surgehq.ai/blog/30-percent-of-googles-reddit-emotions-dataset-is-mislabeled

Summary

Google's "GoEmotions" dataset, a collection of 58,000 Reddit comments labeled with 27 emotion categories, is criticized as significantly mislabeled: the article reports that 30% of the labels are wrong. It attributes the errors to two main causes: comments were shown to labelers without contextual metadata, and the labelers may not have been familiar with US-centric English idioms, culture, or sarcasm. The article gives specific examples, such as slang or sarcastic comments labeled as negative emotions, and argues that errors like these undermine the dataset's reliability for training machine learning models. The critique stresses the importance of high-quality data and suggests that Google treated data labeling as an afterthought, without accounting for the complexity and context that accurate labeling requires. It advocates a more sophisticated approach: culturally and contextually aware labelers, supported by robust labeling infrastructure, to produce the high-quality datasets that effective AI models depend on.