30% of Google's Emotions Dataset is Mislabeled
Blog post from Surge AI
Google's "GoEmotions" dataset, which labels 58,000 Reddit comments with 27 emotion categories, has been criticized as significantly mislabeled: the article reports that 30% of the labels are wrong. It attributes the errors to two main causes. First, comments were given to labelers without contextual metadata. Second, the labelers may not have been familiar with US-centric English idioms, culture, or sarcasm. The article highlights specific examples in which slang or sarcastic comments were labeled as expressing negative emotions, and argues that errors of this kind undermine the dataset's reliability for training machine learning models. The critique contends that Google treated data labeling as an afterthought and failed to account for the complexity and context that accurate labeling requires. It calls for a more sophisticated approach: culturally and contextually aware labelers, backed by robust labeling infrastructure, to produce the high-quality datasets that effective AI models depend on.