Finding errors in datasets with Similarity Search

Company

Qdrant

Date Published

July 18, 2022

Author

George Panchuk

Word count

954

Language

English

Hacker News points

None

URL

qdrant.tech/articles/dataset-quality

Summary

Categorizing data, for example the product catalog of an online furniture marketplace, is error-prone and resource-intensive whether it is done by manual labeling or by a classification model. Categorization errors can lead to user dissatisfaction and hurt company revenue. Similarity search and diversity search can help find these errors. Similarity search measures the semantic similarity between data elements: category titles and item images are encoded as embeddings, and an item whose vector lies far from the vector of its category title is a candidate for being miscategorized. Diversity search instead looks for the most distinctive examples in a dataset, which helps surface errors that similarity search might miss. The two methods can be combined, and the process can be made more efficient and more accurate by caching embeddings or by fine-tuning the model with similarity learning.
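
The two ideas in the summary can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`flag_suspects`, `diversity_search`), the threshold, and the toy 3-dimensional vectors, which stand in for embeddings that a real model would produce. The diversity step uses greedy farthest-point selection, one common way to pick distinctive examples, which may differ from the method used in the article.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two vectors: 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def flag_suspects(anchor, items, threshold):
    """Similarity search: return the names of items whose embedding is
    farther than `threshold` from the category anchor, farthest first."""
    scored = [(cosine_distance(anchor, vec), name) for name, vec in items.items()]
    return [name for dist, name in sorted(scored, reverse=True) if dist > threshold]

def diversity_search(items, k):
    """Diversity search (greedy farthest-point selection): start from the
    first item, then repeatedly add the item whose minimum distance to
    the already selected items is largest."""
    names = list(items)
    selected = [names[0]]
    while len(selected) < min(k, len(names)):
        best = max(
            (n for n in names if n not in selected),
            key=lambda n: min(cosine_distance(items[n], items[s]) for s in selected),
        )
        selected.append(best)
    return selected

# Hypothetical embeddings: the anchor stands in for the vector of the
# category title "chairs", the items for vectors of product images.
chairs_anchor = [1.0, 0.0, 0.0]
chairs_items = {
    "office chair": [0.9, 0.1, 0.0],
    "armchair": [0.8, 0.2, 0.1],
    "floor lamp": [0.0, 0.1, 1.0],
}

suspects = flag_suspects(chairs_anchor, chairs_items, threshold=0.5)
diverse = diversity_search(chairs_items, k=2)
print(suspects)  # items that look out of place in the "chairs" category
print(diverse)   # a small, maximally spread-out sample of the category
```

In this toy data the lamp is the only item far from the category anchor, so it is the one flagged for review. In practice the threshold would need tuning per dataset, and flagged items are candidates for manual inspection rather than confirmed errors.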