Tools to Improve Training Data - Talking Language AI Ep#2

Post Details

Field	Value
Company	Cohere
Date Published	Jan. 23, 2023
Author	Jay Alammar
Word Count	915
Company Posts That Month	9
Language	English
Hacker News Points	-
Post removed?	No
Source URL	cohere.com/blog/tools-to-improve-training-data

Summary

In the second episode of a series on applied NLP, Jay Alammar talks with Vincent Warmerdam, a machine learning engineer at Explosion, about tools for improving training data quality. Vincent, known for his work on NLP tools for the scikit-learn ecosystem, demonstrates several tools for data preprocessing and labeling. They target a common problem: a poorly labeled dataset can produce good accuracy metrics while the model still makes faulty predictions. The session covers Human-learn for building human-based scikit-learn components, Doubtlab for identifying doubtful labels, Embetter for using embeddings in scikit-learn, and Bulk for labeling examples in bulk with the help of embeddings. These tools aim to make data preparation more transparent and to support human involvement more effectively. The discussion closes by encouraging further exploration and conversation on Discord.
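To make the idea of "doubtful labels" concrete, here is a minimal sketch in plain Python. It does not use Doubtlab or reproduce its API. The function name `doubtful_indices`, the `max_confidence` threshold, and the probability values are all hypothetical, chosen only for illustration. The sketch assumes a classifier has already produced class probabilities for each labeled row, and it flags a row when the model disagrees with the given label or is unsure of its own prediction.

```python
from typing import List, Sequence


def doubtful_indices(
    probas: Sequence[Sequence[float]],
    labels: Sequence[int],
    max_confidence: float = 0.6,
) -> List[int]:
    """Return row indices whose labels deserve a second look.

    A row is flagged when the model's top class disagrees with the given
    label, or when the model's highest class probability is below
    `max_confidence` (the model is unsure). Both rules are illustrative
    assumptions, not the behavior of any specific library.
    """
    flagged = []
    for i, (row, label) in enumerate(zip(probas, labels)):
        confidence = max(row)
        predicted = row.index(confidence)
        if predicted != label or confidence < max_confidence:
            flagged.append(i)
    return flagged


# Hypothetical predicted probabilities for a 2-class problem (made-up values).
probas = [
    [0.95, 0.05],  # confident and agrees with label 0
    [0.10, 0.90],  # confident, but the label says 0, so it is flagged
    [0.55, 0.45],  # agrees with label 0, but confidence is low, so it is flagged
    [0.20, 0.80],  # confident and agrees with label 1
]
labels = [0, 0, 0, 1]

print(doubtful_indices(probas, labels))  # [1, 2]
```

The flagged rows are candidates for manual review rather than confirmed errors, which fits the summary's emphasis on keeping a human involved in data preparation.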

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	4	292	59	28	+7%
Vector Search	2	307	67	38	+12%
