Content Deep Dive

Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published:
Author: bofeng huang, Sun Jacques, Diane Bouchacourt, Nicolas Barascud, and Fajwel Fogel
Word Count: 2,019
Company Posts That Month: 90
Language: -
Hacker News Points: -
Summary

The article describes an approach to pretraining medical encoders on heterogeneous web data rather than on traditional hand-curated corpora, which are often limited in scale and diversity, particularly for languages other than English. The method is a three-stage data curation pipeline: medical-term density filtering, multi-axis annotation, and signal-amplifying rephrasing with large language models (LLMs), all aimed at making web-sourced data more useful for medical pretraining. The pipeline produced FineMed, a French medical pretraining corpus; its rephrased subset, FineMed-rephrased; and the DoctoBERT family of encoders. In evaluation, these encoders significantly improve on existing models across a range of French medical natural language processing tasks, and combining rephrased data with filtered web data outperforms traditional curation methods. The results suggest that drawing on the scale and heterogeneity of web data is a competitive alternative to narrow, hand-curated corpora. The authors plan to extend the approach to multilingual settings.
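
To make the first pipeline stage concrete, here is a minimal Python sketch of what medical-term density filtering could look like. It is not the authors' implementation: the tiny vocabulary, the regex tokenizer, the function names, and the 0.1 threshold are all assumptions chosen for illustration, and the summary does not say how the post defines or thresholds density.

```python
import re

# Hypothetical mini-vocabulary for illustration only; a real pipeline would
# presumably rely on a much larger medical lexicon.
MEDICAL_TERMS = {"insuline", "hypertension", "patient", "posologie", "diagnostic"}

# Simple word tokenizer (assumption): runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def medical_term_density(text, vocabulary=MEDICAL_TERMS):
    """Return the fraction of tokens in `text` that appear in `vocabulary`."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


def filter_by_density(documents, threshold=0.1, vocabulary=MEDICAL_TERMS):
    """Keep documents whose medical-term density is at least `threshold`.

    The default threshold is an arbitrary illustrative value.
    """
    return [
        doc for doc in documents
        if medical_term_density(doc, vocabulary) >= threshold
    ]


if __name__ == "__main__":
    docs = [
        "Le patient prend de l'insuline chaque matin.",
        "Recette de tarte aux pommes.",
    ]
    for doc in filter_by_density(docs):
        print(doc)
```

A ratio rather than a raw count is used here so that long pages are not favored merely for their length; whether the original work normalizes this way is not stated in the summary.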

Trends Found in this Post
Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM
LLM | 15 | 5,172 | 1,006 | 220 | -43%