Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining

Post Details

Company

HuggingFace

Date Published

June 20, 2026

Author

bofeng huang, Sun Jacques, Diane Bouchacourt, Nicolas Barascud, and Fajwel Fogel

Word Count

2,019

Company Posts That Month

90

Source URL

huggingface.co/blog/bofenghuang/doctobert-fr-release

Summary

The article describes an approach to pretraining medical encoders on heterogeneous web data rather than on hand-curated corpora, which are often limited in scale and diversity, particularly for languages other than English. The method is a three-stage data curation pipeline: medical-term density filtering, multi-axis annotation, and signal-amplifying rephrasing with large language models (LLMs), each intended to make web-sourced data more useful for medical pretraining. The pipeline yields FineMed, a French medical pretraining corpus, its rephrased subset FineMed-rephrased, and the DoctoBERT family of encoders. In the reported evaluation, these encoders outperform existing models on a range of French medical natural language processing tasks, and combining rephrased data with filtered web data outperforms traditional curation methods. The results suggest that drawing on the scale and heterogeneity of web data is a competitive alternative to narrow, hand-curated corpora. The authors plan to extend the approach to multilingual settings.
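
To make the first pipeline stage concrete, here is a minimal sketch of what a medical-term density filter could look like. It is not the authors' implementation: the tiny lexicon, the function names, the tokenization, and the 0.1 threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical mini-lexicon (assumption); a real pipeline would rely on a
# much larger medical vocabulary.
MEDICAL_TERMS = {"patient", "insuline", "hypertension", "diagnostic"}

TOKEN_RE = re.compile(r"\w+")


def medical_term_density(text, lexicon=MEDICAL_TERMS):
    """Return the fraction of tokens in `text` that appear in `lexicon`."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)


def filter_by_density(docs, threshold=0.1):
    """Keep documents whose medical-term density reaches `threshold`.

    The default threshold is an illustrative assumption, not a value
    taken from the article.
    """
    return [d for d in docs if medical_term_density(d) >= threshold]


docs = [
    "Le patient prend de l'insuline contre son hypertension.",
    "Recette de tarte aux pommes",
]
kept = filter_by_density(docs)
```

Under these assumptions, only the first document survives the filter, since three of its nine tokens are in the lexicon while the second contains none.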

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	15	5,172	1,006	220	-43%