Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining
Blog post from HuggingFace
The article explores pretraining medical encoders on heterogeneous web data rather than hand-curated corpora, which are often limited in scale and diversity, particularly outside English. It proposes a three-stage data curation pipeline: medical-term density filtering, multi-axis annotation, and signal-amplifying rephrasing with large language models (LLMs). The pipeline yields FineMed, a French medical pretraining corpus; its rephrased subset, FineMed-rephrased; and the DoctoBERT family of encoders. In the reported evaluation, these encoders improve on existing models across a range of French medical natural language processing tasks, and combining rephrased data with filtered web data outperforms traditional curation methods. The results suggest that drawing on the scale and heterogeneity of web data is a competitive alternative to narrow, hand-curated corpora. The authors plan to extend the approach to multilingual settings.
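To make the first stage concrete, here is a minimal sketch of what medical-term density filtering could look like. The summary does not say how density is computed, so everything here is an assumption for illustration: the tiny `MEDICAL_TERMS` lexicon, the `medical_term_density` and `filter_by_density` helpers, and the `0.15` threshold are all hypothetical, not the authors' implementation.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real pipeline would
# presumably rely on a far larger medical vocabulary.
MEDICAL_TERMS = {
    "diabète", "insuline", "hypertension",
    "patient", "traitement", "diagnostic",
}

# \w is Unicode-aware for str patterns in Python 3, so accented French
# letters stay inside their tokens.
TOKEN_RE = re.compile(r"\w+")


def medical_term_density(text, lexicon=MEDICAL_TERMS):
    """Return the fraction of tokens in `text` that appear in `lexicon`."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)


def filter_by_density(docs, threshold=0.15):
    """Keep documents whose medical-term density reaches `threshold`.

    The threshold is an assumed value, not one reported in the article.
    """
    return [d for d in docs if medical_term_density(d) >= threshold]


docs = [
    "Le patient suit un traitement par insuline pour son diabète.",
    "La recette demande deux œufs et un peu de farine.",
]
kept = filter_by_density(docs)  # only the first, medical sentence is kept
```

A simple ratio like this is cheap enough to run over web-scale text, which is presumably why a density filter comes before the more expensive annotation and LLM rephrasing stages.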
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | Month-over-Month Change |
|---|---|---|---|---|---|
| LLM | 15 | 5,172 | 1,006 | 220 | -43% |