Qdrant for Research: The Story Behind ETH & Stanford’s MIRIAD Dataset

Post Details

Company

Qdrant

Date Published

July 23, 2025

Author

Evgeniya Sukhodolskaya & Daniel Azoulai

Word Count

983

Language

English

Hacker News Points

-

Source URL

qdrant.tech/blog/miriad-qdrant

Summary

Researchers from ETH Zurich and Stanford have developed MIRIAD, an open-source dataset of 5.8 million medical question-answer pairs, each grounded in peer-reviewed literature, to address the lack of structured, high-quality data in medical AI. Built on the Semantic Scholar Open Research Corpus, the dataset aims to reduce hallucinations in medical AI applications by serving as a context-rich knowledge base for Retrieval-Augmented Generation (RAG) and as a resource for improving embedding models. Qdrant, chosen for its simplicity, speed, scalability, and open-source nature, powers MIRIAD's storage and retrieval experiments. The dataset has shown improvements on medical QA benchmarks and in hallucination detection, and it is openly available on Hugging Face for replication and benchmarking. The researchers aim to update MIRIAD annually and plan further integration with Qdrant, with potential applications in medical AI such as medical QA agents and discipline explorers.