Back to Research
May 16, 2026Dataset4 min read

Meddies QA: A Vietnamese Medical Question-Answer Dataset in Five Domain Groups

A monolingual Vietnamese medical question-answer dataset in five domain groups: 2,941,561 paired QA rows and 7,109,244 question rows, CC-BY-NC research use only.

Meddies Research

Clinical AI research at Meddies

Meddies QA: A Vietnamese Medical Question-Answer Dataset in Five Domain Groups

The problem this release solves is supply. Vietnamese healthcare LLM work needs question-answer data that starts from grounded medical sources, and that supply has been thin. Meddies QA is one answer to that gap.

The supply problem in Vietnamese medical QA

Good medical question-answer data does not begin with questions. It begins with grounded healthcare information: medical articles, drug references, disease explainers, product descriptions, and patient-facing education material. A model trained on QA that floats free of that base learns answer shapes without learning the domain underneath them. The hard part is the conversion. You have to keep the grounding while turning static reference text into the question-and-answer turns a chat model can train on, and you have to do it at a scale that covers more than one corner of medicine.

Meddies QA takes that conversion as its job. It reorganizes Vietnamese medical question-answer data into five domain groups, exposes it as chat-template QA rows, and adds user-only question-bank configs for teams that want to generate their own answers. The release targets supervised fine-tuning, answer-style adaptation, retrieval-answer experiments, and question-only generation pipelines.

How it was built

How Meddies QA is built: literature-grounded medical text becomes Vietnamese questions and answers, organized into five domain groups.

How Meddies QA is built: literature-grounded medical text becomes Vietnamese questions and answers, organized into five domain groups.

The build runs as one argument. Start from literature-grounded medical text. Turn it into Vietnamese questions and answers so a training run can learn domain coverage, answer structure, and medically relevant response patterns from organized examples. Then organize the result so a team can pick its slice by medical area rather than by the source-folder names the original data happened to carry.

That organizing choice is deliberate. The five domain groups are broad on purpose, so users choose compute budgets by medical area instead of by source structure. The release sits between medical-question collections and downstream LLM training. The pipeline takes QA and question-bank inputs, organizes them into training-ready configs, and cuts domain slices for smaller experiments. The answers themselves are machine-generated and cleaned from a source medical corpus, which is the reason the card asks readers to review them before any product, educational, or patient-facing use.

The schema reflects the same restraint. Each row carries messages, question_category, complexity, and domain_group. QA rows hold a user turn then an assistant turn; question rows hold the user turn only. The standalone question, answer, raw_question, and raw_answer fields are dropped from the public training splits, leaving the chat-template field as the one a trainer reads.

What is in it

The repository exposes two config families. The qa_* configs are two-turn chat-template rows with a user question and an assistant answer. The questions_* configs are one-turn user prompts only, included for teams with compute who want to generate or filter their own answers. The default config is qa_all.

ConfigRowsMessage shapeWhat it gives you
qa_all2,941,5612 turnsAll question-answer pairs for chat-template training.
qa_clinical_health1,009,9882 turnsDisease, prevention, anatomy, mother-and-baby, and general clinical-health QA.
qa_pharmaceuticals1,245,4772 turnsMedicines, active substances, vaccines, excipients, herbs, and drug-reference QA.
qa_nutrition_supplements419,3472 turnsNutrition, formula, dietary support, and supplement-product QA.
qa_consumer_health_personal_care197,7042 turnsPersonal-care, cosmetics, and consumer health product QA.
qa_medical_devices_equipment69,0452 turnsMedical equipment, monitoring devices, and assistive product QA.
questions_all7,109,2441 turnAll user-only prompts for answer generation or distillation workflows.
questions_clinical_health3,428,3641 turnDisease, prevention, anatomy, mother-and-baby, and general clinical-health prompts.
questions_pharmaceuticals2,035,1391 turnMedicines, active substances, vaccines, excipients, herbs, and drug-reference prompts.
questions_nutrition_supplements727,0011 turnNutrition, formula, dietary support, and supplement-product prompts.
questions_consumer_health_personal_care845,2001 turnPersonal-care, cosmetics, and consumer health product prompts.
questions_medical_devices_equipment73,5401 turnMedical equipment, monitoring devices, and assistive product prompts.

Two numbers anchor the scale. The QA side holds 2,941,561 chat-template pairs. The question bank holds 7,109,244 user-only prompts, a deeper well for teams that bring their own answer generation. The gap between them is the point of the split: more questions exist than the release answers, so the prompts can drive distillation or filtering at a larger volume than the supervised pairs allow.

Reading the same data by medical area shows where the mass sits. Pharmaceuticals and clinical health carry most of the QA rows, while devices and equipment form the smallest slice on both sides.

Domain groupQA rowsQuestion-bank rows
clinical_health1,009,9883,428,364
pharmaceuticals1,245,4772,035,139
nutrition_supplements419,347727,001
consumer_health_personal_care197,704845,200
medical_devices_equipment69,04573,540

The shape across the two columns is not uniform. Clinical health and consumer health both carry far more prompts than answered pairs, so the question bank is where their depth lives. Pharmaceuticals runs closer to balanced. Devices stays small in both. A team picking a slice should read both columns, not just the QA count.

What it enables, and where it stops

The good fits are concrete. Vietnamese medical QA supervised fine-tuning. Domain-specific loading for pharmaceuticals, clinical health, supplements, personal care, or devices. Answer-generation pipelines that start from question-only prompts. Retrieval or evaluation prototypes that need large Vietnamese medical prompts grouped by broad domain. Loading is a single call:

from datasets import load_dataset

qa = load_dataset("Meddies/meddies-qa", "qa_all", split="train")
clinical = load_dataset("Meddies/meddies-qa", "qa_clinical_health", split="train")
questions = load_dataset("Meddies/meddies-qa", "questions_all", split="train")

The stops matter as much as the fits. This is a research and training dataset, not a clinical decision system, and not medical advice. The assistant answers are machine-generated and can carry clinical, factual, or style errors, which is why review comes before any patient-facing use. The domain groups are broad routing labels, not expert-reviewed medical ontologies, so they sort rows for compute budgeting rather than certify clinical taxonomy. Rejected rows are kept for review but stay out of the main training configs.

Meddies builds Vietnamese healthcare datasets, evaluation systems, and clinician-facing AI infrastructure, and the design intent here is plain: convert grounded medical text into training-ready Vietnamese QA without losing the grounding. The license is cc-by-nc-4.0, non-commercial. Commercial use needs a conversation first, at contact@meddies.ai. The dataset lives at huggingface.co/datasets/Meddies/meddies-qa, and feedback on duplicated rows, unsafe answers, schema issues, or confusing domain placement belongs in a discussion on the repo, with the config name and row evidence attached.