Back to Research
May 24, 2026Dataset5 min read

Meddies Hospital Synthetic: A Vietnamese Clinical-Reasoning Dataset for Multi-Audience Supervised Fine-Tuning

A Vietnamese dataset of 5,214 clinical-reasoning tasks across 8 hospital domains and 4 audiences, distilled from 444,694 candidates at a 1.17% acceptance rate.

Meddies Research

Clinical AI research at Meddies

Meddies Hospital Synthetic: A Vietnamese Clinical-Reasoning Dataset for Multi-Audience Supervised Fine-Tuning

The body is grounded only in the provided README. Returning the finished markdown.

Most clinical LLM evaluation stops at the multiple-choice vignette. Pick the right diagnosis from five options, score the model, move on. That tests one slice of one job. Running a hospital is a wider problem, and almost none of the public Vietnamese training data covers the wider part.

Look at the actual work. Doctors explain pathology to patients in plain Vietnamese. Students learn drug mechanisms before they prescribe. Charge nurses balance bed capacity against surge plans. Administrators read budget variance and write process-improvement memos. A model trained only on MCQ corpora has never seen most of that. The gap is not difficulty. It is coverage. The shape of hospital text, who it is for, and the reasoning behind it are simply absent from vignette datasets.

Meddies Hospital Synthetic is a Vietnamese training set for the rest of that work. It holds 5,214 reasoning tasks across eight hospital competency domains and four audiences, distilled from 444,694 generated samples by a multi-judge rubric. That is a 1.17% acceptance rate. Every row is shaped for chat-format supervised fine-tuning: one <think>...</think> reasoning trace, then a committed final answer, in the style of modern reasoning models.

How a 1.17% acceptance rate happens

How Meddies Hospital Synthetic is built: persona seeding, task and answer generation, self-refinement, and multi-judge evaluation with safety gates that accept 1.17% of candidates.

How Meddies Hospital Synthetic is built: persona seeding, task and answer generation, self-refinement, and multi-judge evaluation with safety gates that accept 1.17% of candidates.

The dataset is an acceptance bucket, not a generation log. The argument for trusting it lives in what got thrown away.

Each row starts with a persona seed: a synthetic Vietnamese patient or learner profile that fixes the scene. Audience, social context, medical background. From that seed, an LLM generates a competency-tagged scenario in one of nine format shapes, with explicit perturbation flags written in on purpose. Negation. Irrelevant context. Unit variation. Unreliable narrator. Information overload. The task is built to be hard before any answer exists.

A second pass produces a structured reasoning trace and a draft answer. Then up to two remediation loops critique and rewrite that draft, and the final committed answer is what closes the row. The self-correction is preserved inside the <think> block as natural Vietnamese prose, "Để tôi xem xét lại câu trả lời của mình.", "Let me reconsider my answer", rather than as synthetic XML. A model trained on this learns the move, not the tag.

The last stage is the filter. Domain-specific rubrics, for example DIAGNOSTIC_REASONING, score each answer against safety-pass gates and dimension-level criteria. Only rows that pass every judge reach accepted, and the published subset is exactly that bucket. 444,694 candidates went in. 5,214 came out. The acceptance rate is the dataset's main quality signal, and it is the reason the release is small.

What the rows actually contain

Each row is a Vietnamese clinical or hospital-management task with a fully reasoned answer, and the schema loads directly into a chat-format trainer. The fields carry the structure a fine-tuning run needs.

FieldTypeWhat it gives you
messageslist<{role, content}>Two-turn chat: user prompt, assistant answer with <think> reasoning
questionstringThe user prompt, also available as messages[0].content
domainstringOne of 8 top-level competency areas (see below)
categorystringSub-category leaf (e.g. chief_complaint_analysis, bed_capacity_management)
audiencestringPATIENT, DOCTOR, STUDENT, or MANAGER — who the answer is for
difficultystring5 levels from LEVEL_1_BASIC to LEVEL_5_EDGE_CASES
format_typestringOne of 9 task shapes (long answer, list, MCQ, calculation, procedure, …)
optionslist<struct> or nullMCQ choices when format_type is MCQ_SINGLE or MCQ_MULTIPLE
perturbationstructAdversarial-robustness flags (negation, irrelevant context, unit variation, …)
id, created_atstringRow identity and generation timestamp

The eight domains are uneven, and the imbalance is the honest picture of hospital text. Communication and documentation carries more than half the rows because patient-facing prose is the bulk of what a hospital writes.

DomainRowsWhat it covers
Communication & Documentation2,605Patient education, bad-news delivery, shared decision-making, SOAP notes, discharge summaries, referral letters, handoff
Clinical Reasoning1,079Chief complaint analysis, differentials, pattern recognition, lab/imaging/ECG interpretation, triage and admission decisions
Quantitative Skills560Anthropometric/fluid/pediatric/obstetric calculations, severity and prognostic scores, weight-based and organ-adjusted dosing
Medical Sciences478Disease and drug mechanisms, pharmacokinetics, physiology, immunology, risk-factor and epidemiology basics
Hospital Operations163Bed capacity, staff scheduling, budget variance, accreditation, surge planning, business continuity, legal-risk management
Procedures & Diagnostics158Test and imaging selection, resuscitation, airway, vascular access, emergency procedures
Ethics & Safety138Autonomy, informed consent, treatment refusal, truth-telling, error prevention, medication safety
Therapeutics & Management33Initial management, medication selection, response monitoring, follow-up, chronic-disease management

Audience is structural, not a label

The four audiences are not a tagging convenience. The audience-domain pairing is built in: a doctor never gets a patient-education task, a manager never gets a clinical-reasoning case. Each row is written for one specific reader, which is what makes audience-conditioned experiments possible. Same domain, different reader, different answer.

AudienceRowsDomains it covers
PATIENT2,241Communication & Documentation (patient-facing prose)
DOCTOR1,772Clinical Reasoning, clinical Documentation, Procedures, Therapeutics, Ethics
STUDENT1,038Medical Sciences, Quantitative Skills
MANAGER163Hospital Operations

Difficulty spans the full curriculum rather than clustering at the easy or expert ends, and the format mix favours open-ended generation while keeping an MCQ slice for compatibility with vignette-style benchmarks. The numbers below come from the dataset card, not from an independent audit.

SliceRows%
LEVEL_1_BASICLEVEL_2_INTERMEDIATE2,54548.8%
LEVEL_3_ADVANCEDLEVEL_4_EXPERT2,30944.3%
LEVEL_5_EDGE_CASES3606.9%
Open-ended (LONG_ANSWER, LIST_GENERATION, SHORT_ANSWER, CASE_ANALYSIS)3,91875.1%
MCQ (MCQ_SINGLE, MCQ_MULTIPLE)3546.8%
Calculation, Procedure, Consultation94218.1%

What it enables, and where it stops

This release is built for SFT, not eval. A good fit is fine-tuning a Vietnamese reasoning model to handle hospital work rather than clinical Q&A alone: multi-audience instruction tuning across patient, doctor, student, and manager, hospital-management capability training, mixed open-ended and MCQ supervision, and reasoning-trace alignment through the preserved <think> blocks. The audience tags also support audience-conditioned generation experiments.

The limits are real and stated plainly. This is synthetic data, generated and judged by LLMs. It is not a benchmark, not a prevalence reference, and not a clinical decision tool. The acceptance gate is rubric-based, not clinician-reviewed at the row level, so some rows reflect generator priors: Vietnamese cultural framing, drug-naming conventions, ICD-10 coding habits. The domain mix is uneven by design, and a model fine-tuned on it will need extra balancing if downstream eval weights all domains equally. Hospital Operations is informative but small at 163 rows. Treat it as seed material for further generation, not a finished management corpus.

The dataset lives at huggingface.co/datasets/Meddies/meddies-hospital-synthetic under the cc-by-nc-4.0 license. The most useful thing you can send back is failure: rows where the reasoning is wrong, audience-tone mismatches, missing Vietnamese clinical context, management scenarios that do not match a real Vietnamese hospital, judge-rubric blindspots. For commercial use or collaboration, contact contact@meddies.ai.