Meddies Consultant is a CC BY-NC 4.0 research dataset with two multi-turn consultation configs, one question-answer config, and one question-only config. Its public card describes a synthetic workflow that begins with patient and disease context, continues through dialogue generation, then normalizes and reviews the result before release.

That is the supported meaning of "built from scratch" here: the published artifact is organized around generated consultations rather than presented as a machine-translated transcript corpus. The card does not publish enough provenance to prove that every upstream input was created without translated or external material.

The release contains four different tasks

The English consultation config contains 109,005 rows, averaging 16.12 turns. The Vietnamese config contains 58,064 rows, averaging 12.33 turns. Each consultation row has a target disease, a patient persona, and a message sequence.

The other two configs do a different job. RandomQA contains 67,372 two-turn question-answer examples. RandomQuestion contains 61,162 question-only prompts. They are useful for narrower experiments, but they do not have the interaction history of a consultation.

This separation matters. A dataset can contain Vietnamese medical text and still teach only one-shot answering. Consultation data has to represent what happens between the complaint and the conclusion: follow-up questions, changes in focus, missing information, and the patient's concerns.

What the card documents about dialogue design

The Consultant card says the generation scope spans 1,236 target diseases and uses interview frames such as OPQRST and FIFE. It also names review criteria for completeness, appropriateness, naturalness, empathy, OPQRST quality, FIFE alignment, structural coherence, and safety.

Those criteria show what the release is trying to preserve. They do not tell us how many rows passed each check, who reviewed them, whether clinicians agreed, or which generator produced each row. The full generation pipeline and reviewer setup are not public, so the supported claim stops at a framework-oriented generation and review process. Reproducing real patient behavior and producing clinically safe conversations would require a published evaluation.

Synthetic patients are specifications, not observations

The dataset does not establish that Vietnamese patients are generally reluctant, understated, or likely to reveal the real complaint only after several questions. Encoding those assumptions as fact would turn a cultural assumption into training data.

A synthetic patient should instead be traceable to an explicit specification. What does the persona know? Which symptom details may be disclosed? What information appears only after a relevant question? Which behaviors are varied, and why? If those rules are not documented, fluent dialogue can hide an arbitrary patient model.

The current public preview exposes another boundary: several rows in the vietnamese config have English patient_persona text with US-specific names, places, and activities. An unlocalized persona can carry that US context into training or evaluation, so a model may learn or be judged in a setting that does not match the patient the dialogue is meant to represent. We examine the mechanism in a separate post. Vietnamese dialogue is present; a fully Vietnamese upstream context is not yet demonstrated.

What this release is for

Meddies Consultant publishes inspectable training data for teams studying consultation structure, follow-up questioning, bilingual transfer, and narrower QA tasks. It is synthetic training data, not a benchmark result, a clinical protocol, or evidence that a model trained on it will consult well.

A model-level test is still missing. Train or evaluate models on the consultation and QA configs separately. Measure whether they ask relevant follow-up questions, preserve patient context, avoid premature conclusions, and fail safely when information is missing. Until those results exist, the dataset shows the shape of the intended training task, not the behavior of the finished model.

Four Training Tasks Inside Meddies Consultant

The release contains four different tasks

What the card documents about dialogue design

Synthetic patients are specifications, not observations

What this release is for

References