Back to Research
May 12, 2026Dataset8 min read

Bypassing the QA Trap: Introducing the meddies-consultant Dataset

meddies-consultant is an open dataset of synthetic Vietnamese and English clinical consultations, built so models learn to consult rather than quiz.

Meddies Research

Clinical AI research at Meddies

Bypassing the QA Trap: Introducing the meddies-consultant Dataset

Here is meddies-consultant: an open-source dataset of synthetic Vietnamese clinical consultations.

We built this because there is a massive shortage of privacy-safe clinical data. Instead of just prompting an LLM and hoping for the best, we approached data generation as a strict mechanism design. We used reasoning models to build medical dialogues from the ground up, which gives us total control over the accuracy of the conversations without ever risking real patient data.

1. The Problem We're Solving

Look at most medical AI models today and you'll spot the same flaw. They are trained to pass licensing exams, not talk to patients.

Historically, healthcare AI runs on single-turn Q&A data. The user inputs a symptom, and the model fires back a diagnosis. You end up with a great medical encyclopedia that is completely useless in a real clinic.

Doctors don't hear "chest pain" and instantly diagnose a heart attack. They consult. They dig into the nature of the pain, gauge the patient's anxiety, and rule out edge cases.

If we want to build ambient scribes or triage agents that actually work, we have to stop teaching models to "answer." We need to teach them to consult.

The Privacy Wall

The best way to fix this would be using real doctor-patient transcripts. But you can't. Medical records are full of protected personal info. The law won't let you open-source them. So, engineers are stuck sitting on powerful algorithms with no actual conversational data to train them on.

2. Why Vietnamese?

The medical AI space is almost entirely English-first. When people try to build for other languages, they usually just scrape and translate Western datasets.

In healthcare, translation artifacts are dangerous. The way patients talk about pain and anxiety doesn't map cleanly across borders. Take Vietnamese: the nuances between đau nhức, đau buốt, and đau thắt are completely lost when you flatten them into the English equivalents of aching, sharp, and tight. On top of that, you have to account for cultural stoicism. Patients here report symptoms differently. If your AI doesn't understand that, it will fail in a local clinic.

Vietnamese has over 100 million native speakers. We didn't want to rely on translated leftovers, so we built meddies-consultant natively from the ground up.

3. Designing the Dataset from First Principles

Anyone can write a script to generate a million rows of text. But if you just unleash an LLM on a clinical prompt, you get garbage. The models hallucinate. They become overly agreeable, medically superficial, and completely ignore how a real consultation flows.

To fix this, we stopped writing prompts and started designing mechanisms. We forced the generation pipeline to strictly follow three standard medical frameworks:

  • The Calgary-Cambridge Guide: Dictates the actual arc of the conversation (opening the session, gathering info, closing).

  • OPQRST: Forces the AI doctor to systematically drill down into the symptoms (Onset, Provocation, Quality, etc.).

  • FIFE: Programs the patient's internal state (Feelings, Ideas, Function, Expectations).

That last one is the secret. We injected FIFE directly into the synthetic patient personas. Real patients aren't clean data points. They get anxious, they ramble, and they need redirection. FIFE ensures our synthetic patients act like messy humans, while OPQRST ensures the AI doctor stays rigorously analytical.

4. What's Inside: The Four Configs

We split meddies-consultant into four configs. We didn't want to just dump raw text files, so we formatted everything as structured artifacts ready for training runs.

ConfigRowsUser messagesAssistant messages
english109,005826,308930,683
vietnamese58,064329,728386,082
RandomQA67,37267,37267,372
RandomQuestion61,16261,1620

The Vietnamese & English subsets are the core multi-turn dialogues. If you look at the messages-to-rows ratio, you'll see these aren't shallow, single-prompt interactions. They are deep, sustained conversations.

We included RandomQA for standard factual supervision. But the most interesting split is RandomQuestion. There are zero assistant messages here. It is a highly specialized set optimized entirely for training active-listening models, forcing the AI to figure out what to ask next, rather than just generating an answer.

5. The Generation Pipeline

Left to their own devices, models drift. To stop the LLMs from hallucinating generic medical advice, we locked them inside a strict, multi-agent pipeline.
Step order dictates everything. The tighter we make the upstream constraints, the better the final consultation flows.

Figure 1 — The meddies-consultant generation pipeline

👤

Persona Design

FIFE framework

🧬

Disease Scope

1,236-condition taxonomy

01

Context

Disease + FIFE persona

02

Grounding

OPQRST · Calgary-Cambridge

03

Generation

Multi-agent dialogue drafting

04

Review Gate

JSON normalisation · clinical safety filter

English

109,005 rows · multi-turn

Vietnamese

58,064 rows · multi-turn

RandomQA

67,372 rows · supervised

RandomQuestion

61,162 rows · active-listening

Step order is strict — tighter upstream constraints produce higher-fidelity consultations downstream.

  1. Context: We select a specific disease from a 1,236-condition taxonomy and build out a messy, realistic patient persona using FIFE.
  2. Grounding: We force those raw inputs through the OPQRST and Calgary-Cambridge frameworks.
  3. Generation: The models actually draft the dialogues, but they are completely boxed in by the rules set in the previous step.
  4. Review Gate*: A script normalizes the JSON and filters the output for clinical safety.
  5. Deployment: If it passes the gate, it gets split into the four datasets.

6. Quality Over Quantity

Right now, the trend is to brag about dataset size. We aren't doing that. If you want to train clinical reasoning, 50,000 rows of rigorous, high-fidelity dialogue will completely out-perform 10 million rows of hallucinated garbage.

We score every subset heavily. We test if the medical extraction is complete, if the patient actually sounds human, and if the conversation structure holds up over multiple turns.

One hard rule: this is synthetic data. Use it to teach models how to listen, extract information, and hold a conversation. Do not use it as a substitute for human medical judgment.

7. How to use it

You don't need to clean or format this; it is ready for your pipeline.

The most obvious use case is instruction tuning your base models. But it is also highly effective for red-teaming. You can take your existing hospital chatbot and test it against our messy, unpredictable FIFE personas to see how it handles a difficult patient. Finally, the structured output makes it great for training ambient scribes to write reliable SOAP notes.

8. What We're Building Toward

meddies-consultant isn't a one-off data dump. It's the baseline infrastructure we need to get AI working safely in Vietnamese hospitals. The math here is straightforward: if we can train models to reliably handle the charting, doctors can finally stop staring at screens and go back to looking at their patients.

9. Get Involved

Clinical AI fails when it's built in a silo. We need this tested in the wild.

Download the data. Run it through your pipeline. Try to break it. When you find where it fails, let us know so we can fix it.

Access Data on Hugging Face