In healthcare, data is unusually sensitive: every real record is tied to a patient's private information, so it can rarely be used freely to train a model. Many teams answer this by having language models generate synthetic data instead, and models are now capable enough to do it. We took the same route and started building our own synthetic medical datasets.

While building the Meddies Consultant dataset, though, we ran into a missing piece. To generate consultations that are both diverse and realistic, we needed a rich set of patient personas so that every consultation had a concrete person behind it. One of the better options at the time was Nemotron-Personas-USA, but it is built around Americans, written in English, and describes life in fairly generic terms, so it can't capture what makes a Vietnamese patient specific, and using it as-is would introduce heavy bias. On top of that, NVIDIA's dataset is missing much of the clinical context a consultation needs, such as the medications a patient is on or their family history.

Meddies Persona was built to solve exactly that. It is a synthetic dataset of 150,000 Vietnamese patient personas, used as context before any consultation or clinical note is generated. Each row is not a finished conversation but a specific person, with their age, occupation, living situation, conditions, current medications, care-seeking habits, and their own way of describing what is wrong.

Every patient is a portrait in their own right, and deserves care shaped to who they are.

How we build a patient persona

We always build the patient first, and only then let the next step generate a consultation from that patient. That order genuinely matters. If you let a model invent the patient while it writes the dialogue, the result sounds like a patient but isn't really anyone. And when you need to generate many of them from a single generic prompt, the personas drift toward a few familiar types: each looks reasonable on its own, but the whole set clusters around a common mould. So we bind every persona to a detailed schema, forcing each patient to differ on the things that actually make a person.

A persona is built in two steps. First, we generate the structured background (demographics, socioeconomic situation, medical history, care-seeking habits) from Vietnam's published population data, along with details that are distinctly Vietnamese, like folk-medicine beliefs. Then a language model writes the free-text parts, such as the chief complaint, history, social barriers, and how the patient communicates, staying close to the background already built so the narration matches the person behind it. Once both steps are done, each persona is stamped with its generation time, the model used, and the schema version, so it can be traced and reproduced when needed.
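The flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of the actual pipeline: the field names, the placeholder weights, the `write_narrative` stub that stands in for the language model, and the `SCHEMA_VERSION` value are all hypothetical.

```python
import random
from datetime import datetime, timezone

SCHEMA_VERSION = "0.0-example"  # hypothetical version label

# Placeholder weights for illustration only, not published statistics.
REGION_WEIGHTS = {"North": 0.4, "Central": 0.3, "South": 0.3}
HERBAL_USE_WEIGHTS = {"never": 0.5, "sometimes": 0.3, "often": 0.2}


def sample(weights, rng):
    """Draw one key from a {value: weight} table."""
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=1)[0]


def build_background(rng):
    """Step 1: structured fields drawn from distribution tables."""
    return {
        "age": rng.randint(0, 100),  # uniform here, purely for simplicity
        "region": sample(REGION_WEIGHTS, rng),
        "herbal_medicine_use": sample(HERBAL_USE_WEIGHTS, rng),
    }


def write_narrative(background):
    """Step 2 stand-in: a real pipeline would call a language model here."""
    return f"Patient aged {background['age']} from the {background['region']}."


def build_persona(rng, model_name="example-model"):
    """Run both steps, then attach the provenance stamp."""
    background = build_background(rng)
    return {
        **background,
        "narrative": write_narrative(background),
        # Provenance stamp so the row can be traced later.
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "schema_version": SCHEMA_VERSION,
    }


persona = build_persona(random.Random(0))
print(persona)
```

Passing a seeded `random.Random` keeps the structured step repeatable, which is one simple way to make a row reproducible alongside the stamp.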

The structured fields from the first step group by function, each answering a question a clinician would care about.

Field group	What it's for
Demographics	Age, sex, ethnicity, province, regional accent. The base every other inference builds on.
Economic & social	Education, occupation, insurance, household size, who cares for them when ill. Decides whether a treatment plan is feasible for the patient.
Lifestyle & medical history	Smoking, alcohol, chronic conditions, allergies, family history. The risk factors that shape the clinical hypothesis.
Environmental exposure	Water source, air quality, pesticide exposure, seasonal field-burning smoke. Distinctly Vietnamese risks that Western data rarely carries.
Care-seeking behavior	Health literacy, herbal-medicine habits, self-medication tendency. Decides how the patient takes advice.
Mental health & social support	Stress level, stressors, family network. The human context a chart usually skips.
Narrative for the model	Reason for the visit and current history, written in natural language to use directly as input.
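One way to picture these groups is as a typed record. The sketch below covers only a few of them and is an assumption about shape: the class and field names are illustrative and may not match the dataset's actual columns.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Demographics:
    age: int
    sex: str
    ethnicity: str
    province: str


@dataclass
class MedicalHistory:
    chronic_conditions: list[str] = field(default_factory=list)
    current_medications: list[str] = field(default_factory=list)


@dataclass
class CareSeeking:
    health_literacy: str = "unknown"
    uses_herbal_medicine: bool = False


@dataclass
class Persona:
    demographics: Demographics
    medical_history: MedicalHistory
    care_seeking: CareSeeking
    chief_complaint: str = ""


# Values taken from the example persona shown later in this post.
example = Persona(
    demographics=Demographics(36, "female", "Hoa", "Bình Dương"),
    medical_history=MedicalHistory(
        chronic_conditions=["peptic ulcer", "acid reflux"],
        current_medications=["esomeprazole", "domperidone"],
    ),
    care_seeking=CareSeeking(health_literacy="low", uses_herbal_medicine=True),
    chief_complaint="Stomach pain and bloating, worse after eating",
)
print(asdict(example)["demographics"])
```

Grouping fields this way means a generator cannot skip a group silently: a persona without demographics or medical history simply fails to construct.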

Put those groups together and you get a persona. Below are four examples from the dataset.

Four personas from Meddies Persona

Age 36, female, Hoa ethnicity, Bình Dương

Reason for the visit

Stomach pain and bloating, worse after eating

What the persona records

Background
Ethnicity: Hoa, speaks Teochew
Lives in: Peri-urban Thủ Dầu Một, Bình Dương
Household: Married, household of 5

Daily life
Work: Farmer

Health
Conditions: Peptic ulcer, acid reflux
On: Esomeprazole, Domperidone, plus herbal medicine

Beliefs & communication
Health literacy: Low

Synthetic personas, chosen to illustrate.

How diverse the dataset is

A dataset can be richly detailed and still circle around a handful of similar patients, so we looked at the distributions that matter most across all 150,000 personas.

Personas: 150,000
Age range: 0–100
Ethnic groups: 54
Condition types: 124

Region

North 39.6%

Central 30.2%

South 30.2%

Age

0–17 17.5%
18–29 12.0%
30–44 14.9%
45–59 14.9%
60–74 14.9%
75+ 25.8%

Chronic conditions per person

32.5%

22.1%

16.2%

12.0%

17.1%

Ethnicity

Kinh 62.6%
Khmer 6.3%
Mường 4.6%
Thái 4.1%
Tày 3.6%
Hmong 3.5%
Nùng 3.1%
Dao 2.2%
Hoa 2.0%
45 more groups 8.0%

Most common conditions

Hypertension 8.6%
Dyslipidemia 6.1%
Coronary disease 5.7%
Refractive error 4.9%
Peptic ulcer 4.7%
Anxiety disorder 4.5%
Allergic rhinitis 4.2%
Stroke 4.2%
Gingivitis 4.1%
Type 2 diabetes 3.6%

To keep the dataset both varied and grounded, we reference figures published by the Vietnamese government, the World Health Organization, and the Ministry of Health on demographics and epidemiology. The goal isn't to reproduce the country's exact disease prevalence, but to simulate Vietnamese patients as close to real as the data allows.
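If you want to run the same kind of check on your own sample, tallying one field's shares is enough to start. A minimal sketch, assuming each row is a dict and that the region sits under a `region` key (an assumed name, not necessarily the dataset's column):

```python
from collections import Counter


def shares(rows, key):
    """Return {value: percentage} for one field, rounded to one decimal."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {value: round(100 * n / total, 1) for value, n in counts.items()}


# Toy rows for illustration; real rows would come from the dataset.
rows = [
    {"region": "North"},
    {"region": "North"},
    {"region": "Central"},
    {"region": "South"},
]
print(shares(rows, "region"))
```

The resulting shares can then be set side by side with whichever reference figures you are targeting.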

What to use it for

The dataset fits work that needs many different kinds of patient: building synthetic consultations, generating intake and history notes, simulating triage, or testing a prompt across a range of Vietnamese patient situations. Each persona is a starting point, not a finished consultation.

You can download the dataset and try it in just a few lines.

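```python
from datasets import load_dataset

# Download the train split from the Hugging Face Hub
ds = load_dataset("Meddies/meddies-persona-vie", split="train")

print(ds)                     # row count and column names
print(ds[0].keys())           # fields available on a single persona
print(ds[0]["demographics"])  # demographics of the first persona
```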
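From there, a persona can be rendered into context for a generation step. The sketch below is only one possible shape: `persona_to_prompt` and the keys it reads (`age`, `occupation`, `conditions`, `chief_complaint`) are hypothetical, so check the real field names on a loaded row before adapting it.

```python
def persona_to_prompt(persona):
    """Render a persona dict as plain-text context for a consultation prompt."""
    conditions = ", ".join(persona.get("conditions", [])) or "none recorded"
    lines = [
        "You are role-playing the patient described below.",
        f"Age: {persona.get('age', 'unknown')}",
        f"Occupation: {persona.get('occupation', 'unknown')}",
        f"Known conditions: {conditions}",
        f"Reason for the visit: {persona.get('chief_complaint', 'unknown')}",
        "Stay consistent with these facts throughout the consultation.",
    ]
    return "\n".join(lines)


# Hand-written example based on the persona shown earlier in this post.
persona = {
    "age": 36,
    "occupation": "farmer",
    "conditions": ["peptic ulcer", "acid reflux"],
    "chief_complaint": "Stomach pain and bloating, worse after eating",
}
print(persona_to_prompt(persona))
```

Keeping the persona as fixed context and generating the dialogue in a separate call preserves the patient-first order described above.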

Scope of the release

This is synthetic data, and we don't treat it as real or representative. It doesn't reflect disease rates or the true distribution of the population, and it isn't a clinical tool. Its value is giving you many kinds of patient to start from; whether a scenario ends up realistic still depends on how you design it and review it downstream.

The dataset is released under cc-by-nc-4.0 at Meddies Persona. For commercial use, write to us at [email protected]. And if you come across an implausible persona, an unrealistic combination, or a piece of Vietnamese context we're missing, send it back to us. That's how the next version gets better than this one.

Meddies Persona: Synthetic Vietnamese Patient Profiles
