The best clinical conversational data would be real doctor-patient transcripts. You cannot use them. Medical records carry protected personal information, and the law rightly keeps them closed. So engineers end up sitting on capable algorithms with no conversational data to train them on.
The wrong way around the wall
The usual workaround is to scrape whatever is loosely available and scrub it afterwards. But de-identification is hard, it is never perfect, and a single missed detail can be enough to re-identify a patient. Building a privacy program on top of real records means spending forever proving a negative: that nothing sensitive leaked.
We took the opposite path. Instead of removing private information after the fact, we never introduce it in the first place.
Synthetic, but disciplined
meddies-consultant is generated by reasoning models from the ground up. Every persona, every symptom, every turn of dialogue is synthetic. There is no real patient behind any record, so there is nothing to de-identify and nothing to leak.
That does not mean the data is loose. Synthetic generation done carelessly produces fluent nonsense: agreeable, medically shallow, structurally wrong. We avoid that by boxing the generator inside clinical frameworks and a review gate that filters for clinical safety. The control we gain over accuracy is the upside of synthesis. The privacy guarantee is the floor it stands on.
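To make the idea of a framework plus a review gate concrete, here is a minimal sketch in Python. It is not the meddies-consultant pipeline. The section names, the phrase list, and every function name are hypothetical, and a real review gate would likely rely on clinical reviewers or a model rather than string matching.

```python
from dataclasses import dataclass

# Hypothetical framework: sections a synthetic consultation must cover, in order.
REQUIRED_SECTIONS = ["chief_complaint", "history", "assessment", "plan"]

# Hypothetical safety rules: phrases this toy reviewer rejects outright.
UNSAFE_PHRASES = ["stop taking your medication", "no need to see a doctor"]


@dataclass
class Dialogue:
    sections: dict  # section name -> synthetic text, in generation order


def follows_framework(dialogue: Dialogue) -> bool:
    """True if every required section is present, in order, and non-empty."""
    present = [name for name in dialogue.sections if name in REQUIRED_SECTIONS]
    if present != REQUIRED_SECTIONS:
        return False
    return all(dialogue.sections[name].strip() for name in REQUIRED_SECTIONS)


def passes_safety_review(dialogue: Dialogue) -> bool:
    """Toy stand-in for a clinical safety review: reject known unsafe phrases."""
    text = " ".join(dialogue.sections.values()).lower()
    return not any(phrase in text for phrase in UNSAFE_PHRASES)


def review_gate(candidates: list) -> list:
    """Keep only candidates that satisfy both the framework and the safety check."""
    return [
        d for d in candidates
        if follows_framework(d) and passes_safety_review(d)
    ]
```

The design point is that rejection is cheap: a synthetic record that fails either check is simply discarded and regenerated, which is not an option when the record belongs to a real patient.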
What "by construction" buys you
Privacy by construction is stronger than privacy by cleanup. A cleanup pipeline is only as good as its worst miss. A construction that never touches real data has no miss to make.
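A rough back-of-the-envelope calculation shows why the worst miss dominates. The sketch below assumes each real identifier is scrubbed independently with the same recall, which is a simplification, and the numbers in the usage note are illustrative rather than measured.

```python
def leak_probability(recall: float, identifiers: int) -> float:
    """Chance that at least one identifier survives scrubbing.

    Assumes each identifier is caught independently with probability `recall`.
    """
    return 1.0 - recall ** identifiers
```

Under these assumptions, a scrubber with 99.9% recall run over 10,000 real identifiers almost certainly lets at least one through: `leak_probability(0.999, 10_000)` comes out above 0.99. A corpus that contains no real identifiers at all gives `leak_probability(0.999, 0)`, which is exactly 0 regardless of recall.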
For Vietnamese hospitals, where patient-data handling is both a legal obligation and a trust question, that distinction matters. The dataset that trains a model can be opened, inspected, and shared without putting a single real patient at risk. That is the point.
