May 18, 2026 | Dataset | 5 min read

Meddies PII: A Synthetic Multilingual De-identification Dataset for Clinical Text

A machine-generated dataset for detecting and removing personal identifiers in clinical text across 17 languages, released for non-commercial research under CC-BY-NC-4.0.

Meddies Research

Clinical AI research at Meddies

When you build an LLM for medicine, the model can pass every clinical check and still be unsafe. If it repeats a patient name, a phone number, a record ID, or an address into a log, a prompt, or a downstream training set, trust breaks fast. De-identification is the constraint that sits underneath every other feature. The hard part is not tagging names in one tidy English note. It is keeping the same extraction behavior stable when the language changes, when the document format shifts from a clean FHIR payload to a messy plain-text dump, and when the prompt shape moves from a tagged document to a chat turn.

Most de-identification corpora are either real PHI, which carries its own privacy risk and cannot move between teams, or narrow single-language NER sets that teach a model the template rather than the task. Meddies PII is a public synthetic dataset built to attack that gap directly. It stresses one de-identification task across multilingual medical and administrative documents, English domain-transfer records, mixed training and evaluation views, translation views, and chat-style extraction supervision, all in one Hugging Face repo: huggingface.co/datasets/Meddies/meddies-pii. License is cc-by-nc-4.0.

The design argument: vary the surface, fix the target

How Meddies PII is built: dynamic template axes and locale rules feed LLM generation, then QA and schema normalization, into the language, domain-transfer, split, GRPO, and translation release families.

The data is generated by LLMs, but the load-bearing decision is not "use an LLM." It is dynamic templating backed by QA and canonical normalization. A fixed note shape repeated across languages teaches the downstream model the shape. So the same task is varied along document type, paired document label, length bucket, surface format, edge-case brief, and locale-specific naming and address rules. The surface moves while the entity ontology stays still. That forces a model to learn extraction behavior instead of a template.

The generation axes verified in the language-specific family make the variation concrete.

| Axis | Verified scope | Why it matters |
| --- | --- | --- |
| document_type | 16 categories in both vietnamese and english | changes the clinical or administrative context the extractor sees |
| document_label | 32 values in vietnamese, 16 in english | the human-readable label layer is localized instead of being frozen into one naming scheme |
| document_length | 3 buckets in both: SHORT, MEDIUM, LONG | prevents overfitting to one note size |
| text_format | 10 surface formats in both | forces robustness across structured and messy renderings |
| edge_case | 20 values in vietnamese, 10 in english | pushes the generator toward harder and less repetitive examples |
| label ontology | 7 canonical entity families across representative configs | keeps the target stable while the surface varies |
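To make the axis idea concrete, here is a minimal sketch of drawing one generation spec per document. The axis names come from the table above; the example values for document_type, text_format, and edge_case are hypothetical placeholders, and the assumption that axes combine freely is ours, not something the release states.

```python
import random

# Axis names follow the article; the values for document_type, text_format,
# and edge_case are hypothetical placeholders, not the real lists.
AXES = {
    "document_type": ["discharge_summary", "lab_report", "insurance_claim"],
    "document_length": ["SHORT", "MEDIUM", "LONG"],
    "text_format": ["fhir_json", "plain_text", "markdown_table"],
    "edge_case": ["abbreviated_name", "phone_with_extension"],
}


def sample_spec(rng: random.Random, locale: str) -> dict:
    """Draw one value per axis so consecutive prompts rarely share a shape."""
    spec = {axis: rng.choice(values) for axis, values in AXES.items()}
    spec["locale"] = locale
    return spec


def surface_combinations(axis_sizes: dict) -> int:
    """Upper bound on distinct surfaces if the axes combine freely (assumption)."""
    total = 1
    for size in axis_sizes.values():
        total *= size
    return total


rng = random.Random(0)
specs = [sample_spec(rng, "vietnamese") for _ in range(5)]

# Axis sizes reported for the english config. document_label is described as
# paired with document_type, so it is left out of the product.
english_sizes = {
    "document_type": 16,
    "document_length": 3,
    "text_format": 10,
    "edge_case": 10,
}
print(surface_combinations(english_sizes))  # 4800
```

Even under this rough upper bound, the English config alone spans thousands of distinct surfaces before locale rules are applied, which is the point of varying the surface while the label set stays fixed.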

The target ontology is deliberately narrow, with seven entity families rather than a long, brittle tag list: address, company_name, email_address, human_name, phone_number, id_number, and date. The id_number family absorbs MRNs, SSNs, account numbers, licenses, device IDs, and similar administrative identifiers, so the schema stays small while still covering the identifiers that matter in clinical records. After generation, every record is reviewed and normalized into those seven keys before publishing.
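A minimal sketch of that normalization step, assuming the canonical label is a JSON object mapping each family to a list of string values. The fine-grained tag names (mrn, ssn, and so on) and the list-of-strings shape are illustrative assumptions; the article only states that id_number absorbs those identifier kinds.

```python
import json

CANONICAL_FAMILIES = (
    "address", "company_name", "email_address", "human_name",
    "phone_number", "id_number", "date",
)

# Hypothetical fine-grained tags. The article only says that id_number
# absorbs MRNs, SSNs, account numbers, licenses, and device IDs.
TO_CANONICAL = {
    "mrn": "id_number",
    "ssn": "id_number",
    "account_number": "id_number",
    "license_number": "id_number",
    "device_id": "id_number",
}


def normalize(entities):
    """Collapse (value, tag) pairs into a dict keyed by the seven families."""
    label = {family: [] for family in CANONICAL_FAMILIES}
    for value, tag in entities:
        family = TO_CANONICAL.get(tag, tag)
        if family not in label:
            raise ValueError(f"unmapped tag: {tag}")
        if value not in label[family]:  # keep first-seen order, drop repeats
            label[family].append(value)
    return label


record = normalize([
    ("MRN-00123", "mrn"),
    ("Tran Thi B", "human_name"),
    ("MRN-00123", "mrn"),
])
print(json.dumps(record, ensure_ascii=False))
```

Raising on an unmapped tag, rather than silently dropping it, keeps a broader source taxonomy from leaking past the seven-key target unnoticed.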

What the repo actually contains

The repo is organized into release families, not one flat dump. Each family answers a different question about de-identification.

| Family | Rows | Configs | What it gives you |
| --- | --- | --- | --- |
| Language-specific tagged documents | 831,746 | vietnamese, english, chinese, japanese, korean, thai, french, german, spanish, portuguese, russian, indonesian, filipino, malay, burmese, laos, tamil | Single-language synthetic medical and administrative documents with tagged text, raw text, and canonical label JSON. |
| Domain-transfer English subsets | 63,160 | nvidia-health, nvidia-non-health | English structured and unstructured records with spans, domain, locale, and the same canonical label target. |
| Mixed training and evaluation bundles | 76,160 | train, eval, test | Task-oriented mixtures with source provenance for training, evaluation, and smoke tests. |
| Instruction / GRPO supervision | 7,100 | grpo-train, grpo-hard-train | Chat-style extraction prompts, normalized JSON answers, and raw source text. |
| Translation / adaptation view | 10,000 | vietnamese-translated | Vietnamese adaptation records with input, text, raw, label, and token counts. |
| BIOES token-classification view | 162,499 | pii-bioes | Derived no-duplicate Meddies-9 char-span records for BIOES training and validation: 161,999 train rows plus 500 stratified validation rows. |

These counts are not additive across the whole repo. Several task-oriented configs are derived views over the underlying source families, not independent corpora. The Vietnamese language config alone holds 45,921 rows. Each record carries the same three-part supervision: text with inline [value]<label> tags, raw with the tags stripped, and label as canonical JSON over the seven families.
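The three-part supervision can be illustrated with a small parser that recovers the raw text and the entity pairs from a tagged string. This is a sketch under two assumptions that the article does not confirm: tagged values never contain a closing bracket, and labels consist of word characters only. The sample sentence is invented.

```python
import re

# Assumes tagged values never contain "]" and labels are word characters only.
TAG = re.compile(r"\[([^\]]+)\]<(\w+)>")


def split_tagged(text):
    """Return (raw, entities) from text carrying inline [value]<label> tags."""
    entities = [(m.group(1), m.group(2)) for m in TAG.finditer(text)]
    raw = TAG.sub(lambda m: m.group(1), text)  # keep the value, drop the markup
    return raw, entities


tagged = "Patient [Nguyen Van A]<human_name> called from [0901 234 567]<phone_number>."
raw, entities = split_tagged(tagged)
print(raw)
print(entities)
```

If the Hugging Face `datasets` library is installed, a config such as vietnamese can presumably be loaded with `load_dataset("Meddies/meddies-pii", "vietnamese")`. The split names are not stated here, so inspect the returned object before indexing into it.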

One config deserves its own note. The pii-bioes view expands the taxonomy to nine labels, adding private_url and secret, and uses character spans instead of inline tags for BIOES token-classification work. A 2026-05-15 augmentation added 11,554 no-duplicate targeted rows from MiMo generation: 8,228 Vietnamese accepted, 2,768 English accepted, 451 Vietnamese repaired-reject, and 107 English repaired-reject. Rows over the 4,096-token LFM2.5 tokenizer limit are excluded, and 16 risky repaired candidates stay quarantined.
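For readers new to the BIOES scheme, the sketch below converts character spans into per-token tags. It uses plain whitespace tokenization and assumes end-exclusive, non-overlapping spans that align with token boundaries. The actual pii-bioes field names and tokenizer are not described here, so treat the span tuple format as hypothetical.

```python
import re


def bioes_tags(text, spans):
    """Tag whitespace tokens with BIOES labels from (start, end, label) spans.

    Assumes end-exclusive, non-overlapping spans aligned to token boundaries.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = f"S-{label}"      # single-token entity
        else:
            tags[inside[0]] = f"B-{label}"      # begin
            tags[inside[-1]] = f"E-{label}"     # end
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"          # inside
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


text = "Dr. Le Van C reviewed https://intranet.example/p/17 today"
name, url = "Le Van C", "https://intranet.example/p/17"
spans = [
    (text.index(name), text.index(name) + len(name), "human_name"),
    (text.index(url), text.index(url) + len(url), "private_url"),
]
tagged_tokens = bioes_tags(text, spans)
for token, tag in tagged_tokens:
    print(f"{token}\t{tag}")
```

A production pipeline would align spans to subword tokens from the model tokenizer instead of whitespace tokens, but the B/I/E/S assignment logic stays the same.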

What the trained model shows, and where it lags

A companion extractor was tuned on this data, starting from LiquidAI/LFM2-350M, a small foundation model that fits consumer GPUs and a browser runtime. Training runs full SFT on multilingual extraction, then GRPO alignment with extraction-specific rewards, then exact-match evaluation. The scores below come from the model card, scored on the repo's own eval and test slices using entity-level set-based exact match on (value, label) pairs. They are the authors' reported numbers, not an independent benchmark.

| Metric | Dataset / split | Result | Notes |
| --- | --- | --- | --- |
| Entity F1 | Meddies/meddies-pii / eval | 0.8110 | Mixed-language validation slice |
| Precision | Meddies/meddies-pii / eval | 0.8112 | Exact-match entity scoring |
| Recall | Meddies/meddies-pii / eval | 0.8109 | Exact-match entity scoring |
| Entity F1 | Meddies/meddies-pii / test | 0.8380 | Held-out test slice |
| Precision | Meddies/meddies-pii / test | 0.8116 | Exact-match entity scoring |
| Recall | Meddies/meddies-pii / test | 0.8663 | Highest overall headline metric |
| Value hallucination | eval / test | 1.31% / 1.35% | Generated entity values not found in the input |
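The scoring described above can be sketched as follows: entity-level, set-based exact match on (value, label) pairs, plus a hallucination check for predicted values that never appear in the input. How the model card aggregates across records is not stated, so the single-record example and helper names here are assumptions. The last lines only confirm that the reported test precision and recall are consistent with the reported F1.

```python
def prf(gold, pred):
    """Set-based exact match over (value, label) pairs."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def hallucination_rate(pred, source_text):
    """Share of predicted values that do not appear verbatim in the input."""
    if not pred:
        return 0.0
    missing = sum(1 for value, _ in pred if value not in source_text)
    return missing / len(pred)


source = "Contact Tran Thi B at b.tran@example.org or 0901 234 567."
gold = [
    ("Tran Thi B", "human_name"),
    ("b.tran@example.org", "email_address"),
    ("0901 234 567", "phone_number"),
]
pred = [
    ("Tran Thi B", "human_name"),
    ("b.tran@example.org", "email_address"),
    ("0901 234 568", "phone_number"),   # wrong digit: not in the input
    ("Tran", "company_name"),           # in the input, but wrong label
]

precision, recall, f1 = prf(gold, pred)
halluc = hallucination_rate(pred, source)
print(precision, recall, round(f1, 4), halluc)

# Consistency check on the reported test numbers: F1 is the harmonic mean of P and R.
p_test, r_test = 0.8116, 0.8663
f1_test = 2 * p_test * r_test / (p_test + r_test)
print(round(f1_test, 3))
```

Note that a wrong label on a real substring counts against precision but not against the hallucination rate, which is why the two numbers measure different failure modes.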

The headline test F1 of 0.838 hides a wide spread across entity types. Formatting regularity helps the model most. phone_number reaches F1 0.9484 and email_address 0.9252, because their surface form is rigid. company_name collapses to F1 0.3277, a known weak spot the card attributes to label-definition mismatch. Names and addresses sit in the middle, where boundary detection and naming style get messy.

| Entity type | F1 | Reading |
| --- | --- | --- |
| phone_number | 0.9484 | Strongest class; formatting regularity helps |
| email_address | 0.9252 | Also strong due to rigid surface form |
| date | 0.8607 | Solid despite multilingual date variation |
| id_number | 0.8132 | Usable, but depends on locale formatting |
| address | 0.7952 | Harder because boundary detection is messy |
| human_name | 0.7587 | Sensitive to naming style and nested context |
| company_name | 0.3277 | Known weak spot from label-definition mismatch |
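One quick way to see how much the headline number hides is to compute an unweighted (macro) average over the seven per-entity scores above. This is our own arithmetic on the reported figures, not a number from the model card, and it assumes the headline F1 is pooled over all entities rather than averaged per class.

```python
# Per-entity test F1 values as reported in the table above.
per_entity_f1 = {
    "phone_number": 0.9484,
    "email_address": 0.9252,
    "date": 0.8607,
    "id_number": 0.8132,
    "address": 0.7952,
    "human_name": 0.7587,
    "company_name": 0.3277,
}

# Unweighted mean: every entity type counts equally, however rare it is.
macro_f1 = sum(per_entity_f1.values()) / len(per_entity_f1)
spread = max(per_entity_f1.values()) - min(per_entity_f1.values())
weakest = min(per_entity_f1, key=per_entity_f1.get)

print(f"macro F1 = {macro_f1:.4f}")
print(f"spread   = {spread:.4f}")
print(f"weakest  = {weakest}")
```

The macro average lands near 0.776, well under the 0.838 headline, which suggests that a pipeline relying on company_name extraction should budget for far weaker performance than the aggregate implies.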

The language spread tells the same honest story. The model holds together across all 17 languages, but it is not flat. Malay leads at F1 0.8588, with Korean, Japanese, and Chinese close behind. The weakest slices lag well behind: Russian at 0.7117 and Lao at 0.7077 sit at the bottom. Vietnamese, the default config and the project's home language, lands at 0.8251.

What it enables, and where it stops

Use Meddies PII upstream of model tuning, evaluator building, or span-normalization work. Good fits include multilingual de-identification across clinical and administrative formats, span-to-JSON normalization, domain-shift testing with the English non-health documents, and instruction or RL-style supervision. The internal consistency holds across inspected configs: the parsed label JSON always normalizes to the same seven canonical keys, so different surfaces resolve to one target ontology.

The limits are stated plainly, and they matter. This is synthetic data, so it is not evidence about real hospital PHI distributions, prevalence, or compliance risk. The task-facing views (train, eval, test, the GRPO sets, and vietnamese-translated) are derived over other families, not independent corpora, so they should not be read as one clean benchmark. The seven-family target is intentionally narrow; a broader PII taxonomy needs extra mapping. On the model side, around 1.3% of generated values are hallucinated rather than copied from the input, nested entities are out of scope, and medical measurements such as blood pressure, labs, dosages, and ages are excluded from the label set by design. None of this certifies any downstream system as private or compliant. That validation still belongs on your own data.

That is the honest shape of the release. It does not claim to be a finished benchmark. It aims to be a repo that is internally legible: vary the surface hard, hold the target steady, and report where the model still struggles. For teams building healthcare AI under a real privacy constraint, that legibility is the point. The license is CC-BY-NC-4.0; commercial use starts with a conversation at contact@meddies.ai.