When you build an LLM for medicine, the model can pass every clinical check and still be unsafe. If it repeats a patient name, a phone number, a record ID, or an address into a log, a prompt, or a downstream training set, trust breaks fast. De-identification is the constraint that sits underneath every other feature. The hard part is not tagging names in one tidy English note. It is keeping the same extraction behavior stable when the language changes, when the document format shifts from a clean FHIR payload to a messy plain-text dump, and when the prompt shape moves from a tagged document to a chat turn.
Most de-identification corpora are either real PHI, which carries its own privacy risk and cannot move between teams, or narrow single-language NER sets that teach a model the template rather than the task. Meddies PII is a public synthetic dataset built to attack that gap directly. It stresses one de-identification task across multilingual medical and administrative documents, English domain-transfer records, mixed training and evaluation views, translation views, and chat-style extraction supervision, all in one Hugging Face repo: huggingface.co/datasets/Meddies/meddies-pii. License is cc-by-nc-4.0.
The design argument: vary the surface, fix the target

Figure: How Meddies PII is built. Dynamic template axes and locale rules feed LLM generation, then QA and schema normalization, into the language, domain-transfer, split, GRPO, and translation release families.
The data is generated by LLMs, but the load-bearing decision is not "use an LLM." It is dynamic templating backed by QA and canonical normalization. A fixed note shape repeated across languages teaches the downstream model the shape. So the same task is varied along document type, paired document label, length bucket, surface format, edge-case brief, and locale-specific naming and address rules. The surface moves while the entity ontology stays still. That forces a model to learn extraction behavior instead of a template.
The generation axes verified in the language-specific family make the variation concrete.
| Axis | Verified scope | Why it matters |
|---|---|---|
| document_type | 16 categories in both vietnamese and english | changes the clinical or administrative context the extractor sees |
| document_label | 32 values in vietnamese, 16 in english | the human-readable label layer is localized instead of being frozen into one naming scheme |
| document_length | 3 buckets in both: SHORT, MEDIUM, LONG | prevents overfitting to one note size |
| text_format | 10 surface formats in both | forces robustness across structured and messy renderings |
| edge_case | 20 values in vietnamese, 10 in english | pushes the generator toward harder and less repetitive examples |
| label ontology | 7 canonical entity families across representative configs | keeps the target stable while the surface varies |
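The idea of varying the surface while fixing the target can be sketched in a few lines. This is a minimal illustration, not the authors' generator: the axis values are hypothetical placeholders sized to match the English column of the table, and `GenerationSpec` and `sample_spec` are names invented for this example.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholder values: the real category names are not listed
# in this article, so these stand-ins only match the counts in the table.
AXES = {
    "document_type": [f"doc_type_{i}" for i in range(16)],
    "document_length": ["SHORT", "MEDIUM", "LONG"],
    "text_format": [f"format_{i}" for i in range(10)],
    "edge_case": [f"edge_case_{i}" for i in range(10)],
}

# The seven canonical families stay fixed for every sampled surface.
CANONICAL_LABELS = (
    "address", "company_name", "email_address", "human_name",
    "phone_number", "id_number", "date",
)


@dataclass(frozen=True)
class GenerationSpec:
    document_type: str
    document_length: str
    text_format: str
    edge_case: str
    labels: tuple = CANONICAL_LABELS


def sample_spec(rng: random.Random) -> GenerationSpec:
    """Draw one value per surface axis; the label ontology is never sampled."""
    return GenerationSpec(**{axis: rng.choice(values) for axis, values in AXES.items()})


rng = random.Random(0)
specs = [sample_spec(rng) for _ in range(5)]
for spec in specs:
    print(spec.document_type, spec.document_length, spec.text_format, spec.edge_case)
```

Every spec differs on the surface axes, yet each one carries the same `labels` tuple, which is the property the table's last row describes.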
The target ontology is deliberately narrow: seven entity families rather than a long, brittle tag list. The families are address, company_name, email_address, human_name, phone_number, id_number, and date. The id_number family absorbs MRNs, SSNs, account numbers, licenses, device IDs, and similar administrative identifiers, so the schema stays small while still covering the identifiers that matter in clinical records. After generation, every record is reviewed and normalized into those seven keys before publishing.
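To make the normalization step concrete, here is a minimal sketch of folding fine-grained identifier tags into the seven canonical keys. The alias names (`mrn`, `ssn`, and so on) and the dict-of-lists output shape are assumptions made for illustration; the article only confirms the seven key names.

```python
CANONICAL_KEYS = (
    "address", "company_name", "email_address", "human_name",
    "phone_number", "id_number", "date",
)

# Hypothetical fine-grained tags. Only the canonical keys come from the text.
ALIASES = {
    "mrn": "id_number",
    "ssn": "id_number",
    "account_number": "id_number",
    "license_number": "id_number",
    "device_id": "id_number",
}


def normalize_labels(entities):
    """Map (value, tag) pairs onto the canonical keys.

    Duplicate values are dropped and tags outside the ontology are ignored.
    """
    out = {key: [] for key in CANONICAL_KEYS}
    for value, tag in entities:
        key = ALIASES.get(tag, tag)
        if key not in out:
            continue  # e.g. medical measurements, which are not PII labels
        if value not in out[key]:
            out[key].append(value)
    return out


normalized = normalize_labels([
    ("MRN-00123", "mrn"),
    ("Jane Doe", "human_name"),
    ("MRN-00123", "mrn"),          # duplicate, kept once
    ("120/80", "blood_pressure"),  # outside the ontology, dropped
])
print(normalized["id_number"])
```

Keeping every key present, even when its list is empty, means downstream code can rely on one fixed schema per record.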
What the repo actually contains
The repo is organized into release families, not one flat dump. Each family answers a different question about de-identification.
| Family | Rows | Configs | What it gives you |
|---|---|---|---|
| Language-specific tagged documents | 831,746 | vietnamese, english, chinese, japanese, korean, thai, french, german, spanish, portuguese, russian, indonesian, filipino, malay, burmese, laos, tamil | Single-language synthetic medical and administrative documents with tagged text, raw text, and canonical label JSON. |
| Domain-transfer English subsets | 63,160 | nvidia-health, nvidia-non-health | English structured and unstructured records with spans, domain, locale, and the same canonical label target. |
| Mixed training and evaluation bundles | 76,160 | train, eval, test | Task-oriented mixtures with source provenance for training, evaluation, and smoke tests. |
| Instruction / GRPO supervision | 7,100 | grpo-train, grpo-hard-train | Chat-style extraction prompts, normalized JSON answers, and raw source text. |
| Translation / adaptation view | 10,000 | vietnamese-translated | Vietnamese adaptation records with input, text, raw, label, and token counts. |
| BIOES token-classification view | 162,499 | pii-bioes | Derived no-duplicate Meddies-9 char-span records for BIOES training and validation: 161,999 train rows plus 500 stratified validation rows. |
These counts are not additive across the whole repo. Several task-oriented configs are derived views over the underlying source families, not independent corpora. The Vietnamese language config alone holds 45,921 rows. Each record carries the same three-part supervision: text with inline [value]<label> tags, raw with the tags stripped, and label as canonical JSON over the seven families.
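The relationship between the three fields can be sketched with a small parser that derives raw and label from a tagged string. This assumes the inline tags follow the `[value]<label>` pattern exactly, with no nested brackets, and that label is a mapping from family name to a list of values. The sample sentence is made up for this example.

```python
import re

# Matches [value]<label>, where value has no brackets and label is snake_case.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]<([a-z_]+)>")


def split_tagged(text):
    """Return (raw, label) for a string that uses inline [value]<label> tags."""
    label = {}
    for value, key in TAG_PATTERN.findall(text):
        values = label.setdefault(key, [])
        if value not in values:
            values.append(value)
    # Stripping the tags leaves only the surface value in place.
    raw = TAG_PATTERN.sub(lambda m: m.group(1), text)
    return raw, label


tagged = "Patient [Nguyen Van An]<human_name> called from [0901 234 567]<phone_number>."
raw, label = split_tagged(tagged)
print(raw)
print(label)
```

A round-trip check like this is a cheap way to confirm that a record's three fields agree before using it for training.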
One config deserves its own note. The pii-bioes view expands the taxonomy to nine labels, adding private_url and secret, and uses character spans instead of inline tags for BIOES token-classification work. A 2026-05-15 augmentation added 11,554 no-duplicate targeted rows from MiMo generation: 8,228 Vietnamese accepted, 2,768 English accepted, 451 Vietnamese repaired-reject, and 107 English repaired-reject. Rows over the 4,096-token LFM2.5 tokenizer limit are excluded, and 16 risky repaired candidates stay quarantined.
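For readers new to the BIOES scheme, the conversion from character spans to token tags can be sketched as follows. The `(start, end, label)` span layout with an exclusive end, the whitespace tokenizer, and the example sentence are all assumptions for illustration. The actual pii-bioes records and tokenizer may differ.

```python
import re


def spans_to_bioes(text, spans):
    """Tag whitespace tokens with BIOES labels from (start, end, label) spans.

    A token belongs to a span if their character ranges overlap.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        idx = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
        if not idx:
            continue
        if len(idx) == 1:
            tags[idx[0]] = f"S-{label}"  # single-token entity
        else:
            tags[idx[0]] = f"B-{label}"  # begin
            for i in idx[1:-1]:
                tags[i] = f"I-{label}"   # inside
            tags[idx[-1]] = f"E-{label}"  # end
    return [tok for tok, _, _ in tokens], tags


text = "Contact Jane Mary Doe at jane@example.org today"
name, email = "Jane Mary Doe", "jane@example.org"
spans = [
    (text.index(name), text.index(name) + len(name), "human_name"),
    (text.index(email), text.index(email) + len(email), "email_address"),
]
tokens, tags = spans_to_bioes(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A subword tokenizer would need the same overlap logic applied to its own offset mapping rather than to whitespace tokens.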
What the trained model shows, and where it lags
A companion extractor was tuned on this data, starting from LiquidAI/LFM2-350M, a small foundation model that fits consumer GPUs and a browser runtime. Training runs full SFT on multilingual extraction, then GRPO alignment with extraction-specific rewards, then exact-match evaluation. The scores below come from the model card, scored on the repo's own eval and test slices using entity-level set-based exact match on (value, label) pairs. They are the authors' reported numbers, not an independent benchmark.
| Metric | Dataset / split | Result | Notes |
|---|---|---|---|
| Entity F1 | Meddies/meddies-pii / eval | 0.8110 | Mixed-language validation slice |
| Precision | Meddies/meddies-pii / eval | 0.8112 | Exact-match entity scoring |
| Recall | Meddies/meddies-pii / eval | 0.8109 | Exact-match entity scoring |
| Entity F1 | Meddies/meddies-pii / test | 0.8380 | Held-out test slice |
| Precision | Meddies/meddies-pii / test | 0.8116 | Exact-match entity scoring |
| Recall | Meddies/meddies-pii / test | 0.8663 | Highest overall headline metric |
| Value hallucination | eval / test | 1.31% / 1.35% | Generated entity values not found in the input |
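The scoring described above, set-based exact match on (value, label) pairs plus the share of predicted values missing from the input, can be sketched as below. Micro-averaging over all records and a plain substring test for hallucination are assumptions; the model card may aggregate or match differently. The sample record is invented.

```python
def to_pairs(label):
    """Flatten {family: [values]} into a set of (value, family) pairs."""
    return {(value, key) for key, values in label.items() for value in values}


def exact_match_scores(gold_labels, pred_labels):
    """Micro-averaged precision, recall, and F1 over exact (value, label) pairs."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_labels, pred_labels):
        g, p = to_pairs(gold), to_pairs(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def hallucination_rate(raw_texts, pred_labels):
    """Fraction of predicted values that do not appear verbatim in the input."""
    total = missing = 0
    for raw, pred in zip(raw_texts, pred_labels):
        for value, _ in to_pairs(pred):
            total += 1
            missing += value not in raw
    return missing / total if total else 0.0


raws = ["Jane Doe, tel 555-0100"]
gold = [{"human_name": ["Jane Doe"], "phone_number": ["555-0100"]}]
pred = [{"human_name": ["Jane Doe"], "phone_number": ["555-0199"]}]

precision, recall, f1 = exact_match_scores(gold, pred)
rate = hallucination_rate(raws, pred)
print(precision, recall, f1, rate)
```

Exact match is strict by design: a phone number that is off by one digit counts as both a false positive and a false negative, and here it also counts as a hallucinated value.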
The headline test F1 of 0.838 hides a wide spread across entity types. Formatting regularity helps the model most. phone_number reaches F1 0.9484 and email_address 0.9252, because their surface form is rigid. company_name collapses to F1 0.3277, a known weak spot the card attributes to label-definition mismatch. Names and addresses sit in the middle, where boundary detection and naming style get messy.
| Entity type | F1 | Reading |
|---|---|---|
| phone_number | 0.9484 | Strongest class; formatting regularity helps |
| email_address | 0.9252 | Also strong due to rigid surface form |
| date | 0.8607 | Solid despite multilingual date variation |
| id_number | 0.8132 | Usable, but depends on locale formatting |
| address | 0.7952 | Harder because boundary detection is messy |
| human_name | 0.7587 | Sensitive to naming style and nested context |
| company_name | 0.3277 | Known weak spot from label-definition mismatch |
The language spread tells the same honest story. The model holds together across all 17 languages, but it is not flat. Malay leads at F1 0.8588, with Korean, Japanese, and Chinese close behind. The lower-resource slices lag: Russian at 0.7117 and Lao at 0.7077 sit at the bottom. Vietnamese, the default config and the project's home language, lands at 0.8251.
What it enables, and where it stops
Use Meddies PII upstream of model tuning, evaluator building, or span-normalization work. Good fits include multilingual de-identification across clinical and administrative formats, span-to-JSON normalization, domain-shift testing with the English non-health documents, and instruction or RL-style supervision. The internal consistency holds across the inspected configs: the parsed label JSON always normalizes to the same seven canonical keys, so different surfaces resolve to one target ontology.
The limits are stated plainly, and they matter. This is synthetic data, so it is not evidence about real hospital PHI distributions, prevalence, or compliance risk. The task-facing views (train, eval, test, the GRPO sets, and vietnamese-translated) are derived over other families, not independent corpora, so they should not be read as one clean benchmark. The seven-family target is intentionally narrow; a broader PII taxonomy needs extra mapping. On the model side, around 1.3% of generated values are hallucinated rather than copied from the input, nested entities are out of scope, and medical measurements such as blood pressure, labs, dosages, and ages are excluded from the label set by design. None of this certifies any downstream system as private or compliant. That validation still belongs on your own data.
That is the honest shape of the release. It does not claim to be a finished benchmark. It offers a repo that is internally legible: vary the surface hard, hold the target steady, and report where the model still struggles. For teams building healthcare AI under a real privacy constraint, that legibility is the point. The license is cc-by-nc-4.0; commercial use starts with a conversation at contact@meddies.ai.
