In medicine, an AI model does not need to know who the patient is to reason clinically. It needs the symptoms, the drugs, the lab results, and the course of treatment.

The problem is that in a medical record those facts usually sit right next to the patient's name, a record ID, an insurance number, an address, or a relative's phone. To a doctor, all of it is one complete record. To an AI system, the identifiers have to be separated out before the data moves downstream. That step is called de-identification.

If it goes wrong, identifiers can leak into a system log, an evaluation set, a training set, or an outside vendor's infrastructure. In healthcare that is not a small technical bug. It is a leak of personally identifiable information (PII) and protected health information (PHI), and in many settings it is prohibited by policy, contract, and regulation.

A clinical AI system can reason about medicine very well and still be impossible to deploy in a hospital if it cannot protect patient identifiers. The hard part is not finding one name in one tidy English note. It is keeping the same extraction behavior stable when the language changes, when a clean FHIR payload becomes a messy plain-text dump, and when a tagged document becomes a chat turn.

Meddies PII is built for that problem. It is an open research model for multilingual medical de-identification. We release the model, the training data, and the training process so that the handling of medical data can be checked in the open.

Why public PII data isn't enough for medicine

A de-identification model only works well when it has been trained on the kind of data it will actually meet. And medical data in a hospital rarely arrives as one clean paragraph.

Data from the same patient can live in nursing notes, lab forms, administrative records, or JSON/XML files exported by hospital systems. Because these records contain protected health information (PHI), real examples can almost never be shared openly for the community to inspect.

Open entity-recognition datasets like NVIDIA Nemotron-PII or ai4privacy/pii-masking-300k are easier to share, but they are usually built from internet text, chat, or generic documents. A model trained on them can learn to find names, dates, phone numbers, and addresses in ordinary prose, but it rarely has to separate identifiers from the clinical facts that must stay. That is why general PII recognition is not enough for medical de-identification.

The example below shows that difference. On the left is a sample from a general PII set, where the main task is to find identifiers in administrative prose. On the right is a Meddies PII sample: a nursing care form stored as JSON, where identifiers sit next to drugs, vitals, and follow-up notes that must stay after de-identification.

A general administrative PII sample compared with a nursing care note stored as JSON from Meddies PII. The Meddies sample places identifiers next to drugs, vitals, and symptoms that must survive de-identification.

In real hospitals, the data is messier: Vietnamese names and foreign names, abbreviated addresses, insurance numbers, record IDs, department names, a relative's phone number, dates of birth, admission dates, lab values, and dosages can all sit inside a few rushed lines written during diagnosis. A de-identification model trained only on generic data will have a hard time separating the identifiers from the clinical facts that need to stay.

Studies on multilingual clinical de-identification transfer report the same problem. A system that learns mostly from large English corpora tends to hold up poorly when it meets mixed-language or lower-resource clinical text. For Meddies the practical question is how to build a data foundation open enough for others to check, without ever touching real patient records.

How we build the Meddies PII data

Real hospital data cannot be shared widely, so synthetic data is a reasonable choice. But in medical de-identification, more samples do not automatically make a good training set. Each sample has to teach the model the right behavior: find the identifier span, keep the clinical facts, and return a stable label.

Karpathy pointed out a risk of LLM-generated synthetic data: individual samples can look reasonable, while the whole dataset clusters around a few sentence shapes and familiar templates. For this task, that means the model sees a lot of text but still only sees a narrow slice of medical records.

With one fixed prompt, even at high temperature, generated data still falls back to familiar patterns: the same note type, the same length, the same ways of naming people, and the same few record ID formats. The dataset looks bigger, but real diversity does not grow much; the risk is model collapse, where the model learns repeated templates instead of the breadth of medical data.

So when we build Meddies PII, we use dynamic prompting: each generation uses a different input, varying language, document type, document label, length, format, edge case, and conventions such as date, address, phone number, or record ID formats.
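A minimal sketch of dynamic prompting, assuming one value is drawn per axis for each generation call. The axis names follow this article. The template wording, the `document_type`, `text_format`, and `edge_case` values, and the helper names `sample_config` and `build_prompt` are illustrative assumptions, not the released generation code.

```python
import random

# Axis names come from the article. The values are illustrative placeholders,
# except the three document lengths, which the article names explicitly.
AXES = {
    "language": ["Vietnamese", "English", "Thai"],
    "document_type": ["nursing_note", "lab_report", "discharge_summary"],
    "text_format": ["prose", "json", "messy_plain_text"],
    "document_length": ["SHORT", "MEDIUM", "LONG"],
    "edge_case": ["none", "abbreviated_address", "phone_split_by_symbols"],
}

# Hypothetical prompt template; the real wording is not given in the article.
TEMPLATE = (
    "Write a {document_length} synthetic {document_type} in {language}, "
    "formatted as {text_format}. Edge case to include: {edge_case}. "
    "Use only invented patient details."
)


def sample_config(rng: random.Random) -> dict[str, str]:
    """Draw one value per axis so each generation call gets a different input."""
    return {axis: rng.choice(values) for axis, values in AXES.items()}


def build_prompt(config: dict[str, str]) -> str:
    """Fill the template with the sampled axis values."""
    return TEMPLATE.format(**config)


rng = random.Random(0)  # seeded so a run can be reproduced
configs = [sample_config(rng) for _ in range(5)]
prompts = [build_prompt(c) for c in configs]
```

Sampling the configuration outside the LLM moves the source of variety from the sampling temperature to the prompt itself, which is the point of the approach described above.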

The goal of the dataset is also the guiding principle of Meddies Research.

Leave No One Behind

Not just more data, but real breadth: many languages, many contexts, and many document formats, so that no setting or patient group is left behind.

Figure: the Meddies PII data pipeline. Prompt template → generate → Raw data → rule filter → LLM review → Filtered data → pass → Meddies PII.

The axes varied on each generation:

Axis	Scope	Purpose
`language`	17 languages	So the model serves more languages, and more patients.
`document_type`	16 categories	So the model handles many document contexts, not just one.
`document_label`	48 distinct values	So labels stay close to each language instead of one frozen scheme.
`document_length`	3 levels: `SHORT`, `MEDIUM`, `LONG`	So the model doesn't lock onto one note length.
`text_format`	10 formats	So the model reads every shape hospital data comes in, from prose to messy notes.
`edge_case`	30 values	So the model holds up on rare, hard inputs.
`label`	7 entity families: `human_name`, `date`, `id_number`, `phone_number`, `email_address`, `address`, `company_name`	So the extraction target stays stable as language and format change.

These seven labels follow the PII/PHI entity groups used in de-identification references such as Microsoft Azure's PII entity categories and the HHS guidance on de-identifying protected health information. The aim is to cover the identifiers that matter most while keeping the label set small enough to stay stable across many languages and formats.

Because this is medical data, not every sensitive field should be removed. Age, blood pressure, lab results, and dosages carry clinical meaning and must be kept. If a de-identification model erases creatinine 86 µmol/L or metformin 500 mg, the processed record loses its value for everything downstream.

Miss a date of birth and a patient's information leaks. Erase a dosage and a clinical fact disappears. De-identification is not about removing as much as possible. It is about separating the identifiers and keeping the clinical record intact.
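To make the distinction concrete, here is a small sketch of how extracted spans could drive redaction while the clinical facts stay in place. The `{"label", "value"}` entity shape, the `redact` helper, and the sample note with its invented patient are assumptions for illustration. Replacing by value also rewrites every other occurrence of the same string, which a production pipeline would need to decide on deliberately.

```python
def redact(text: str, entities: list[dict[str, str]]) -> str:
    """Replace each extracted identifier value with a [LABEL] placeholder."""
    # Replace longer values first so a short value cannot split a longer one.
    for ent in sorted(entities, key=lambda e: len(e["value"]), reverse=True):
        text = text.replace(ent["value"], f"[{ent['label'].upper()}]")
    return text


# Invented note: two identifiers next to two clinical facts.
note = "Nguyen Van An, DOB 12/03/1961, creatinine 86 µmol/L, metformin 500 mg."
entities = [
    {"label": "human_name", "value": "Nguyen Van An"},
    {"label": "date", "value": "12/03/1961"},
]
redacted = redact(note, entities)
```

Here `redacted` keeps the creatinine value and the metformin dose untouched, and only the name and the date of birth become `[HUMAN_NAME]` and `[DATE]`.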

How we control data errors

After generation, each sample goes through a validation step that combines rule-based checks with LLM review.

Step	What it checks	Error it is meant to catch
Rule-based check	Sample structure, text length, entity count, allowed labels, and repeated content.	Wrong format, missing entities, odd labels, text that is too short or too long, or abnormal repetition.
Language check	Whether the text contains scripts or language cues that do not match the generation setup.	A sample tagged as one language but carrying strong signs of another.
Extracted-value check	Every labeled value must appear in the original text, not outside it or rewritten.	Out-of-source values, rewritten values, empty values, or over-long values.
Per-label check	Label-specific rules for names, dates, IDs, phone numbers, email addresses, addresses, and organizations.	Drugs, dosages, lab results, vitals, age, titles, or standalone times mistaken for identifiers.
LLM review	An LLM reads the original text with the draft labels to keep valid values, add missed spans, and remove incorrect ones.	Missed entities, wrong labels, repeated values, or clinical facts mistaken for identifiers.
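The rule-based rows of the table can be sketched as one function that returns a list of violations. This is an assumption-heavy illustration: the entity shape, the `check_sample` name, and the unit-based regex for spotting clinical values are hypothetical, and only the seven allowed labels come from this article.

```python
import re

# The seven entity families named in the article.
ALLOWED_LABELS = {
    "human_name", "date", "id_number", "phone_number",
    "email_address", "address", "company_name",
}

# Hypothetical heuristic: a digit followed by a clinical unit suggests a
# dosage, lab value, or vital sign rather than an identifier.
CLINICAL_VALUE = re.compile(r"\d\s*(mg|ml|mmhg|mmol/l|bpm)\b", re.IGNORECASE)


def check_sample(text: str, entities: list[dict[str, str]]) -> list[str]:
    """Return the rule violations for one sample; an empty list means it passes."""
    errors = []
    for ent in entities:
        label, value = ent.get("label", ""), ent.get("value", "")
        if label not in ALLOWED_LABELS:
            errors.append(f"unknown label: {label!r}")
        if not value.strip():
            errors.append("empty value")
        elif value not in text:
            errors.append(f"value not found verbatim in text: {value!r}")
        elif CLINICAL_VALUE.search(value):
            errors.append(f"clinical fact labeled as {label}: {value!r}")
    return errors


# Invented sample text with one good draft and three faulty ones.
text = "Patient Tran Thi B, MRN 00123, metformin 500 mg twice daily."
good = [{"label": "human_name", "value": "Tran Thi B"}]
bad = [
    {"label": "id_number", "value": "500 mg"},           # dosage, not an ID
    {"label": "human_name", "value": "Tran Thi Binh"},   # rewritten value
    {"label": "drug", "value": "metformin"},             # label outside the set
]
good_errors = check_sample(text, good)
bad_errors = check_sample(text, bad)
```

The good draft produces no violations, while each entry in `bad` trips exactly one rule.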

In this pipeline, rule-based checks and LLM review handle two different groups of errors. The rules handle errors with clear signatures, such as malformed samples, odd labels, or values that do not appear in the original text. The LLM reviews cases that require context, such as missing entities, over-broad labels, or missed organization names.
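The language check is another error with a clear signature. One way it could be approximated, as a rough sketch, is to count letters by Unicode script and flag samples where too many fall outside the scripts expected for the tagged language. The `EXPECTED_SCRIPTS` mapping, the 0.3 threshold, and both function names are assumptions. The threshold is loose on purpose, because clinical text in any language legitimately contains Latin drug names and units.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from a language tag to the Unicode script prefixes
# expected in its text; a full setup would cover all 17 languages.
EXPECTED_SCRIPTS = {
    "Vietnamese": {"LATIN"},
    "Thai": {"THAI"},
    "Korean": {"HANGUL"},
    "Russian": {"CYRILLIC"},
}


def script_counts(text: str) -> Counter:
    """Count letters by the first word of their Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def language_mismatch(text: str, language: str, max_foreign: float = 0.3) -> bool:
    """Flag a sample when too many letters fall outside the expected scripts."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return True  # a sample with no letters at all is treated as failed
    foreign = sum(n for script, n in counts.items()
                  if script not in EXPECTED_SCRIPTS[language])
    return foreign / total > max_foreign
```

A script count cannot tell Vietnamese from Indonesian, since both use Latin letters, so a check like this only catches the coarse cases and leaves the rest to the LLM review.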

Because we cannot control every language perfectly, we expect the dataset to still contain errors. That is why we publish the full dataset for public review.

If you use it and find remaining errors, we would welcome feedback.

A model small enough to run in the hospital

Alongside the data, we trained a small extractor of the same name, based on LiquidAI/LFM2-350M.

We chose this model because it has only 350 million parameters and uses short convolution layers, which makes it light enough to run on a CPU. A hospital can self-host it on existing hardware without an expensive GPU.

At that size it can even run directly in the browser, as the live demo shows.

How we train the model

Training goes through two steps. First, supervised fine-tuning (SFT): the model learns from the text-and-label pairs in the dataset, getting used to reading a medical document and returning the right identifier spans as JSON.
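As an illustration of what a text-and-label pair might look like, the sketch below serializes the identifier spans as a JSON target and parses a completion back. The `prompt`/`completion` keys, the `{"entities": [...]}` schema, and both helper names are assumptions; the released dataset may use a different layout.

```python
import json


def make_sft_example(text: str, entities: list[dict[str, str]]) -> dict[str, str]:
    """Pair a document with its target: the identifier spans serialized as JSON."""
    target = json.dumps({"entities": entities}, ensure_ascii=False)
    return {"prompt": text, "completion": target}


def parse_completion(completion: str) -> list[dict[str, str]]:
    """Parse model output; malformed output yields an empty list instead of raising."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    entities = data.get("entities") if isinstance(data, dict) else None
    if not isinstance(entities, list):
        return []
    # Keep only well-formed entries that carry both a label and a value.
    return [e for e in entities
            if isinstance(e, dict) and {"label", "value"} <= e.keys()]


# Invented document and label for illustration.
example = make_sft_example(
    "Patient Vo Thi F, MRN 00123.",
    [{"label": "human_name", "value": "Vo Thi F"}],
)
round_trip = parse_completion(example["completion"])
```

Passing `ensure_ascii=False` keeps names in non-Latin scripts readable in the target instead of turning them into escape sequences, which matters for a multilingual extractor that is expected to copy values verbatim.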

Second, a reinforcement-learning step with GRPO (Group Relative Policy Optimization): the model is rewarded for extracting correctly and penalized for inventing values, mislabeling, or missing identifiers.
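A sketch of how such a reward could be shaped, followed by the group-relative normalization that gives GRPO its name. The penalty weights, the `(label, value)` entity representation, and the function names are assumptions for illustration, not the reward used to train the released model.

```python
from statistics import mean, pstdev

Entity = tuple[str, str]  # (label, value)


def reward(text: str, predicted: set[Entity], gold: set[Entity]) -> float:
    """Score one completion; the weights are illustrative only."""
    gold_values = {value for _, value in gold}
    predicted_values = {value for _, value in predicted}
    score = 0.0
    for label, value in predicted:
        if (label, value) in gold:
            score += 1.0    # correct extraction
        elif value not in text:
            score -= 1.0    # invented value that is absent from the source
        elif value in gold_values:
            score -= 0.5    # right span, wrong label
        else:
            score -= 0.25   # spurious span, for example a clinical fact
    # Missed identifiers: gold values that no prediction covers.
    score -= 0.5 * sum(1 for v in gold_values if v not in predicted_values)
    return score


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


# Invented example: four sampled completions for the same note.
text = "Le Minh C, phone 0901 234 567, BP 120/80."
gold = {("human_name", "Le Minh C"), ("phone_number", "0901 234 567")}
group = [
    gold,                                   # everything right
    {("human_name", "Le Minh C")},          # misses the phone number
    gold | {("id_number", "120/80")},       # flags a vital sign
    {("human_name", "Le Minh D")},          # invents a name
]
rewards = [reward(text, completion, gold) for completion in group]
advantages = group_advantages(rewards)
```

With these weights the four completions score 2.0, 0.5, 1.75, and -2.0. Because each advantage is measured against the other samples drawn for the same note, the model is pushed toward the best extraction it can already produce for that input rather than toward an absolute score.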

The GRPO step focuses on the hard cases: chat-style prompts, multilingual text, and the formats most likely to trip the model up. The goal is not a pretty score, but extraction behavior that stays stable when the input changes.

Metric	Value
Entity F1	0.8380
Recall	0.8663
Copied-value error	1.35%
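For readers who want to reproduce numbers of this kind, here is one plausible way to compute them. It assumes exact matching on `(label, value)` pairs and reads "copied-value error" as the share of predicted values that do not appear verbatim in the source text. The article does not spell out either definition, so treat both as assumptions.

```python
Entity = tuple[str, str]  # (label, value)


def entity_scores(predicted: set[Entity], gold: set[Entity]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over (label, value) pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def copied_value_error(text: str, predicted: set[Entity]) -> float:
    """Share of predicted values that are not verbatim substrings of the text."""
    if not predicted:
        return 0.0
    not_copied = sum(1 for _, value in predicted if value not in text)
    return not_copied / len(predicted)


# Invented example: the phone number is predicted without its spaces.
text = "Pham Thi D, born 02/05/1970, tel 0912 000 111, Hoa Binh Clinic."
gold = {
    ("human_name", "Pham Thi D"),
    ("date", "02/05/1970"),
    ("phone_number", "0912 000 111"),
    ("company_name", "Hoa Binh Clinic"),
}
predicted = {
    ("human_name", "Pham Thi D"),
    ("date", "02/05/1970"),
    ("phone_number", "0912000111"),
}
scores = entity_scores(predicted, gold)
cve = copied_value_error(text, predicted)
```

In this example two of three predictions match, so precision is 2/3, recall is 2/4, and one of the three predicted values is not a verbatim copy.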

Entity F1

Label	F1
phone	0.948
(label missing)	0.925
date	0.861
(label missing)	0.813
address	0.795
name	0.759
company	0.328

Language F1, all 17 languages

Language	F1
Malay	0.859
Korean	0.854
Japanese	0.850
Chinese	0.846
Vietnamese	0.825
Filipino	0.813
Indonesian	0.808
Burmese	0.785
Portuguese	0.780
Spanish	0.777
Tamil	0.774
French	0.762
English	0.753
German	0.738
Thai	0.730
Russian	0.712
Lao	0.708

Failure modes to check first

An F1 score doesn't tell you how the model fails. In medical de-identification, the failure mode is what decides the consequence, so we look at each error type instead of one aggregate number. These are the ones we watch most closely.

Error type	Example	Consequence
Missed identifier	A hospital or department name not caught as `company_name`.	An organizational identifier stays in the text and can affect the audit trail.
Mistaken clinical fact	`BP 120/80`, `creatinine 86 µmol/L`, or `500 mg` treated as `id_number` or `date`.	The de-identified text loses facts needed for reasoning.
Date confusion	An admission date or a test date handled like a date of birth.	The care timeline gets broken.
Non-routine input	An email written as `nguyen dot an at...` or a phone number split with symbols.	Recall on such inputs cannot be claimed until it is measured on a separate challenge set.
Language leak	A synthetic sample of one language mixed with characters or phrases of another.	A data slice looks larger than its real quality.
Nested entities	A department name inside a longer hospital name.	The current schema does not handle overlapping spans yet.

The same F1 score can hide very different failures. In medical de-identification, the difference between error types is what determines the risk: whether an identifier remains exposed, or whether a clinical fact disappears from the record.
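A small worked example of that point, using invented data: two runs with the same F1, one of which leaks an identifier while the other erases clinical facts. The `error_profile` helper and the hand-listed `clinical_values` set are assumptions made for illustration.

```python
Entity = tuple[str, str]  # (label, value)


def error_profile(predicted: set[Entity], gold: set[Entity],
                  clinical_values: set[str]) -> dict[str, float]:
    """Report F1 next to the two error counts that carry different risks."""
    tp = len(predicted & gold)
    missed = gold - predicted      # identifiers left in the text
    spurious = predicted - gold    # spans flagged without reason
    erased = {e for e in spurious if e[1] in clinical_values}
    denominator = 2 * tp + len(spurious) + len(missed)
    return {
        "f1": 2 * tp / denominator if denominator else 0.0,
        "leaked_identifiers": len(missed),
        "erased_clinical_facts": len(erased),
    }


gold = {("human_name", "Do Van E"), ("date", "14/07/1958")}
clinical_values = {"BP 120/80", "500 mg"}

# Run A misses the date of birth and flags nothing extra.
run_a = {("human_name", "Do Van E")}
# Run B finds both identifiers but also flags two clinical facts.
run_b = gold | {("id_number", "BP 120/80"), ("id_number", "500 mg")}

profile_a = error_profile(run_a, gold, clinical_values)
profile_b = error_profile(run_b, gold, clinical_values)
```

Both runs score an F1 of 2/3. Run A leaves a date of birth exposed, while run B leaks nothing but removes a blood pressure reading and a dose from the record.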

Scope of the release

Meddies PII is not a finished de-identification product. The model only extracts structured identifier spans to feed into a de-identification process. In real operation, a hospital still needs policy, audit logs, local validation, a path to a human when risk is high, and deployment controls.

This release also does not prove that synthetic data matches data from real hospitals. Synthetic data gives a shareable training signal, but it can still distort frequencies, style, and rare patterns. If your hospital uses its own abbreviations, insurance codes, or very different department names, validate on your own data before trusting the model.

The seven-label set is an initial scope, not a complete labeling scheme for all privacy-protection requirements. We deliberately keep the label set small so it remains easier to inspect and stable across languages; its scope will expand gradually as the model is tested in real hospital environments. We hope this release gives research teams a strong starting point to build on.

Data and model are released under CC-BY-NC-4.0 for non-commercial research. For commercial use, or to review a specific language slice together, write to us at [email protected].

Meddies PII: An Open Multilingual De-identification Model for Clinical Text
