---
language:
- fr
pipeline_tag: token-classification
tags:
- medical
- ner
- nlp
- pseudonymisation
license: bsd-3-clause
library_name: edsnlp
model-index:
- name: AP-HP/eds-pseudo-public
  results:
  - task:
      type: token-classification
    dataset:
      name: AP-HP Pseudo Test
      type: private
    metrics:
    - type: precision
      name: Token Scores / ADRESSE / Precision
      value: 0.981694715087097
    - type: recall
      name: Token Scores / ADRESSE / Recall
      value: 0.9693877551020401
    - type: f1
      name: Token Scores / ADRESSE / F1
      value: 0.975502420419539
    - type: recall
      name: Token Scores / ADRESSE / Redact
      value: 0.9763848396501451
    - type: accuracy
      name: Token Scores / ADRESSE / Redact Full
      value: 0.9665697674418601
    - type: precision
      name: Token Scores / DATE / Precision
      value: 0.9899177066870131
    - type: recall
      name: Token Scores / DATE / Recall
      value: 0.984285249810339
    - type: f1
      name: Token Scores / DATE / F1
      value: 0.9870934434692821
    - type: recall
      name: Token Scores / DATE / Redact
      value: 0.9884035981359051
    - type: accuracy
      name: Token Scores / DATE / Redact Full
      value: 0.859011627906976
    - type: precision
      name: Token Scores / DATE_NAISSANCE / Precision
      value: 0.9753867791842471
    - type: recall
      name: Token Scores / DATE_NAISSANCE / Recall
      value: 0.968913726859937
    - type: f1
      name: Token Scores / DATE_NAISSANCE / F1
      value: 0.972139477834238
    - type: recall
      name: Token Scores / DATE_NAISSANCE / Redact
      value: 0.9933636046105481
    - type: accuracy
      name: Token Scores / DATE_NAISSANCE / Redact Full
      value: 0.9941860465116271
    - type: precision
      name: Token Scores / IPP / Precision
      value: 0.918987341772151
    - type: recall
      name: Token Scores / IPP / Recall
      value: 0.9075000000000001
    - type: f1
      name: Token Scores / IPP / F1
      value: 0.9132075471698111
    - type: recall
      name: Token Scores / IPP / Redact
      value: 0.985
    - type: accuracy
      name: Token Scores / IPP / Redact Full
      value: 0.9927325581395341
    - type: precision
      name: Token Scores / MAIL / Precision
      value: 0.9609144542772861
    - type: recall
      name: Token Scores / MAIL / Recall
      value: 0.9977029096477791
    - type: f1
      name: Token Scores / MAIL / F1
      value: 0.978963185574755
    - type: recall
      name: Token Scores / MAIL / Redact
      value: 0.9977029096477791
    - type: accuracy
      name: Token Scores / MAIL / Redact Full
      value: 0.9970930232558141
    - type: precision
      name: Token Scores / NDA / Precision
      value: 0.921428571428571
    - type: recall
      name: Token Scores / NDA / Recall
      value: 0.834951456310679
    - type: f1
      name: Token Scores / NDA / F1
      value: 0.8760611205432931
    - type: recall
      name: Token Scores / NDA / Redact
      value: 0.87378640776699
    - type: accuracy
      name: Token Scores / NDA / Redact Full
      value: 0.9723837209302321
    - type: precision
      name: Token Scores / NOM / Precision
      value: 0.9439770896724531
    - type: recall
      name: Token Scores / NOM / Recall
      value: 0.9525013545241101
    - type: f1
      name: Token Scores / NOM / F1
      value: 0.948220064724919
    - type: recall
      name: Token Scores / NOM / Redact
      value: 0.981578472096803
    - type: accuracy
      name: Token Scores / NOM / Redact Full
      value: 0.895348837209302
    - type: precision
      name: Token Scores / PRENOM / Precision
      value: 0.9348837209302321
    - type: recall
      name: Token Scores / PRENOM / Recall
      value: 0.9663461538461531
    - type: f1
      name: Token Scores / PRENOM / F1
      value: 0.950354609929078
    - type: recall
      name: Token Scores / PRENOM / Redact
      value: 0.99002849002849
    - type: accuracy
      name: Token Scores / PRENOM / Redact Full
      value: 0.9316860465116271
    - type: precision
      name: Token Scores / SECU / Precision
      value: 0.882838283828382
    - type: recall
      name: Token Scores / SECU / Recall
      value: 1
    - type: f1
      name: Token Scores / SECU / F1
      value: 0.9377738825591581
    - type: recall
      name: Token Scores / SECU / Redact
      value: 1
    - type: accuracy
      name: Token Scores / SECU / Redact Full
      value: 1.0
    - type: precision
      name: Token Scores / TEL / Precision
      value: 0.9746407438715131
    - type: recall
      name: Token Scores / TEL / Recall
      value: 0.9993932564791541
    - type: f1
      name: Token Scores / TEL / F1
      value: 0.9868618136688491
    - type: recall
      name: Token Scores / TEL / Redact
      value: 0.999479934124989
    - type: accuracy
      name: Token Scores / TEL / Redact Full
      value: 0.99563953488372
    - type: precision
      name: Token Scores / VILLE / Precision
      value: 0.96684350132626
    - type: recall
      name: Token Scores / VILLE / Recall
      value: 0.9376205787781351
    - type: f1
      name: Token Scores / VILLE / F1
      value: 0.9520078354554351
    - type: recall
      name: Token Scores / VILLE / Redact
      value: 0.9511254019292601
    - type: accuracy
      name: Token Scores / VILLE / Redact Full
      value: 0.9113372093023251
    - type: precision
      name: Token Scores / ZIP / Precision
      value: 0.9675036927621861
    - type: recall
      name: Token Scores / ZIP / Recall
      value: 1
    - type: f1
      name: Token Scores / ZIP / F1
      value: 0.983483483483483
    - type: recall
      name: Token Scores / ZIP / Redact
      value: 1
    - type: accuracy
      name: Token Scores / ZIP / Redact Full
      value: 1.0
    - type: precision
      name: Token Scores / micro / Precision
      value: 0.970393736698084
    - type: recall
      name: Token Scores / micro / Recall
      value: 0.9783320880510371
    - type: f1
      name: Token Scores / micro / F1
      value: 0.9743467434960551
    - type: recall
      name: Token Scores / micro / Redact
      value: 0.9884667701208881
    - type: accuracy
      name: Token Scores / micro / Redact Full
      value: 0.6308139534883721
extra_gated_fields:
  Organisation: text
  Intended use of the model:
    type: select
    options:
    - NLP Research
    - Education
    - Commercial Product
    - Clinical Data Warehouse
    - label: Other
      value: other
---
[Tests]() [Documentation](https://aphp.github.io/eds-pseudo/latest/) [Codecov](https://codecov.io/gh/aphp/eds-pseudo) [Poetry](https://python-poetry.org) [DVC](https://dvc.org) [Demo](https://eds-pseudo-public.streamlit.app/)
# EDS-Pseudo

This project aims at detecting identifying entities in documents, and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and is a hybrid model (rule-based + deep learning) for which we provide rules ([`eds-pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds_pseudo/pipes)) and a training recipe ([`train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py)).

We also provide some fictitious templates ([`templates.txt`](https://github.com/aphp/eds-pseudo/blob/main/data/templates.txt)) and a script to generate a synthetic dataset ([`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)).

The entities that are detected are listed below.

| Label            | Description                                                   |
|------------------|---------------------------------------------------------------|
| `ADRESSE`        | Street address, e.g. `33 boulevard de Picpus`                 |
| `DATE`           | Any absolute date other than a birthdate                      |
| `DATE_NAISSANCE` | Birthdate                                                     |
| `HOPITAL`        | Hospital name, e.g. `Hôpital Rothschild`                      |
| `IPP`            | Internal AP-HP identifier for patients, displayed as a number |
| `MAIL`           | Email address                                                 |
| `NDA`            | Internal AP-HP identifier for visits, displayed as a number   |
| `NOM`            | Any last name (patients, doctors, third parties)              |
| `PRENOM`         | Any first name (patients, doctors, etc.)                      |
| `SECU`           | Social security number                                        |
| `TEL`            | Any phone number                                              |
| `VILLE`          | Any city                                                      |
| `ZIP`            | Any zip code                                                  |

## Downloading the public pre-trained model

The public pre-trained model is available on the HuggingFace model hub at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public) and was trained on synthetic data (see [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)). You can also test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.

1. Install the latest version of edsnlp

   ```shell
   pip install "edsnlp[ml]" -U
   ```

2. Get access to the model at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public)

3. Create and copy a HuggingFace token with permission **"READ"** at https://huggingface.co/settings/tokens?new_token=true

4. Register the token (only once) on your machine

   ```python
   import huggingface_hub

   YOUR_TOKEN = "..."  # placeholder: paste the token created in step 3

   huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
   ```

5. Load the model

   ```python
   import edsnlp

   nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
   doc = nlp(
       "En 2015, M. Charles-François-Bienvenu "
       "Myriel était évêque de Digne. C’était un vieillard "
       "d’environ soixante-quinze ans ; il occupait le "
       "siège de Digne depuis 2006."
   )

   # Print each detected entity with its label and parsed date (if any)
   for ent in doc.ents:
       print(ent, ent.label_, str(ent._.date))
   ```

To apply the model on many documents using one or more GPUs, refer to the documentation of [edsnlp](https://aphp.github.io/edsnlp/latest/tutorials/multiple-texts/).

## Metrics

Token-level scores on the AP-HP Pseudo Test set, in percent.

| AP-HP Pseudo Test Token Scores   |   Precision |   Recall |   F1 |   Redact |   Redact Full |
|:---------------------------------|------------:|---------:|-----:|---------:|--------------:|
| ADRESSE                          |        98.2 |     96.9 | 97.6 |     97.6 |          96.7 |
| DATE                             |        99   |     98.4 | 98.7 |     98.8 |          85.9 |
| DATE_NAISSANCE                   |        97.5 |     96.9 | 97.2 |     99.3 |          99.4 |
| IPP                              |        91.9 |     90.8 | 91.3 |     98.5 |          99.3 |
| MAIL                             |        96.1 |     99.8 | 97.9 |     99.8 |          99.7 |
| NDA                              |        92.1 |     83.5 | 87.6 |     87.4 |          97.2 |
| NOM                              |        94.4 |     95.3 | 94.8 |     98.2 |          89.5 |
| PRENOM                           |        93.5 |     96.6 | 95   |     99   |          93.2 |
| SECU                             |        88.3 |    100   | 93.8 |    100   |         100   |
| TEL                              |        97.5 |     99.9 | 98.7 |     99.9 |          99.6 |
| VILLE                            |        96.7 |     93.8 | 95.2 |     95.1 |          91.1 |
| ZIP                              |        96.8 |    100   | 98.3 |    100   |         100   |
| micro                            |        97   |     97.8 | 97.4 |     98.8 |          63.1 |

## Installation to reproduce

If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:

```shell
git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo
```

Then install the dependencies.
We recommend pinning the library version in your projects, or using a strict package manager like [Poetry](https://python-poetry.org/).

```shell
poetry install
```

## How to use without machine learning

```python
import edsnlp

nlp = edsnlp.blank("eds")

# Some text cleaning
nlp.add_pipe("eds.normalizer")

# Various simple rules
nlp.add_pipe(
    "eds_pseudo.simple_rules",
    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)

# Address detection
nlp.add_pipe("eds_pseudo.addresses")

# Date detection
nlp.add_pipe("eds_pseudo.dates")

# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")

# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_)

# 2015 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# 2006 DATE
```

## How to train

Before training a model, you should update the [configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) and [pyproject.toml](https://github.com/aphp/eds-pseudo/blob/main/pyproject.toml) files to fit your needs.

Put your data in the `data/dataset` folder (or edit the paths in the `configs/config.cfg` file to point to `data/gen_dataset/train.jsonl`).

Then, run the training script:

```shell
python scripts/train.py --config configs/config.cfg --seed 43
```

This will train a model and save it in `artifacts/model-last`. You can evaluate it on the test set (defaults to `data/dataset/test.jsonl`) with:

```shell
python scripts/evaluate.py --config configs/config.cfg
```

To package it, run:

```shell
python scripts/package.py
```

This will create a `dist/eds-pseudo-aphp-***.whl` file that you can install with `pip install dist/eds-pseudo-aphp-***`.
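
Since the exact wheel file name depends on the package version, it can be convenient to look it up rather than type it. The snippet below is a minimal sketch, not part of eds-pseudo: the helper `latest_wheel` is a hypothetical name, and it only assumes that the packaging step wrote a `.whl` file into a `dist` folder, as described above.

```python
from pathlib import Path
from typing import Optional


def latest_wheel(dist_dir: str = "dist") -> Optional[Path]:
    """Return the most recently modified .whl file in dist_dir, or None.

    Hypothetical helper: it assumes the packaging step wrote its wheel
    into dist_dir, and makes no assumption about the wheel's file name.
    """
    folder = Path(dist_dir)
    if not folder.is_dir():
        return None
    # Sort wheels by modification time so the newest build comes last
    wheels = sorted(folder.glob("*.whl"), key=lambda p: p.stat().st_mtime)
    return wheels[-1] if wheels else None


if __name__ == "__main__":
    wheel = latest_wheel()
    if wheel is None:
        print("No wheel found in dist/; run scripts/package.py first.")
    else:
        # Print the install command instead of running it
        print(f"pip install {wheel}")
```

Once the wheel is installed, the packaged model can be imported like any other Python package.
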
You can use it in your code:

```python
import edsnlp

# Either from the model path directly
nlp = edsnlp.load("artifacts/model-last")

# Or from the wheel file
import eds_pseudo_aphp

nlp = eds_pseudo_aphp.load()
```

## Documentation

Visit the [documentation](https://aphp.github.io/eds-pseudo/) for more information!

## Publication

Please find our publication at the following link: https://doi.org/mkfv.

If you use EDS-Pseudo, please cite us as below:

```
@article{eds_pseudo,
  title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
  author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
  journal={Methods of Information in Medicine},
  year={2024},
  publisher={Georg Thieme Verlag KG}
}
```

## Acknowledgement

We would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/) and [AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.