Model Description
- A ClinicalBERT [Alsentzer et al., 2019] model fine-tuned for de-identification of medical notes.
- Sequence Labeling (token classification): The model was trained to predict protected health information (PHI/PII) entities (spans). A list of protected health information categories is given by HIPAA.
- A token can either be classified as non-PHI or as one of the 11 PHI types. Token predictions are aggregated to spans by making use of BILOU tagging.
- The PHI labels that were used for training and other details can be found here: Annotation Guidelines
- More details on how to use this model, the data format, and other useful information are available in the GitHub repo: Robust DeID.
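The aggregation of token predictions into spans mentioned above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the tag format (`B-PATIENT`, `U-DATE`, `O`) and the choice to drop malformed tag sequences are assumptions, and `bilou_to_spans` is a hypothetical helper name.

```python
from typing import List, Tuple


def bilou_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert per-token BILOU tags into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "U":
            # Unit: a single-token span
            spans.append((i, i + 1, entity))
            start, label = None, None
        elif prefix == "B":
            # Begin: open a new multi-token span
            start, label = i, entity
        elif prefix == "I" and start is not None and entity == label:
            # Inside: the open span continues
            continue
        elif prefix == "L" and start is not None and entity == label:
            # Last: close the open span
            spans.append((start, i + 1, label))
            start, label = None, None
        else:
            # "O" or an inconsistent sequence: discard any open span (assumption)
            start, label = None, None
    return spans


tags = ["O", "B-PATIENT", "L-PATIENT", "O", "U-DATE"]
print(bilou_to_spans(tags))  # [(1, 3, 'PATIENT'), (4, 5, 'DATE')]
```

Spans are returned as token indices; mapping them back to character offsets depends on the tokenizer used.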
How to use
- A demo of the model (using its predictions to de-identify a medical note) is available in this Space: Medical-Note-Deidentification.
- Steps on how this model can be used to run a forward pass can be found here: Forward Pass
- In brief, the steps are:
  - Sentencize (the model aggregates the sentences back to the note level) and tokenize the dataset.
  - Use the predict function of this model to gather the predictions (i.e., predictions for each token).
  - Additionally, the model predictions can be used to remove PHI from the original note/text.
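The last step, removing PHI from the original text, can be illustrated with a small sketch. It assumes the predictions have already been converted to character-level `(start, end, label)` spans; the `redact` helper and the `<<LABEL>>` placeholder format are hypothetical and not taken from the repository.

```python
from typing import List, Tuple


def redact(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with a <<LABEL>> placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already replaced
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<<{label}>>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Seen by Dr. Smith on 01/02/2014."
s = note.index("Smith")
d = note.index("01/02/2014")
spans = [(s, s + len("Smith"), "STAFF"), (d, d + len("01/02/2014"), "DATE")]
print(redact(note, spans))  # Seen by Dr. <<STAFF>> on <<DATE>>.
```

Replacing spans with a label placeholder keeps the note readable; other strategies (for example, substituting realistic surrogate values) would follow the same span-based pattern.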
Dataset
- The I2B2 2014 [Stubbs and Uzuner, 2015] dataset was used to train this model.
PHI LABEL | I2B2 TRAIN SET (790 NOTES) COUNT | TRAIN SET PERCENTAGE | I2B2 TEST SET (514 NOTES) COUNT | TEST SET PERCENTAGE |
---|---|---|---|---|
DATE | 7502 | 43.69 | 4980 | 44.14 |
STAFF | 3149 | 18.34 | 2004 | 17.76 |
HOSP | 1437 | 8.37 | 875 | 7.76 |
AGE | 1233 | 7.18 | 764 | 6.77 |
LOC | 1206 | 7.02 | 856 | 7.59 |
PATIENT | 1316 | 7.66 | 879 | 7.79 |
PHONE | 317 | 1.85 | 217 | 1.92 |
ID | 881 | 5.13 | 625 | 5.54 |
PATORG | 124 | 0.72 | 82 | 0.73 |
EMAIL | 4 | 0.02 | 1 | 0.01 |
OTHERPHI | 2 | 0.01 | 0 | 0 |
TOTAL | 17171 | 100 | 11283 | 100 |
Training procedure
Steps on how this model was trained can be found here: Training. The "model_name_or_path" was set to: "emilyalsentzer/Bio_ClinicalBERT".
- The dataset was sentencized with the en_core_sci_sm sentencizer (a scispaCy model for spaCy).
- The dataset was then tokenized with a custom tokenizer built on top of the en_core_sci_sm tokenizer.
- For each sentence we added 32 tokens on the left (from previous sentences) and 32 tokens on the right (from the next sentences).
- The added tokens are not used for learning (i.e., the loss is not computed on them); they serve only as additional context.
- Each sequence contained a maximum of 128 tokens (including the added context tokens). Longer sequences were split.
- The sentencized and tokenized dataset with the token level labels based on the BILOU notation was used to train the model.
- The model is fine-tuned from the pre-trained ClinicalBERT model (emilyalsentzer/Bio_ClinicalBERT).
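The context-window step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: `add_context` is a hypothetical helper, the ignore value of -100 is PyTorch's default `ignore_index` for cross-entropy loss (whether the training code uses that exact value is an assumption), and splitting long sentences into chunks of `max_tokens - 2 * context` labeled tokens is one possible reading of "longer sequences were split".

```python
from typing import List, Tuple

IGNORE = -100  # assumption: label value that the loss function skips


def add_context(
    sentences: List[List[str]],
    labels: List[List[int]],
    context: int = 32,
    max_tokens: int = 128,
) -> List[Tuple[List[str], List[int]]]:
    """Build training sequences: each sentence plus left/right context tokens."""
    flat = [tok for sent in sentences for tok in sent]
    budget = max_tokens - 2 * context  # labeled tokens allowed per sequence
    sequences = []
    offset = 0  # position of the current sentence's first token in `flat`
    for sent, sent_labels in zip(sentences, labels):
        # Sentences longer than the budget are split into several sequences
        for i in range(0, len(sent), budget):
            start = offset + i
            end = min(start + budget, offset + len(sent))
            left = flat[max(0, start - context):start]
            right = flat[end:end + context]
            tokens = left + flat[start:end] + right
            # Context tokens get the ignore label so no loss is computed on them
            seq_labels = (
                [IGNORE] * len(left)
                + sent_labels[i:i + budget]
                + [IGNORE] * len(right)
            )
            sequences.append((tokens, seq_labels))
        offset += len(sent)
    return sequences


# Tiny example with a 2-token context window instead of 32
sents = [["Patient", "is", "45"], ["Seen", "on", "Monday"]]
labs = [[0, 0, 1], [0, 0, 2]]
for tokens, seq_labels in add_context(sents, labs, context=2, max_tokens=8):
    print(tokens, seq_labels)
```

In this sketch every token is labeled in exactly one sequence and appears elsewhere only as unlabeled context, which matches the description that the added tokens do not contribute to the loss.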
Training details:
- Input sequence length: 128
- Batch size: 32
- Optimizer: AdamW
- Learning rate: 4e-5
- Dropout: 0.1
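For reference, the settings above could be collected into a configuration dictionary like the sketch below. The key names follow common Hugging Face conventions and are assumptions, not the repository's confirmed schema; in particular, treating the batch size of 32 as a per-device value assumes single-device training.

```python
import json

# Hypothetical config sketch: key names are illustrative assumptions
training_config = {
    "model_name_or_path": "emilyalsentzer/Bio_ClinicalBERT",
    "max_seq_length": 128,              # input sequence length
    "per_device_train_batch_size": 32,  # batch size (assumes one device)
    "optim": "adamw_torch",             # AdamW optimizer
    "learning_rate": 4e-5,
    "hidden_dropout_prob": 0.1,         # dropout
}

print(json.dumps(training_config, indent=2))
```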
Results
Questions?
Post a GitHub issue on the repo: Robust DeID.