Privacy-preserving mimic models for clinical named entity recognition in French

In this paper, we propose a Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the mimic learning approach. The idea of mimic learning is to annotate unlabeled public data with a private teacher model trained on the original sensitive data. The newly labeled public dataset is then used to train the student models. These student models can be shared without sharing the data itself or exposing the private teacher model that was built directly on this data.

CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model

To generate the CAS Privacy-Preserving Mimic Model, we used a private teacher model to annotate the unlabeled CAS clinical French corpus. The private teacher model is an NER model trained on the MERLOT clinical corpus and cannot be shared. Using the resulting silver annotations, we trained the CAS student model, namely the CAS Privacy-Preserving NER Mimic Model. This process can be viewed as a knowledge transfer from the teacher model to the student model in a privacy-preserving manner.
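The teacher/student flow described above can be illustrated with a deliberately tiny sketch. It is not the architecture used in the paper: the "tagger" here is a hypothetical most-frequent-label lexicon, and the sentences and labels (`DRUG`, `SYMPTOM`) are made up for illustration. It only shows the data flow: the teacher sees private data, the student sees only public text with silver labels.

```python
from collections import Counter, defaultdict

def train_tagger(labeled_sentences):
    """Train a toy tagger: each token gets its most frequent label."""
    counts = defaultdict(Counter)
    for tokens, labels in labeled_sentences:
        for token, label in zip(tokens, labels):
            counts[token.lower()][label] += 1
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

    def tag(tokens):
        # Unknown tokens fall back to the outside label "O".
        return [lexicon.get(t.lower(), "O") for t in tokens]

    return tag

# Hypothetical private (sensitive) gold data: never shared.
private_gold = [
    (["patient", "sous", "aspirine"], ["O", "O", "DRUG"]),
    (["fièvre", "persistante"], ["SYMPTOM", "O"]),
]
# Hypothetical public, unlabeled data.
public_unlabeled = [
    ["aspirine", "prescrite"],
    ["pas", "de", "fièvre"],
]

teacher = train_tagger(private_gold)  # stays private
# Silver annotations: public text labeled by the private teacher.
silver = [(tokens, teacher(tokens)) for tokens in public_unlabeled]
student = train_tagger(silver)  # trained on public text only, shareable

print(student(["fièvre", "et", "aspirine"]))
```

In this toy setting the student reproduces the teacher's behaviour on tokens that occur in the public data, while its training set contains no private sentence.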

We share only the weights of the CAS student model, which is trained on silver-labeled publicly released data. We argue that no potential attack could reveal information about sensitive private data using the silver annotations generated by the private teacher model on publicly available non-sensitive data.

Our model is built on the CamemBERT model using the Natural language structuring (NLstruct) library, which implements NER models that handle nested entities.

Download the CAS Privacy-Preserving NER Mimic Model

  import urllib.request
  from huggingface_hub import hf_hub_url
  # Download the fastText file into the current directory
  fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
  urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
  # Download the model checkpoint into a folder of your choice (the folder must exist)
  model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
  path_checkpoint = "path/to/your/folder/" + model_url.split('/')[-1]
  urllib.request.urlretrieve(model_url, path_checkpoint)

1. Load and use the model using only NLstruct

NLstruct is the Python library we used to build our CAS privacy-preserving NER mimic model; it handles nested entities.

Install the NLstruct library

  pip install nlstruct==0.1.0

Use the model

  from nlstruct import load_pretrained
  from nlstruct.datasets import load_from_brat, export_to_brat
  
  ner_model = load_pretrained(path_checkpoint)
  test_data = load_from_brat("path/to/brat/test")
  test_predictions = ner_model.predict(test_data)
  # Export the predictions into the BRAT standoff format
  export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
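To inspect the exported predictions without opening BRAT, the entity lines of an `.ann` file can be read with a few lines of Python. The sketch below assumes the standard BRAT standoff layout for text-bound annotations (`ID<TAB>TYPE START END<TAB>TEXT`, with `;` separating the fragments of a discontinuous entity). The helper name `parse_brat_entities` and the labels in the sample are hypothetical, not taken from the model's label set.

```python
def parse_brat_entities(ann_text):
    """Parse text-bound ("T") annotations from the content of a BRAT .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip attributes, relations, notes, and empty lines
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities list several "start end" pairs separated by ";"
        fragments = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        entities.append(
            {"id": ann_id, "label": label, "fragments": fragments, "text": text}
        )
    return entities

# Hypothetical .ann content: one contiguous and one discontinuous entity,
# followed by an attribute line that the parser ignores.
sample = (
    "T1\tsosy 0 6\tfièvre\n"
    "T2\tsubstance 20 28;35 40\taspirine 500mg\n"
    "A1\tNegation T1\n"
)

for entity in parse_brat_entities(sample):
    print(entity["id"], entity["label"], entity["fragments"], entity["text"])
```

For a real file, pass the result of `open(path, encoding="utf-8").read()` to the helper.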

2. Load the model using NLstruct and use it with the Medkit library

Medkit is a Python library for facilitating the extraction of features from various modalities of patient data, including textual data.

Install the Medkit library

  python -m pip install 'medkit-lib'

Use the model

Our model can be implemented as a Medkit operation module as follows:

import os
from nlstruct import load_pretrained
import urllib.request
from huggingface_hub import hf_hub_url

from medkit.io.brat import BratInputConverter, BratOutputConverter
from medkit.core import Attribute
from medkit.core.text import NEROperation, Entity, Span, Segment, span_utils

class CAS_matcher(NEROperation):
    def __init__(self):
        # Download the fastText file if it is not already present
        fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
        if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
            urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
        # Download the model checkpoint if it is not already present
        model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
        path_checkpoint = "ner_model/" + model_url.split('/')[-1]
        # urlretrieve does not create missing directories
        os.makedirs("ner_model", exist_ok=True)
        if not os.path.exists(path_checkpoint):
            urllib.request.urlretrieve(model_url, path_checkpoint)
        
        self.model = load_pretrained(path_checkpoint)
        self.model.eval()

    def run(self, segments):
        """Return entities for each match in `segments`.

        Parameters
        ----------
        segments:
            List of segments into which to look for matches.

        Returns
        -------
        List[Entity]
            Entities found in `segments`.
        """
        # collect the entities predicted for each segment
        entities = []
        for segment in segments:
            matches = self.model.predict({"doc_id": segment.uid, "text": segment.text})
            entities.extend(self._matches_to_entities(matches, segment))
        return entities

    def _matches_to_entities(self, matches, segment: Segment):    
        for match in matches["entities"]:
            text_all,spans_all = [],[]
            
            for fragment in match["fragments"]:
                text, spans = span_utils.extract(
                    segment.text, segment.spans, [(fragment["begin"], fragment["end"])]
                )
                text_all.append(text)
                spans_all.extend(spans)

            text_all = "".join(text_all)
            entity = Entity(
                label=match["label"],
                text=text_all,
                spans=spans_all,
            )

            score_attr = Attribute(
                label="confidence",
                value=float(match["confidence"]),
                #metadata=dict(model=self.model.path_checkpoint),
            )
            entity.attrs.add(score_attr)
            yield entity

brat_converter = BratInputConverter()
docs = brat_converter.load("path/to/brat/test")
matcher = CAS_matcher()
for doc in docs:
    entities = matcher.run([doc.raw_segment])
    for ent in entities:
        doc.anns.add(ent)
brat_output_converter = BratOutputConverter(attrs=[])
# To keep the same document names in the output folder
doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0] for doc in docs]
brat_output_converter.save(docs, dir_path="path/to/exported_brat", doc_names=doc_names)
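The `_matches_to_entities` method above reads each prediction as a dictionary with an `"entities"` list, where every entity carries a `"label"`, a `"confidence"`, and one or more `"fragments"` with `"begin"`/`"end"` character offsets. The following sketch mimics that shape with a hand-written prediction to show how nested and discontinuous entities map back to the text. The sentence, labels, scores, and the helper `fragment_texts` are hypothetical and only mirror the keys used in the class; they are not real model output.

```python
def fragment_texts(text, entity):
    """Return the text covered by each fragment of a predicted entity."""
    return [text[f["begin"]:f["end"]] for f in entity["fragments"]]

text = "Douleur thoracique et abdominale."

# Hypothetical prediction with the keys read by `_matches_to_entities`.
prediction = {
    "entities": [
        # Contiguous entity: a single fragment
        {"label": "sosy", "confidence": 0.97,
         "fragments": [{"begin": 0, "end": 18}]},
        # Discontinuous entity overlapping the first one: two fragments
        {"label": "sosy", "confidence": 0.81,
         "fragments": [{"begin": 0, "end": 7}, {"begin": 22, "end": 32}]},
    ]
}

for entity in prediction["entities"]:
    print(entity["label"], entity["confidence"], fragment_texts(text, entity))
```

Keeping one span per fragment, as the class does with `spans_all`, preserves the exact offsets of discontinuous entities when they are written back to BRAT.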

Environmental Impact

Carbon emissions are estimated using the Carbontracker tool. The version available at the time of our experiments computed its estimates using the average carbon intensity of the European Union in 2017 rather than the value for France (294.21 gCO2eq/kWh vs. 85 gCO2eq/kWh). Therefore, the reported carbon footprint of training both the private model that generated the silver annotations and the CAS student model is overestimated.

  • Hardware Type: GPU NVIDIA GTX 1080 Ti
  • Compute Region: Gif-sur-Yvette, Île-de-France, France
  • Carbon Emitted: 292 gCO2eq
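As a rough back-of-the-envelope check, the reported figure can be rescaled to the French carbon intensity. This sketch assumes that emissions are simply energy consumed multiplied by carbon intensity, and that the 292 gCO2eq above was computed entirely with the EU 2017 value; both are simplifying assumptions, so the result is indicative only.

```python
EU_2017_INTENSITY = 294.21   # gCO2eq/kWh, value used by the tool
FRANCE_INTENSITY = 85.0      # gCO2eq/kWh, value for France
REPORTED_EMISSIONS = 292.0   # gCO2eq, as reported above

# Recover the energy consumption implied by the reported emissions
energy_kwh = REPORTED_EMISSIONS / EU_2017_INTENSITY
# Re-estimate the emissions with the French carbon intensity
rescaled_emissions = energy_kwh * FRANCE_INTENSITY
# Ratio between the two intensities, i.e. the overestimation factor
overestimation_factor = EU_2017_INTENSITY / FRANCE_INTENSITY

print(f"Energy: {energy_kwh:.2f} kWh")
print(f"Rescaled emissions: {rescaled_emissions:.1f} gCO2eq")
print(f"Overestimation factor: {overestimation_factor:.2f}")
```

Under these assumptions the training corresponds to about 1 kWh of energy and roughly 84 gCO2eq with the French intensity, about 3.5 times less than the reported value.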

Acknowledgements

We thank the institutions and colleagues who made it possible to use the datasets described in this study: the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus, and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank the ITMO Cancer Aviesan for funding our research, and the HeKA research team for integrating our model into their library Medkit.

Citation

If you use this model in your research, please make sure to cite our paper:

@article{BANNOUR2022104073,  
title = {Privacy-preserving mimic models for clinical named entity recognition in French},  
journal = {Journal of Biomedical Informatics},  
volume = {130},  
pages = {104073},  
year = {2022},  
issn = {1532-0464},  
doi = {https://doi.org/10.1016/j.jbi.2022.104073},  
url = {https://www.sciencedirect.com/science/article/pii/S1532046422000892}  
}