---
license: mit
language:
- fr
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
tags:
- medical
- biomedical
- medkit-lib
---


# DrBERT-CASM2

## Model description

**DrBERT-CASM2** is a French Named Entity Recognition model that was fine-tuned from 
[DrBERT](https://huggingface.co/Dr-BERT/DrBERT-4GB-CP-PubMedBERT): A PreTrained model in French for biomedical and clinical domains. 
It has been trained to detect the following types of entities: **problem**, **treatment** and **test** using the medkit Trainer.

- **Fine-tuned using** medkit [GitHub Repo](https://github.com/TeamHeka/medkit)
- **Developed by** @camila-ud, medkit, HeKA Research team
- **Dataset from** @aneuraz, CASM2

# Intended uses & limitations

## Limitations and bias

This model was trained for **development and test phases**. 
It is limited by its training dataset and should be used with caution. 
Results are not guaranteed, and the model should be used only in data exploration stages.
It may help detect entities in the early stages of analyzing medical documents in French.

The maximum sequence length was reduced to **128 tokens** to minimize training time.
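
Because of this 128-token limit, longer documents may need to be cut into sentence-sized pieces before they are passed to the model. The following is a minimal sketch of one way to do that, not part of medkit: `split_sentences` and its `max_words` parameter are hypothetical names, the regex split is naive, and a word count is only a rough proxy for the model's subword token count.

```python
import re


def split_sentences(text, max_words=64):
    """Naively split text on sentence-ending punctuation, then cap piece length.

    max_words is an arbitrary, conservative default (an assumption): subword
    tokenization usually yields more tokens than whitespace-separated words.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    pieces = []
    for sentence in sentences:
        words = sentence.split()
        # Cut overly long sentences into consecutive windows of max_words words.
        for i in range(0, len(words), max_words):
            pieces.append(" ".join(words[i:i + max_words]))
    return pieces


pieces = split_sentences("Elle souffre d'asthme. Elle n'a pas besoin d'Allegra.")
print(pieces)
```

Each piece can then be wrapped in its own document and processed separately.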

# How to use

## Install medkit

First of all, please install medkit with the following command:

```
pip install 'medkit-lib[optional]'
```

Please check the [documentation](https://medkit.readthedocs.io/en/latest/user_guide/install.html) for more info and examples.

## Using the model

```python
from medkit.core.text import TextDocument
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

matcher = HFEntityMatcher(model="camila-ud/DrBERT-CASM2")
test_doc = TextDocument("Elle souffre d'asthme mais n'a pas besoin d'Allegra")

# detect entities in the raw segment
detected_entities = matcher.run([test_doc.raw_segment]) 
msg = "|".join(f"'{entity.label}':{entity.text}" for entity in detected_entities)
print(f'Text: "{test_doc.text}"\n{msg}')
```
```
Text: "Elle souffre d'asthme mais n'a pas besoin d'Allegra"
'problem':asthme|'treatment':Allegra
```

# Training data 

This model was fine-tuned on **CASM2**, an internal corpus of clinical cases (in French) annotated by master's students. 
The corpus contains more than 5000 medkit documents (~ phrases) with entities to detect. 

**Number of documents (~ phrases) by split**

| Split      | # medkit docs |
| ---------- | ------------- |
| Train      | 5824          |
| Validation | 1457          |
| Test       | 1821          |
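
As a quick sanity check on the table above, the share of each split can be computed from the document counts. This is plain arithmetic on the reported numbers; the variable names are illustrative only.

```python
# Document counts copied from the split table above.
splits = {"Train": 5824, "Validation": 1457, "Test": 1821}

total = sum(splits.values())
for name, n_docs in splits.items():
    print(f"{name}: {n_docs} docs ({n_docs / total:.1%})")
print(f"Total: {total} docs")
```

The counts correspond to a split of roughly 64% / 16% / 20%.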


**Number of examples per entity type**

| Split      | treatment | test | problem |
| ---------- | --------- | ---- | ------- |
| Train      | 3258      | 3990 | 6808    |
| Validation | 842       | 1007 | 1745    |
| Test       | 994       | 1289 | 2113    |
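
The label distribution within each split follows from the counts above. The sketch below only sums and normalizes the reported numbers; the names are illustrative.

```python
# Entity counts copied from the table above: (treatment, test, problem).
entity_counts = {
    "Train": {"treatment": 3258, "test": 3990, "problem": 6808},
    "Validation": {"treatment": 842, "test": 1007, "problem": 1745},
    "Test": {"treatment": 994, "test": 1289, "problem": 2113},
}

totals = {split: sum(counts.values()) for split, counts in entity_counts.items()}
for split, counts in entity_counts.items():
    shares = ", ".join(
        f"{label}: {n / totals[split]:.1%}" for label, n in counts.items()
    )
    print(f"{split} ({totals[split]} entities): {shares}")
```

In every split, **problem** is the most frequent entity type, at a little under half of all annotations.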

## Training procedure

This model was fine-tuned on CPU using the medkit trainer; training takes about 3 hours.

# Model performance

Model performance computed on the CASM2 test dataset (using the medkit seqeval evaluator):

| Entity    | Precision | Recall | F1     |
| --------- | --------- | ------ | ------ |
| treatment | 0.7492    | 0.7666 | 0.7578 |
| test      | 0.7449    | 0.8240 | 0.7824 |
| problem   | 0.6884    | 0.7304 | 0.7088 |
| Overall   | 0.7188    | 0.7660 | 0.7416 |
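
The F1 column is the harmonic mean of precision and recall. As a small illustration, the sketch below recomputes it from the reported precision and recall; because those inputs are already rounded to four decimals, the recomputed value can differ from the reported one in the last digit.

```python
# (precision, recall, reported F1) copied from the table above.
reported = {
    "treatment": (0.7492, 0.7666, 0.7578),
    "test": (0.7449, 0.8240, 0.7824),
    "problem": (0.6884, 0.7304, 0.7088),
    "Overall": (0.7188, 0.7660, 0.7416),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for entity, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{entity}: recomputed F1 = {recomputed:.4f}, reported = {f1:.4f}")
```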

## How to evaluate using medkit
```python
from medkit.text.metrics.ner import SeqEvalEvaluator
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# `test_documents` must be provided by the user: a list of TextDocument
# objects that already contain the reference (gold) entities.

# load the matcher and get predicted entities by document
matcher = HFEntityMatcher(model="camila-ud/DrBERT-CASM2")
predicted_entities = [matcher.run([doc.raw_segment]) for doc in test_documents]

# define seqeval evaluator
evaluator = SeqEvalEvaluator(tagging_scheme="iob2")
evaluator.compute(test_documents, predicted_entities=predicted_entities)
```
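
To give an intuition for what an entity-level score with the IOB2 scheme measures, here is a minimal pure-Python sketch. It is an illustration only, not the medkit or seqeval implementation: `iob2_spans` and `entity_scores` are hypothetical helpers, a predicted entity counts as correct only if its label and both boundaries match exactly, and stray `I-` tags with no preceding `B-` are ignored (an assumption).

```python
def iob2_spans(tags):
    """Return the set of (label, start, end) spans encoded by IOB2 tags."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_scores(gold_tags, predicted_tags):
    """Exact-match entity-level precision, recall and F1."""
    gold, predicted = iob2_spans(gold_tags), iob2_spans(predicted_tags)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1


gold = ["O", "B-problem", "I-problem", "O", "B-treatment"]
pred = ["O", "B-problem", "O", "O", "B-treatment"]
scores = entity_scores(gold, pred)
print(scores)
```

In this toy example the predicted **problem** span is one token too short, so only the **treatment** entity counts as a match.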

# Citation

```
@online{medkit-lib,
  author={HeKA Research Team},
  title={medkit, A Python library for a learning health system},
  url={https://pypi.org/project/medkit-lib/},
  urldate = {2023-07-24}, 
}
```
```
HeKA Research Team, “medkit, a Python library for a learning health system.” https://pypi.org/project/medkit-lib/ (accessed Jul. 24, 2023).
```