abazoge committed on
Commit b7c9925 · verified · 1 Parent(s): 768c976

Create README.md

Files changed (1)
  1. README.md +46 -0
README.md ADDED
@@ -0,0 +1,46 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - Dr-BERT/NACHOS
+ language:
+ - fr
+ library_name: transformers
+ tags:
+ - biomedical
+ - medical
+ - clinical
+ - life science
+ ---
+ # DrLongformer
+
+ <span style="font-size:larger;">**DrLongformer**</span> is a French pretrained Longformer model based on Clinical-Longformer and further pretrained on the NACHOS dataset (the same dataset used for [DrBERT](https://github.com/qanastek/DrBERT)). The model accepts inputs of up to 4,096 tokens. DrLongformer outperforms medical BERT-based models on most downstream tasks regardless of sequence length, with the exception of NER tasks. The evaluated downstream tasks cover named entity recognition (NER), multiple-choice question answering (MCQA), semantic textual similarity (STS), and text classification (CLS). For more details, please refer to our paper: [Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study]().
+
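Documents longer than the 4,096-token limit have to be truncated or split before they reach the model. The sketch below shows one possible way to split a list of token IDs into overlapping windows. The helper name `split_into_windows` and the default `overlap` value are illustrative assumptions; they are not taken from the DrLongformer repository or from the evaluation setup of the paper.

```python
def split_into_windows(token_ids, max_len=4096, overlap=256):
    """Split a token ID list into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens so that context around a
    window boundary is not lost. Hypothetical helper, for illustration only.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window already reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
    return windows
```

In practice the limit includes any special tokens the tokenizer adds, so a slightly smaller `max_len` may be needed when splitting raw token IDs.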
+ ### Model pretraining
+ We explored multiple strategies for adapting Longformer models to the French medical domain:
+ - Further pretraining of an English clinical Longformer on French medical data.
+ - Converting a French medical BERT model to the Longformer architecture.
+ - Pretraining a Longformer from scratch on French medical data.
+
+ All pretraining scripts needed to reproduce the experiments are available in this GitHub repository: [DrLongformer](https://github.com/abazoge/DrLongformer).
+ For the `from scratch` and `further pretraining` strategies, the training scripts are the same as those of [DrBERT](https://github.com/qanastek/DrBERT); only the bash scripts differ, and they are available in this repository.
+
+ All models were trained on the [Jean Zay](http://www.idris.fr/jean-zay/) French supercomputer.
+
+ | Model name | Corpus | Pretraining strategy | Sequence length | Model URL |
+ | :------: | :---: | :---: | :---: | :---: |
+ | `DrLongformer` | NACHOS 7 GB | Further pretraining of [Clinical-Longformer](https://huggingface.co/yikuan8/Clinical-Longformer) | 4096 | [HuggingFace](https://huggingface.co/abazoge/DrLongformer) |
+ | `DrBERT-4096` | NACHOS 7 GB | Conversion of [DrBERT-7GB](https://huggingface.co/Dr-BERT/DrBERT-7GB) to the Longformer architecture | 4096 | [HuggingFace](https://huggingface.co/abazoge/DrBERT-4096) |
+ | `DrLongformer-FS (from scratch)` | NACHOS 7 GB | Pretraining from scratch | 4096 | Not available |
+
+
+ ### Model usage
+ You can use DrLongformer directly with [Hugging Face's Transformers](https://github.com/huggingface/transformers):
+ ```python
+ # !pip install transformers
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # Load the tokenizer and the masked-language-model weights from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("abazoge/DrLongformer")
+ model = AutoModelForMaskedLM.from_pretrained("abazoge/DrLongformer")
+ ```
+
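Since the checkpoint is loaded as a masked language model, a quick sanity check is to mask a word and look at the top predictions. The following is a minimal sketch using the Transformers `fill-mask` pipeline. The helper names `mask_word` and `predict_masked_word` and the example sentence are hypothetical, and the mask token is read from the tokenizer at runtime instead of being hard-coded.

```python
def mask_word(text, word, mask_token):
    """Replace the first occurrence of `word` in `text` with `mask_token`."""
    if word not in text:
        raise ValueError(f"{word!r} not found in text")
    return text.replace(word, mask_token, 1)


def predict_masked_word(text, word, model_name="abazoge/DrLongformer", top_k=5):
    """Return (token, score) pairs predicted for the masked `word`.

    Hypothetical helper: downloads the model from the Hugging Face Hub
    on first use, so it needs network access and `transformers` installed.
    """
    # Imported lazily so that `mask_word` stays usable without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    masked = mask_word(text, word, fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked, top_k=top_k)]
```

For example, `predict_masked_word("Le patient souffre de diabète.", "diabète")` would return the five most likely fillers for the masked position together with their scores.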
+ ### Citation
+ ```