Commit dcd9e8c (1 parent: 2ba0508) by KennethTM: Create README.md

Files changed (1): README.md (+50 -0)
README.md ADDED
---
license: mit
datasets:
- oscar
- DDSC/dagw_reddit_filtered_v1.0.0
- graelo/wikipedia
language:
- da
widget:
- text: Der var engang en [MASK]
---

# What is this?

A pretrained BERT model (base version, ~110M parameters) for Danish NLP. The model was not pre-trained from scratch but adapted from the English [bert-base-uncased](https://huggingface.co/bert-base-uncased) model.

# How to use

Test the model using the pipeline from the [🤗 Transformers](https://github.com/huggingface/transformers) library:

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="KennethTM/bert-base-uncased-danish")

pipe("Der var engang en [MASK]")
```
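
The pipeline returns a list of the most likely completions for the masked position, each with a confidence score, the predicted token, and the completed sequence.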

Or load it using the Auto* classes:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("KennethTM/bert-base-uncased-danish")
model = AutoModelForMaskedLM.from_pretrained("KennethTM/bert-base-uncased-danish")
```
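
As a quick usage sketch (not part of the original README), the loaded tokenizer and model can fill the mask directly; this uses only standard Transformers and PyTorch APIs:

```python
# Minimal sketch, assuming the tokenizer and model loaded above.
import torch

text = "Der var engang en [MASK]"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five most likely token ids.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices

print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```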

# Model training

The model is trained on multiple Danish datasets (listed above) using a context length of 512 tokens.

The model weights are initialized from the English [bert-base-uncased model](https://huggingface.co/bert-base-uncased), with new word token embeddings created for Danish using [WECHSEL](https://github.com/CPJKU/wechsel).
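
For illustration (not part of the original README), the embedding transfer roughly follows the recipe from the WECHSEL repository; the fastText language code ("da"), bilingual dictionary name ("danish"), and OSCAR split below are assumptions, not confirmed details of this model's training:

```python
# Hypothetical sketch of the WECHSEL embedding transfer, adapted from the
# example in https://github.com/CPJKU/wechsel (which uses Swahili).
import torch
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Train a Danish tokenizer with the same vocabulary size as the English one.
target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_da", split="train")["text"],
    vocab_size=len(source_tokenizer),
)

# "da" embeddings and the "danish" dictionary are assumptions.
wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("da"),
    bilingual_dictionary="danish",
)

# Initialize Danish word embeddings from the English ones.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)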

Initially, only the word token embeddings are trained using XXXX samples. Finally, the whole model is trained using XXXX samples.
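
A two-stage schedule like this is typically implemented by freezing everything except the embeddings first; a minimal sketch of that pattern (an assumption, not the author's training script):

```python
# Hypothetical sketch of the two-stage schedule (not the author's script).
# Stage 1: train only the word token embeddings.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

# ... run stage-1 training ...

# Stage 2: unfreeze and train the whole model.
for param in model.parameters():
    param.requires_grad = True
```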

# Evaluation

TO DO