Commit dcd9e8c (1 parent: 2ba0508) by KennethTM: Create README.md

Files changed (1): README.md (+50 -0)
README.md ADDED
---
license: mit
datasets:
- oscar
- DDSC/dagw_reddit_filtered_v1.0.0
- graelo/wikipedia
language:
- da
widget:
- text: Der var engang en [MASK]
---

# What is this?

A pretrained BERT model (base version, ~110M parameters) for Danish NLP. The model was not pre-trained from scratch but adapted from the English [bert-base-uncased](https://huggingface.co/bert-base-uncased) model.

# How to use

Test the model using the pipeline from the [🤗 Transformers](https://github.com/huggingface/transformers) library:

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="KennethTM/bert-base-uncased-danish")

pipe("Der var engang en [MASK]")
```
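
The pipeline returns a list of the most likely completions for the masked position, each with a confidence score, the predicted token, and the completed sequence.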

Or load it using the Auto* classes:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("KennethTM/bert-base-uncased-danish")
model = AutoModelForMaskedLM.from_pretrained("KennethTM/bert-base-uncased-danish")
```
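
As a quick usage sketch (not part of the original README), the loaded tokenizer and model can fill the mask directly; this uses only standard Transformers and PyTorch APIs:

```python
# Minimal sketch, assuming the tokenizer and model loaded above.
import torch

text = "Der var engang en [MASK]"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five most likely token ids.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices

print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```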

# Model training

The model is trained on multiple Danish datasets (listed above) using a context length of 512 tokens.

The model weights are initialized from the English [bert-base-uncased model](https://huggingface.co/bert-base-uncased), with new word token embeddings created for Danish using [WECHSEL](https://github.com/CPJKU/wechsel).
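
For illustration (not part of the original README), the embedding transfer roughly follows the recipe from the WECHSEL repository; the fastText language code ("da"), bilingual dictionary name ("danish"), and OSCAR split below are assumptions, not confirmed details of this model's training:

```python
# Hypothetical sketch of the WECHSEL embedding transfer, adapted from the
# example in https://github.com/CPJKU/wechsel (which uses Swahili).
import torch
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Train a Danish tokenizer with the same vocabulary size as the English one.
target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_da", split="train")["text"],
    vocab_size=len(source_tokenizer),
)

# "da" embeddings and the "danish" dictionary are assumptions.
wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("da"),
    bilingual_dictionary="danish",
)

# Initialize Danish word embeddings from the English ones.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)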

Initially, only the word token embeddings are trained using XXXX samples. Finally, the whole model is trained using XXXX samples.
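
A two-stage schedule like this is typically implemented by freezing everything except the embeddings first; a minimal sketch of that pattern (an assumption, not the author's training script):

```python
# Hypothetical sketch of the two-stage schedule (not the author's script).
# Stage 1: train only the word token embeddings.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

# ... run stage-1 training ...

# Stage 2: unfreeze and train the whole model.
for param in model.parameters():
    param.requires_grad = True
```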

# Evaluation

TO DO