# Latvian BERT base model (cased)

A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives.
It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).

This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.

Developed at [AiLab.lv](https://ailab.lv).

## Training data

LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); 500M tokens in total.

## Tokenization

A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens.
It was then converted to the WordPiece format used by BERT.
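WordPiece splits each word into subword pieces by greedy longest-match against the vocabulary, marking word-internal pieces with a `##` prefix. A minimal sketch of that lookup, using a made-up toy vocabulary for illustration rather than LVBERT's actual 32,000-token one:

```python
# Greedy longest-match WordPiece tokenization of a single word.
# TOY_VOCAB is a hypothetical example vocabulary, not LVBERT's real one.
TOY_VOCAB = {"[UNK]", "valod", "##a", "##as", "lat", "##vie", "##šu"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Split one word into WordPiece subtokens via greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get '##'
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens
```

For example, `wordpiece_tokenize("valoda")` returns `["valod", "##a"]`, and a word with no matching pieces collapses to `["[UNK]"]`.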

## Pretraining

We used the BERT-base configuration: 12 layers, 768 hidden units, 12 attention heads, a sequence length of 128, a mini-batch size of 128, and a 32,000-token vocabulary.
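The masked language modeling objective corrupts the input before prediction: roughly 15% of positions are selected, and of those 80% are replaced by `[MASK]`, 10% by a random token, and 10% left unchanged. A rough sketch of that corruption step (the `MASK_ID` value and token IDs here are illustrative assumptions, not LVBERT's actual special-token IDs):

```python
import random

MASK_ID = 4          # hypothetical [MASK] token ID, for illustration only
VOCAB_SIZE = 32000   # matches LVBERT's 32k vocabulary

def mask_tokens(token_ids, rng, mask_prob=0.15):
    """Apply BERT-style 80/10/10 masking; return (inputs, labels)."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok        # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                 # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```

Only the selected positions contribute to the loss; all others carry the ignore label.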

## Citation

Please cite this paper if you use LVBERT: