Commit fc6e6b2 by normundsg (1 parent: 69ffc04)

Updated README

Files changed (1): README.md +14 -1
README.md CHANGED
# Latvian BERT base model (cased)

A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives.
It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).

This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.

Developed at [AiLab.lv](https://ailab.lv)
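
To experiment with the model before fine-tuning, it can be loaded through the Hugging Face Transformers library. The snippet below is a minimal sketch; the repository id `AiLab-IMCS-UL/lvbert` is an assumption and should be replaced with the actual id of this model on the Hub:

```python
# Minimal sketch, not an official example. The model id is an assumption;
# substitute the id of the Hub repository actually hosting this checkpoint.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "AiLab-IMCS-UL/lvbert"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Before any fine-tuning, the MLM head can already fill in masked tokens.
inputs = tokenizer("Rīga ir Latvijas [MASK].", return_tensors="pt")
logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring prediction.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))
```

For downstream NLU tasks, the same checkpoint can be loaded with a task-specific head such as `AutoModelForSequenceClassification`.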
 
## Training data

LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas) and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs), totalling 500M tokens.
## Tokenization

A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens.
It was then converted to the WordPiece format used by BERT.
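
The exact conversion procedure is not described here, but the two steps can be sketched roughly as follows. The file paths are placeholders, and rewriting SentencePiece's word-initial `▁` marker into WordPiece's word-internal `##` prefix is a simplifying assumption, not the authors' exact procedure:

```python
# Illustrative sketch only; paths are placeholders and the conversion rule
# is a simplification of how a SentencePiece vocabulary maps to WordPiece.
import sentencepiece as spm

# Step 1: train a SentencePiece model with a 32,000-token vocabulary.
spm.SentencePieceTrainer.train(
    input="latvian_corpus.txt",  # placeholder: pretraining text, one sentence per line
    model_prefix="lvbert_sp",
    vocab_size=32000,
)

# Step 2: rewrite the vocabulary into BERT's WordPiece format. SentencePiece
# marks word-initial pieces with "▁", while WordPiece marks word-internal
# pieces with "##", so the markers have to be swapped around.
with open("lvbert_sp.vocab", encoding="utf-8") as src, \
        open("vocab.txt", "w", encoding="utf-8") as dst:
    for line in src:
        piece = line.split("\t")[0]
        if piece == "▁":
            continue  # skip the bare word-boundary marker
        if piece.startswith("▁"):
            dst.write(piece[1:] + "\n")     # word-initial piece: drop the marker
        elif piece.startswith("<"):
            dst.write(piece + "\n")         # control symbols such as <unk>
        else:
            dst.write("##" + piece + "\n")  # word-internal piece: add "##"
```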
## Pretraining

We used the BERT-base configuration with 12 layers, 768 hidden units, 12 attention heads, a sequence length of 128, a mini-batch size of 128 and a 32,000-token vocabulary.
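
For orientation, these hyperparameters can be expressed as a Hugging Face Transformers configuration. This is a sketch rather than the original pretraining setup; in particular, sizing the position embeddings to the 128-token sequence length is an assumption:

```python
# Sketch of the stated hyperparameters; not the original pretraining code.
from transformers import BertConfig, BertForPreTraining

config = BertConfig(
    vocab_size=32000,             # 32,000-token vocabulary
    num_hidden_layers=12,         # 12 Transformer layers
    hidden_size=768,              # 768 hidden units
    num_attention_heads=12,       # 12 attention heads
    max_position_embeddings=128,  # assumption: matched to the 128-token sequences
)

# BertForPreTraining combines the two objectives named above: a masked
# language modeling head and a next sentence prediction head. (The
# mini-batch size of 128 belongs to the training loop, not the config.)
model = BertForPreTraining(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```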
## Citation

Please cite this paper if you use LVBERT: