Commit fc6e6b2 by normundsg (1 parent: 69ffc04)

Updated README

Files changed (1): README.md +14 -1
README.md CHANGED
# Latvian BERT base model (cased)

A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives.
It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).

This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.

Developed at [AiLab.lv](https://ailab.lv)
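
To experiment with the model before fine-tuning, it can be loaded through the Hugging Face Transformers library. The snippet below is a minimal sketch; the repository id `AiLab-IMCS-UL/lvbert` is an assumption and should be replaced with the actual id of this model on the Hub:

```python
# Minimal sketch, not an official example. The model id is an assumption;
# substitute the id of the Hub repository actually hosting this checkpoint.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "AiLab-IMCS-UL/lvbert"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Before any fine-tuning, the MLM head can already fill in masked tokens.
inputs = tokenizer("Rīga ir Latvijas [MASK].", return_tensors="pt")
logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring prediction.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))
```

For downstream NLU tasks, the same checkpoint can be loaded with a task-specific head such as `AutoModelForSequenceClassification`.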
 
## Training data

LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas) and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs), totalling 500M tokens.
## Tokenization

A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens.
It was then converted to the WordPiece format used by BERT.
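
The exact conversion procedure is not described here, but the two steps can be sketched roughly as follows. The file paths are placeholders, and rewriting SentencePiece's word-initial `▁` marker into WordPiece's word-internal `##` prefix is a simplifying assumption, not the authors' exact procedure:

```python
# Illustrative sketch only; paths are placeholders and the conversion rule
# is a simplification of how a SentencePiece vocabulary maps to WordPiece.
import sentencepiece as spm

# Step 1: train a SentencePiece model with a 32,000-token vocabulary.
spm.SentencePieceTrainer.train(
    input="latvian_corpus.txt",  # placeholder: pretraining text, one sentence per line
    model_prefix="lvbert_sp",
    vocab_size=32000,
)

# Step 2: rewrite the vocabulary into BERT's WordPiece format. SentencePiece
# marks word-initial pieces with "▁", while WordPiece marks word-internal
# pieces with "##", so the markers have to be swapped around.
with open("lvbert_sp.vocab", encoding="utf-8") as src, \
        open("vocab.txt", "w", encoding="utf-8") as dst:
    for line in src:
        piece = line.split("\t")[0]
        if piece == "▁":
            continue  # skip the bare word-boundary marker
        if piece.startswith("▁"):
            dst.write(piece[1:] + "\n")     # word-initial piece: drop the marker
        elif piece.startswith("<"):
            dst.write(piece + "\n")         # control symbols such as <unk>
        else:
            dst.write("##" + piece + "\n")  # word-internal piece: add "##"
```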
## Pretraining

We used the BERT-base configuration with 12 layers, 768 hidden units, 12 attention heads, a sequence length of 128, a mini-batch size of 128 and a 32,000-token vocabulary.
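
For orientation, these hyperparameters can be expressed as a Hugging Face Transformers configuration. This is a sketch rather than the original pretraining setup; in particular, sizing the position embeddings to the 128-token sequence length is an assumption:

```python
# Sketch of the stated hyperparameters; not the original pretraining code.
from transformers import BertConfig, BertForPreTraining

config = BertConfig(
    vocab_size=32000,             # 32,000-token vocabulary
    num_hidden_layers=12,         # 12 Transformer layers
    hidden_size=768,              # 768 hidden units
    num_attention_heads=12,       # 12 attention heads
    max_position_embeddings=128,  # assumption: matched to the 128-token sequences
)

# BertForPreTraining combines the two objectives named above: a masked
# language modeling head and a next sentence prediction head. (The
# mini-batch size of 128 belongs to the training loop, not the config.)
model = BertForPreTraining(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```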
## Citation

Please cite this paper if you use LVBERT: