---
license: apache-2.0
language:
- lv
tags:
- feature-extraction
- fill-mask
widgets:
- task: feature-extraction
  text: "Latvijā ir 10 valstspilsētas."
- task: fill-mask
  text: "Rīga ir Latvijas [MASK]."
---

# Latvian BERT base model (cased)

A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives. It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).

This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.

Developed at [AiLab.lv](https://ailab.lv)

## Training data

LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); 500M tokens in total.

## Tokenization

A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens. It was then converted to the WordPiece format used by BERT.

## Pretraining

We used the BERT-base configuration with 12 layers, 768 hidden units, 12 heads, 128 sequence length, 128 mini-batch size and 32k token vocabulary.

## Citation

Please cite this paper if you use LVBERT:

```bibtex
@inproceedings{Znotins-Barzdins:2020:BalticHLT,
  author    = {Arturs Znotins and Guntis Barzdins},
  title     = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}},
  booktitle = {Human Language Technologies - The Baltic Perspective},
  series    = {Frontiers in Artificial Intelligence and Applications},
  volume    = {328},
  publisher = {IOS Press},
  year      = {2020},
  pages     = {111-115},
  doi       = {10.3233/FAIA200610},
  url       = {http://ebooks.iospress.nl/volumearticle/55531}
}
```
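
As a rough illustration of what the configuration in the Pretraining section implies, the sketch below estimates the encoder's parameter count from the stated values (12 layers, 768 hidden units, 12 heads, 32k vocabulary). The feed-forward size of 3072, the 512 position embeddings, and the 2 segment types are assumptions taken from the standard BERT-base defaults; they are not stated in this card. The class and function names are hypothetical, and the count excludes the masked language modeling and next sentence prediction heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertBaseConfig:
    # Values stated in the model card.
    num_layers: int = 12
    hidden_size: int = 768
    num_heads: int = 12
    vocab_size: int = 32_000
    # Assumptions: standard BERT-base defaults, not stated in the card.
    intermediate_size: int = 3072
    max_position_embeddings: int = 512
    type_vocab_size: int = 2


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(size: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


def count_encoder_parameters(cfg: BertBaseConfig) -> int:
    """Estimate parameters of the embeddings, encoder stack, and pooler."""
    h = cfg.hidden_size
    # Token, position, and segment embedding tables, followed by a LayerNorm.
    embeddings = (
        cfg.vocab_size + cfg.max_position_embeddings + cfg.type_vocab_size
    ) * h + layer_norm_params(h)
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * linear_params(h, h) + layer_norm_params(h)
    # Two dense layers (expand, then project back), followed by a LayerNorm.
    feed_forward = (
        linear_params(h, cfg.intermediate_size)
        + linear_params(cfg.intermediate_size, h)
        + layer_norm_params(h)
    )
    # Dense layer applied to the [CLS] representation.
    pooler = linear_params(h, h)
    return embeddings + cfg.num_layers * (attention + feed_forward) + pooler


cfg = BertBaseConfig()
head_dim = cfg.hidden_size // cfg.num_heads  # per-head dimension
total = count_encoder_parameters(cfg)
print(f"head dimension: {head_dim}")
print(f"estimated encoder parameters: {total:,}")
```

Under these assumptions the estimate comes to about 110.6M parameters, of which the 32k-token embedding table alone contributes roughly 24.6M. The number of heads does not change the total; it only determines how the 768 hidden units are split across heads.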