---
license: apache-2.0
language:
- lv
---

# Latvian BERT base model (cased)

A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives. It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).

This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.

Developed at [AiLab.lv](https://ailab.lv).

## Training data

LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); 500M tokens in total.

## Tokenization

A SentencePiece model was trained on the pretraining texts, producing a vocabulary of 32,000 tokens. It was then converted to the WordPiece format used by BERT.

## Pretraining

We used the BERT-base configuration: 12 layers, 768 hidden units, 12 attention heads, a maximum sequence length of 512, a mini-batch size of 128, and the 32k-token vocabulary described above.

## Citation

Please cite this paper if you use LVBERT:

```bibtex
@inproceedings{Znotins-Barzdins:2020:BalticHLT,
  author    = {Arturs Znotins and Guntis Barzdins},
  title     = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}},
  booktitle = {Human Language Technologies - The Baltic Perspective},
  series    = {Frontiers in Artificial Intelligence and Applications},
  volume    = {328},
  publisher = {IOS Press},
  year      = {2020},
  pages     = {111--115},
  doi       = {10.3233/FAIA200610},
  url       = {http://ebooks.iospress.nl/volumearticle/55531}
}
```
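
## Usage

A minimal sketch of loading the model for masked language modeling with the Hugging Face Transformers library. The Hub repository name `AiLab-IMCS-UL/lvbert` and the example sentence are assumptions, not part of the original release; substitute the identifier of the checkpoint you actually use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hub identifier; replace with the actual repository name if it differs.
model_name = "AiLab-IMCS-UL/lvbert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict the masked token in a Latvian sentence. The model is cased,
# so keep the original capitalization of the input.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = f"Rīga ir Latvijas {tokenizer.mask_token}."
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```

For downstream NLU tasks, the same checkpoint can instead be loaded with a task-specific head, e.g. `AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=...)`, and fine-tuned as usual.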