---
license: apache-2.0
language:
- lv
---
# Latvian BERT base model (cased)
A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives.
It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).
This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.
Developed at [AiLab.lv](https://ailab.lv).
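
Before fine-tuning, the pretrained masked language modeling head can be sanity-checked. The following is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline. The model identifier `AiLab-IMCS-UL/lvbert` and the helper name `fill_mask_lv` are assumptions made for illustration; substitute the identifier under which you access the model.

```python
def fill_mask_lv(text, model_id="AiLab-IMCS-UL/lvbert", top_k=5):
    """Return the top_k (token, score) predictions for the masked position.

    `model_id` is an assumed Hub identifier, not confirmed by this README.
    `text` must contain exactly one occurrence of the tokenizer's mask token
    (normally "[MASK]" for BERT-style vocabularies).
    """
    # Imported lazily so that merely defining the helper does not require
    # transformers to be installed or the model to be downloaded.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    predictions = fill(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For example, `fill_mask_lv("Rīga ir Latvijas [MASK].")` would download the model on first use and return candidate tokens for the masked word together with their scores.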
## Training data
LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs), totaling 500M tokens.
## Tokenization
A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens.
It was then converted to the WordPiece format used by BERT.
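
The README does not describe the conversion procedure itself, so the following is only a sketch of one common way to map a SentencePiece vocabulary to WordPiece conventions, not necessarily the method the authors used. It assumes that SentencePiece marks word-initial pieces with `▁` while WordPiece marks word-internal pieces with `##`. The function name, the special-token list, and the toy input are illustrative assumptions.

```python
SPM_SPACE = "\u2581"  # the "▁" marker SentencePiece puts on word-initial pieces

# Special tokens conventionally placed at the start of a BERT vocabulary
# (assumed here; the README does not list them).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# SentencePiece control symbols that have no WordPiece counterpart.
SPM_CONTROL = {"<unk>", "<s>", "</s>"}


def sentencepiece_to_wordpiece(pieces):
    """Map SentencePiece pieces to a WordPiece-style vocabulary list."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece in SPM_CONTROL:
            continue
        if piece.startswith(SPM_SPACE):
            # Word-initial piece: drop the marker, no "##" prefix.
            token = piece[len(SPM_SPACE):]
        else:
            # Word-internal piece: WordPiece marks it with "##".
            token = "##" + piece
        # Skip the bare "▁" piece (empty after stripping) and duplicates.
        if token and token not in seen:
            seen.add(token)
            vocab.append(token)
    return vocab


toy_pieces = ["<unk>", "<s>", "</s>", "\u2581Rīga", "s", "\u2581", "\u2581ir"]
toy_vocab = sentencepiece_to_wordpiece(toy_pieces)
```

Because special tokens are added and some pieces may be dropped or merged, the converted vocabulary size can differ slightly from the SentencePiece one, so a real conversion would need to be checked against the target of 32,000 tokens.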
## Pretraining
We used the BERT-base configuration: 12 layers, a hidden size of 768, 12 attention heads, a maximum sequence length of 128, a mini-batch size of 128, and a 32k-token vocabulary.
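
To give a sense of the model size these settings imply, here is a rough parameter-count sketch. The feed-forward size of 3072, the 512 position embeddings, and the 2 segment types are the standard BERT-base values and are assumed here; the README does not state them. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertBaseConfig:
    vocab_size: int = 32_000             # from the README
    hidden_size: int = 768               # from the README
    num_layers: int = 12                 # from the README
    num_heads: int = 12                  # from the README
    intermediate_size: int = 3072        # assumption: standard BERT-base
    max_position_embeddings: int = 512   # assumption: standard BERT-base
    type_vocab_size: int = 2             # assumption: standard BERT-base


def count_parameters(cfg: BertBaseConfig) -> int:
    """Count encoder parameters (embeddings, layers, pooler), excluding task heads."""
    h, ff = cfg.hidden_size, cfg.intermediate_size
    layer_norm = 2 * h  # scale and bias

    embeddings = (
        cfg.vocab_size * h
        + cfg.max_position_embeddings * h
        + cfg.type_vocab_size * h
        + layer_norm
    )
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V, output projections
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    pooler = h * h + h

    return embeddings + cfg.num_layers * (attention + feed_forward) + pooler


cfg = BertBaseConfig()
head_dim = cfg.hidden_size // cfg.num_heads  # per-head dimension
total_params = count_parameters(cfg)
```

Under these assumptions the encoder has about 110.6M parameters, with each attention head operating on 64 dimensions. The sequence length of 128 and the batch size of 128 affect training cost rather than the parameter count.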
## Citation
Please cite this paper if you use LVBERT:
```bibtex
@inproceedings{Znotins-Barzdins:2020:BalticHLT,
author = {Arturs Znotins and Guntis Barzdins},
title = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}},
booktitle = {Human Language Technologies - The Baltic Perspective},
series = {Frontiers in Artificial Intelligence and Applications},
volume = {328},
publisher = {IOS Press},
year = {2020},
pages = {111--115},
doi = {10.3233/FAIA200610},
url = {http://ebooks.iospress.nl/volumearticle/55531}
}
```