|
---
license: apache-2.0
language:
- lv
---
|
|
|
# Latvian BERT base model (cased) |
|
|
|
A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives. |
|
It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT). |
|
|
|
This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks. |
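
Because the model was pretrained with a masked language modeling objective, it can be tried out directly with the Hugging Face `fill-mask` pipeline before any fine-tuning. A minimal sketch; the model id below is a placeholder and should be replaced with this repository's actual id on the Hub:

```python
from transformers import pipeline

# Placeholder model id; substitute this repository's id on the Hugging Face Hub.
MODEL_ID = "AiLab-IMCS-UL/lvbert"

# Masked language modeling: the model predicts the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# "Rīga ir Latvijas [MASK]." -> "Riga is Latvia's [MASK]."
for prediction in fill_mask("Rīga ir Latvijas [MASK]."):
    print(f"{prediction['token_str']}\t{prediction['score']:.3f}")
```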
|
|
|
Developed at [AiLab.lv](https://ailab.lv).
|
|
|
## Training data |
|
|
|
LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs), totalling 500M tokens.
|
|
|
## Tokenization |
|
|
|
A SentencePiece model was trained on the pretraining corpus, producing a vocabulary of 32,000 tokens.

It was then converted to the WordPiece format used by BERT.
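
A quick way to inspect the resulting WordPiece vocabulary is to load the tokenizer and tokenize a Latvian sentence. A sketch, again using a placeholder model id:

```python
from transformers import AutoTokenizer

# Placeholder model id; substitute this repository's id on the Hub.
tokenizer = AutoTokenizer.from_pretrained("AiLab-IMCS-UL/lvbert")

print(tokenizer.vocab_size)  # expected: 32000
# Words outside the vocabulary are split into subword pieces prefixed with "##".
print(tokenizer.tokenize("Daugava ir garākā upe Latvijā."))
```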
|
|
|
## Pretraining |
|
|
|
We used the BERT-base configuration: 12 layers, 768 hidden units, 12 attention heads, a maximum sequence length of 128, a mini-batch size of 128, and a 32,000-token vocabulary.
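
For reference, these hyperparameters map onto a Hugging Face `BertConfig` roughly as sketched below. The mini-batch size is a training-loop setting rather than a config field, and tying `max_position_embeddings` to the pretraining sequence length is an assumption here (the released checkpoint may allow longer inputs):

```python
from transformers import BertConfig

# BERT-base shape as described in this section.
config = BertConfig(
    vocab_size=32_000,            # 32k-token vocabulary
    num_hidden_layers=12,         # 12 Transformer layers
    hidden_size=768,              # 768 hidden units
    num_attention_heads=12,       # 12 attention heads
    intermediate_size=3072,       # standard BERT-base feed-forward size
    max_position_embeddings=128,  # assumed to match the pretraining sequence length
)
```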
|
|
|
## Citation |
|
|
|
Please cite this paper if you use LVBERT: |
|
|
|
```bibtex
@inproceedings{Znotins-Barzdins:2020:BalticHLT,
  author    = {Arturs Znotins and Guntis Barzdins},
  title     = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}},
  booktitle = {Human Language Technologies - The Baltic Perspective},
  series    = {Frontiers in Artificial Intelligence and Applications},
  volume    = {328},
  publisher = {IOS Press},
  year      = {2020},
  pages     = {111--115},
  doi       = {10.3233/FAIA200610},
  url       = {http://ebooks.iospress.nl/volumearticle/55531}
}
```
|
|