---
license: apache-2.0
language:
- lv
---

# Latvian BERT base model (cased)

A BERT model pretrained on Latvian language data using the masked language modeling and next sentence prediction objectives. 
It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via a [GitHub repository](https://github.com/LUMII-AILab/LVBERT). 
This Hugging Face repository contains an improved version of LVBERT.

This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding tasks such as text classification, named entity recognition, and question answering.
However, the model can also be used as is to compute contextual embeddings for tasks such as text similarity, clustering, and semantic search.
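
As a rough illustration of the embedding use case, the sketch below mean-pools per-token vectors into a single sentence vector and compares two sentences by cosine similarity. Mean pooling is an assumption here, not something this model card prescribes, and the hand-written 3-dimensional vectors only stand in for the 768-dimensional hidden states the model would produce.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for model outputs (assumption: real vectors come from LVBERT).
# The third vector of the first sentence is padding and is masked out.
sentence_a = mean_pool([[1.0, 0.0, 1.0], [3.0, 0.0, 1.0], [9.0, 9.0, 9.0]], [1, 1, 0])
sentence_b = mean_pool([[2.0, 0.0, 1.0]], [1])
similarity = cosine_similarity(sentence_a, sentence_b)
```

Other pooling choices, such as taking the `[CLS]` vector, are equally plausible; which works best for similarity or clustering is task-dependent.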

## Training data

LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs), around 500M tokens in total.

## Tokenization

A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens. 
It was then converted to the WordPiece format used by BERT.
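
The conversion script itself is not described here, so the following is only a minimal sketch of how such a mapping is commonly done. It assumes that SentencePiece marks word-initial pieces with `▁`, that WordPiece marks word-internal pieces with `##`, and that the standard BERT special tokens are placed first. The function name and the example pieces are hypothetical.

```python
SP_WORD_START = "\u2581"  # the "▁" marker SentencePiece puts at the start of a word
BERT_SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def sentencepiece_to_wordpiece(pieces):
    """Map SentencePiece pieces to a WordPiece-style vocabulary list (illustrative)."""
    vocab = list(BERT_SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        # Heuristic: skip SentencePiece control symbols such as <unk>, <s>, </s>.
        if piece.startswith("<") and piece.endswith(">"):
            continue
        if piece.startswith(SP_WORD_START):
            # Word-initial piece: drop the marker, keep the text as is.
            token = piece[len(SP_WORD_START):]
        else:
            # Word-internal piece: add the WordPiece continuation prefix.
            token = "##" + piece
        # Drop empty tokens (a bare "▁") and duplicates.
        if token and token not in seen:
            seen.add(token)
            vocab.append(token)
    return vocab


example_vocab = sentencepiece_to_wordpiece(["<unk>", "▁", "▁Latvija", "s", "▁valoda"])
```

Because empty and duplicate tokens are dropped and special tokens are added, a conversion like this does not necessarily preserve the exact vocabulary size of the source model.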

## Pretraining

We used the BERT-base configuration with 12 layers, 768 hidden units, 12 attention heads, a maximum sequence length of 512, a mini-batch size of 128, and a 32k-token vocabulary.
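
To give a sense of the model size this configuration implies, the sketch below counts encoder parameters from the numbers above. Two values are assumptions taken from the standard BERT-base layout rather than from this card: a feed-forward size of 4 × hidden (3072) and 2 segment (token type) embeddings. The count covers the embeddings, encoder layers, and pooler, not the pretraining heads.

```python
def bert_parameter_count(vocab_size, hidden, layers, max_positions,
                         intermediate=None, type_vocab=2):
    """Count weights and biases of a BERT encoder with pooler (illustrative)."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumption: standard BERT feed-forward size
    layer_norm = 2 * hidden  # scale and bias

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (expand, then project back), followed by one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_parameter_count(vocab_size=32_000, hidden=768, layers=12, max_positions=512)
print(f"{total:,}")  # 110,617,344
```

The number of attention heads does not appear in the formula because splitting the 768 hidden units across 12 heads leaves the projection matrix sizes unchanged.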

## Citation

Please cite this paper if you use LVBERT:

```bibtex
@inproceedings{Znotins-Barzdins:2020:BalticHLT,
  author = {Arturs Znotins and Guntis Barzdins},
  title = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}},
  booktitle = {Human Language Technologies - The Baltic Perspective},
  series = {Frontiers in Artificial Intelligence and Applications},
  volume = {328},
  publisher = {IOS Press},
  year = {2020},
  pages = {111-115},
  doi = {10.3233/FAIA200610},
  url = {http://ebooks.iospress.nl/volumearticle/55531}
}
```