AiLab-IMCS-UL/lvbert

	---
	license: apache-2.0
	language:
	- lv
	---

	# Latvian BERT base model (cased)

	A BERT model pretrained on the Latvian language using the masked language modeling and next sentence prediction objectives.
	It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [this repository](https://github.com/LUMII-AILab/LVBERT).

	This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding (NLU) tasks.

	Developed at [AiLab.lv](https://ailab.lv)

	## Training data

	LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), the [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), and the [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs), 500M tokens in total.

	## Tokenization

	A SentencePiece model was trained on the training dataset, producing a vocabulary of 32,000 tokens.
	It was then converted to the WordPiece format used by BERT.
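The conversion step can be illustrated with a minimal sketch. It assumes the usual mapping between the two formats: SentencePiece marks word-initial pieces with `▁`, while WordPiece marks word-internal pieces with `##`. The toy pieces, the `spm_to_wordpiece` and `wordpiece_tokenize` helpers, and the list of special tokens are hypothetical and are not taken from the actual LVBERT conversion script.

```python
# Hypothetical sketch: the toy pieces and the conversion rule are assumptions,
# not the actual LVBERT conversion script.
SPM_SPACE = "\u2581"  # SentencePiece's word-boundary marker
BERT_SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def spm_to_wordpiece(pieces):
    """Map SentencePiece pieces to WordPiece-style vocabulary entries."""
    vocab = list(BERT_SPECIALS)
    seen = set(vocab)
    for piece in pieces:
        if piece == SPM_SPACE:
            continue  # a bare boundary marker has no WordPiece counterpart
        if piece.startswith(SPM_SPACE):
            token = piece[len(SPM_SPACE):]  # word-initial piece: drop the marker
        else:
            token = "##" + piece  # word-internal piece: add the continuation prefix
        if token not in seen:
            seen.add(token)
            vocab.append(token)
    return vocab


def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    vocab = set(vocab)
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece covers this position
        tokens.append(match)
        start = end
    return tokens


toy_pieces = [SPM_SPACE, SPM_SPACE + "valod", "a", "as",
              SPM_SPACE + "latvie", "šu", SPM_SPACE + "un"]
toy_vocab = spm_to_wordpiece(toy_pieces)
print(toy_vocab)
print(wordpiece_tokenize("valodas", toy_vocab))
print(wordpiece_tokenize("latviešu", toy_vocab))
```

The sketch ignores pieces that contain `▁` in the middle and any SentencePiece control symbols, which a real conversion would also need to handle.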

	## Pretraining

	We used the BERT-base configuration: 12 layers, a hidden size of 768, 12 attention heads, a sequence length of 128, a mini-batch size of 128, and a 32k-token vocabulary.
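To give a sense of scale, the following sketch estimates the parameter count and the number of tokens per mini-batch implied by this configuration. The feed-forward size (3072), the maximum number of positions (512), and the two segment types are the standard BERT-base values and are assumptions here, since the model card does not state them. The count covers the encoder and pooler only, not the pretraining heads.

```python
# Values stated in the model card.
NUM_LAYERS = 12
HIDDEN = 768
VOCAB = 32_000
SEQ_LEN = 128
BATCH = 128

# Assumed: standard BERT-base values that the model card does not state.
INTERMEDIATE = 3072
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def bert_parameter_count():
    embeddings = (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)  # Q, K, V, output
    feed_forward = (linear(HIDDEN, INTERMEDIATE)
                    + linear(INTERMEDIATE, HIDDEN)
                    + layer_norm(HIDDEN))
    pooler = linear(HIDDEN, HIDDEN)
    return embeddings + NUM_LAYERS * (attention + feed_forward) + pooler


total_params = bert_parameter_count()
# Upper bound: assumes every sequence in the batch is filled to full length.
tokens_per_batch = SEQ_LEN * BATCH

print(f"parameters: {total_params:,}")
print(f"tokens per mini-batch: {tokens_per_batch:,}")
```

Under these assumptions the estimate is about 110.6M parameters and at most 16,384 tokens per mini-batch. The number of attention heads does not change the count, because the heads split the same 768-dimensional projections.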

	## Citation

	Please cite this paper if you use LVBERT:

	```bibtex
	@inproceedings{Znotins-Barzdins:2020:BalticHLT,
	author = {Arturs Znotins and Guntis Barzdins},
	title = {{LVBERT: Transformer-Based Model for Latvian Language Understanding}},
	booktitle = {Human Language Technologies - The Baltic Perspective},
	series = {Frontiers in Artificial Intelligence and Applications},
	volume = {328},
	publisher = {IOS Press},
	year = {2020},
	pages = {111--115},
	doi = {10.3233/FAIA200610},
	url = {http://ebooks.iospress.nl/volumearticle/55531}
	}
	```