	---
	language:
	- sl

	license: cc-by-sa-4.0
	---

	# SloBERTa
	SloBERTa is a monolingual Slovene BERT-like model. It is closely related to the French CamemBERT model (https://camembert-model.fr/). The training corpora contain 3.47 billion tokens in total, and the subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available at https://github.com/clarinsi/Slovene-BERT-Tool

	SloBERTa was trained for 200,000 iterations or about 98 epochs.
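
As a rough back-of-the-envelope check, the figures above imply how many corpus tokens were processed per iteration. This is only a sketch: it assumes that one epoch is a single pass over the full 3.47-billion-token corpus, and the variable names are illustrative. The actual batch size and sequence length are not stated in this card.

```python
# Figures taken from the model card.
corpus_tokens = 3_470_000_000   # total tokens in the training corpora
iterations = 200_000            # training iterations
epochs = 98                     # approximate number of epochs

# Assumption: one epoch = one pass over all corpus tokens.
tokens_seen = corpus_tokens * epochs
tokens_per_iteration = tokens_seen / iterations

print(f"about {tokens_per_iteration / 1e6:.2f} million tokens per iteration")
```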

	## Corpora
	The following corpora were used for training the model:
	* Gigafida 2.0
	* Kas 1.0
	* Janes 1.0 (only Janes-news, Janes-forum, Janes-blog, Janes-wiki subcorpora)
	* Slovenian parliamentary corpus siParl 2.0
	* slWaC

	## Usage
	Load the model and tokenizer with the transformers library:
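	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	# use_fast=False selects the slow tokenizer; see the note below.
	tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=False)
	# Load the pretrained weights with a masked-language-modeling head.
	model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
	```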
	Note: it is currently critical to pass the `use_fast=False` parameter when loading the tokenizer. By default, transformers attempts to load a fast tokenizer, which will load without raising an error but will not correctly map tokens to their IDs, so performance on any task will be extremely poor.
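
One way to see the problem described in the note is to compare the IDs produced by the slow and fast tokenizers on the same text. The following is a minimal sketch, not part of the original instructions: `load_tokenizers` and `find_id_mismatches` are hypothetical helper names, and it assumes that the `transformers` and `sentencepiece` packages are installed and that the Hub is reachable when the tokenizers are loaded.

```python
def load_tokenizers(model_id="EMBEDDIA/sloberta"):
    """Load the slow and the fast tokenizer (downloads files from the Hub)."""
    # Imported lazily so the comparison helper below has no hard dependency.
    from transformers import AutoTokenizer

    slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    return slow, fast


def find_id_mismatches(slow, fast, sentences):
    """Return (sentence, slow_ids, fast_ids) for every sentence whose IDs differ."""
    mismatches = []
    for sentence in sentences:
        slow_ids = list(slow(sentence)["input_ids"])
        fast_ids = list(fast(sentence)["input_ids"])
        if slow_ids != fast_ids:
            mismatches.append((sentence, slow_ids, fast_ids))
    return mismatches


# Example usage (requires network access, so it is left commented out):
# slow, fast = load_tokenizers()
# for sentence, slow_ids, fast_ids in find_id_mismatches(
#     slow, fast, ["Ljubljana je glavno mesto Slovenije."]
# ):
#     print(sentence, slow_ids, fast_ids)
```

If the list returned by `find_id_mismatches` is non-empty, the two tokenizers disagree, and the slow tokenizer should be the one used with the model.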