---
language:
- sl
- en
- multilingual
tags:
- generated_from_trainer
license: cc-by-sa-4.0
---
# SloBERTa-SlEng
SloBERTa-SlEng is a masked language model based on the Slovene [SloBERTa](https://huggingface.co/EMBEDDIA/sloberta) model.
It replaces SloBERTa's tokenizer, vocabulary, and embedding layer.
The new tokenizer and vocabulary are bilingual (Slovene-English) and were built from the conversational, non-standard, and slang language the model was trained on.
They are the same as in the [SlEng-bert](https://huggingface.co/cjvt/sleng-bert) model.
The new embedding weights were initialized from the SloBERTa embeddings.
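
The card does not say how the SloBERTa embeddings were mapped onto the new vocabulary. As a rough illustration only, the sketch below shows one common strategy, which is an assumption here and not necessarily what the authors did: tokens that exist in both vocabularies copy their old vector, and all other tokens start from the mean of the old vectors plus optional small noise. The function name `init_embeddings`, its parameters, and the toy vocabularies are hypothetical.

```python
import random
from typing import Dict, List

Vector = List[float]


def init_embeddings(
    old_vocab: Dict[str, int],
    old_embeddings: List[Vector],
    new_vocab: Dict[str, int],
    noise_std: float = 0.02,
    seed: int = 0,
) -> List[Vector]:
    """Build an embedding table for new_vocab from an existing table.

    Assumed strategy (not confirmed by the model card):
    - tokens present in both vocabularies copy their old vector;
    - all other tokens start from the mean old vector plus Gaussian noise.
    new_vocab ids are assumed to be contiguous, 0 .. len(new_vocab) - 1.
    """
    rng = random.Random(seed)
    dim = len(old_embeddings[0])
    # Mean of all old vectors: a neutral starting point for unseen tokens.
    mean = [
        sum(row[d] for row in old_embeddings) / len(old_embeddings)
        for d in range(dim)
    ]

    new_embeddings: List[Vector] = [[0.0] * dim for _ in range(len(new_vocab))]
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            # Shared token: reuse the pre-trained vector unchanged.
            new_embeddings[new_id] = list(old_embeddings[old_vocab[token]])
        else:
            # New token: mean vector plus noise (no noise if noise_std is 0).
            new_embeddings[new_id] = [m + rng.gauss(0.0, noise_std) for m in mean]
    return new_embeddings


# Toy example with 2-dimensional vectors and made-up vocabularies.
old_vocab = {"<s>": 0, "▁dan": 1, "▁je": 2}
old_embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
new_vocab = {"<s>": 0, "▁je": 1, "▁lol": 2}

new_embeddings = init_embeddings(old_vocab, old_embeddings, new_vocab, noise_std=0.0)
```

In a real setting the same idea would be applied to the model's embedding matrix as a tensor; plain lists are used here only to keep the sketch dependency-free.
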
SloBERTa-SlEng is thus the SloBERTa model further pre-trained for two epochs on conversational English and Slovene corpora,
the same ones used for the [SlEng-bert](https://huggingface.co/cjvt/sleng-bert) model.
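
Since this is a masked language model, it can presumably be queried through the `fill-mask` pipeline of the Transformers library. The following is a minimal sketch, not an official usage recipe. The repository id `cjvt/sloberta-sleng` is an assumption (this card does not state it), and the helper names `format_predictions` and `predict_masked` are hypothetical.

```python
from typing import List, Tuple

# Assumption: the Hub repository id is not given in this card.
MODEL_ID = "cjvt/sloberta-sleng"


def format_predictions(predictions: List[dict]) -> List[Tuple[str, float]]:
    """Reduce fill-mask pipeline output to (token, rounded score) pairs."""
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]


def predict_masked(template: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Fill the single {mask} placeholder in template with the model's top guesses."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of hard-coding "<mask>" or "[MASK]".
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return format_predictions(fill_mask(text, top_k=top_k))
```

Calling `predict_masked("Danes je res {mask} dan.")` downloads the model on first use and returns the highest-scoring candidate tokens for the masked position, assuming the repository id above is correct.
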
## Training corpora
The model was trained on English and Slovene tweets, the Slovene corpora [MaCoCu](http://hdl.handle.net/11356/1517) and [Frenk](http://hdl.handle.net/11356/1201),
and a small subset of the English [Oscar](https://huggingface.co/datasets/oscar) corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible.
In total, the training corpora contain about 2.7 billion words.
### Framework versions
- Transformers 4.22.0.dev0
- Pytorch 1.13.0a0+d321be6
- Datasets 2.4.0
- Tokenizers 0.12.1
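
To compare a local environment against the versions listed above, a small check along these lines may help. It is only a sketch: the mapping to PyPI distribution names (for example `torch` for Pytorch) and the helper name `installed_versions` are assumptions, and newer library versions may well work too.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# Versions listed in this model card, keyed by assumed PyPI distribution name.
CARD_VERSIONS = {
    "transformers": "4.22.0.dev0",
    "torch": "1.13.0a0+d321be6",
    "datasets": "2.4.0",
    "tokenizers": "0.12.1",
}


def installed_versions() -> Dict[str, Optional[str]]:
    """Return the installed version of each listed package, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in CARD_VERSIONS:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, have in installed_versions().items():
    print(f"{name}: card={CARD_VERSIONS[name]} installed={have}")
```
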