|
--- |
|
language: |
|
- is |
|
library_name: transformers |
|
--- |
|
# Icebreaker tokenizer |
|
|
|
This is a BPE tokenizer trained on the News 1 subset of the Icelandic Gigaword Corpus (IGC). The tokenizer can be used for training Icelandic language models.
|
|
|
## Model Details |
|
|
|
BPE tokenizer, trained on the first 242,553 files of the unannotated News 1 subset of the IGC 2022 dataset, published by Árnastofnun.
|
|
|
### Model Description |
|
|
|
The tokenizer has a vocabulary size of 3,200. A rough training sketch follows the details list below.
|
|
|
|
|
- **Developed by:** Sigurdur Haukur Birgisson |
|
- **Model type:** GPT2Tokenizer |
|
- **Language(s) (NLP):** Icelandic |
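
The exact training script lives in the repository linked under Model Sources; purely as an illustration, a comparable byte-level BPE tokenizer could be trained with the Hugging Face `tokenizers` library roughly as follows (the file paths, `min_frequency`, and special token here are assumptions, not the settings actually used):

```py
from tokenizers import ByteLevelBPETokenizer

# Hypothetical plain-text training files; replace with the IGC News 1 files
files = ["igc_news1/part_000.txt", "igc_news1/part_001.txt"]

# Train a byte-level BPE tokenizer with the same vocabulary size as Icebreaker
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=3200,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# Writes vocab.json and merges.txt, which a GPT-2-style tokenizer can load
tokenizer.save_model("icebreaker-tokenizer")
```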
|
|
|
### Model Sources |
|
|
|
- **Repository:** https://github.com/sigurdurhaukur/tokenicer |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to load the tokenizer and tokenize Icelandic text.
|
|
|
```py |
|
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")

# Tokenize an Icelandic sentence ("Hello world!")
tokens = tokenizer("Halló heimur!")
|
``` |
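
The call returns a `BatchEncoding` containing `input_ids` and an `attention_mask`. The standard `transformers` helpers below show how to inspect the resulting subword tokens and recover the text:

```py
ids = tokens["input_ids"]

# Map ids back to their subword strings
print(tokenizer.convert_ids_to_tokens(ids))

# Decode the ids back into text
print(tokenizer.decode(ids))
```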
|
|
|
## Model Card Contact |
|
|
|
Sigurdur Haukur Birgisson: [email protected]