---
language:
- pl
- cs
- ru
tags:
- mT5
- lemmatization
license: apache-2.0
---

# SlavLemma Large

SlavLemma models are intended for lemmatization of named entities and multi-word expressions in Polish, Czech, and Russian.

They were fine-tuned from the google/mT5 models, e.g. [google/mt5-large](https://huggingface.co/google/mt5-large).

## Usage

When using the model, prepend one of the language tokens (`>>pl<<`, `>>cs<<`, `>>ru<<`) to the input, based on the language of the phrase you want to lemmatize.

Sample usage:

```python
from transformers import pipeline

# Load the model and its tokenizer from the Hugging Face Hub.
pipe = pipeline(task="text2text-generation", model="amu-cai/slavlemma-large", tokenizer="amu-cai/slavlemma-large")

# Prefix the phrase with its language token (here Polish) and read the generated lemma.
hyp = [res['generated_text'] for res in pipe([">>pl<< federalnego urzędu statystycznego"], clean_up_tokenization_spaces=True, num_beams=5)][0]
```
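
When lemmatizing many phrases, it may help to wrap the token-prepending step in a small helper. The sketch below is one possible way to do that; `LANGUAGE_TOKENS`, `add_language_token`, and `lemmatize` are hypothetical names introduced here for illustration and are not part of the model or of `transformers`. It assumes a pipeline object created as in the sample above.

```python
# Hypothetical helper names; only the three language tokens come from the model card.
LANGUAGE_TOKENS = {"pl": ">>pl<<", "cs": ">>cs<<", "ru": ">>ru<<"}


def add_language_token(phrase, lang):
    """Return the phrase prefixed with the language token for `lang`."""
    if lang not in LANGUAGE_TOKENS:
        raise ValueError(f"Unsupported language: {lang!r}")
    return f"{LANGUAGE_TOKENS[lang]} {phrase.strip()}"


def lemmatize(pipe, phrases, lang, num_beams=5):
    """Lemmatize phrases of a single language with an already-loaded pipeline."""
    if not phrases:
        return []
    inputs = [add_language_token(phrase, lang) for phrase in phrases]
    # Same generation arguments as in the sample usage.
    outputs = pipe(inputs, clean_up_tokenization_spaces=True, num_beams=num_beams)
    return [out["generated_text"] for out in outputs]
```

With the `pipe` object from the sample usage, a call would look like `lemmatize(pipe, ["federalnego urzędu statystycznego"], "pl")`. Passing the language explicitly keeps the helper from guessing the language of a phrase, which the model itself does not do.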

## Evaluation results

Lemmatization Exact Match was computed on the SlavNER 2021 test sets (COVID-19 and USA 2020 Elections).
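
For orientation, an exact-match score of this kind can be sketched as the percentage of predicted lemmas that are string-identical to the reference lemmas. The function below is an illustrative assumption, not the scorer used for the reported numbers: this card does not state how case or whitespace were handled, so the sketch compares strings strictly after trimming surrounding whitespace. The name `exact_match` is hypothetical.

```python
def exact_match(predictions, references):
    """Percentage of predictions identical to their reference lemma.

    Assumption: strict, case-sensitive comparison after stripping
    surrounding whitespace; the official evaluation may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

For example, three correct lemmas out of four give `exact_match(["a", "b", "c", "d"], ["a", "b", "x", "d"]) == 75.0`.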

COVID-19:

| Model | pl | cs | ru |
| :------ | ------: | ------: | ------: |
| [slavlemma-large](https://huggingface.co/amu-cai/slavlemma-large) | 93.76 | 89.80 | 77.30 |
| [slavlemma-base](https://huggingface.co/amu-cai/slavlemma-base) | 91.00 | 86.29 | 76.10 |
| [slavlemma-small](https://huggingface.co/amu-cai/slavlemma-small) | 86.80 | 80.98 | 73.83 |

USA 2020 Elections:

| Model | pl | cs | ru |
| :------ | ------: | ------: | ------: |
| [slavlemma-large](https://huggingface.co/amu-cai/slavlemma-large) | 89.12 | 87.27 | 82.50 |
| [slavlemma-base](https://huggingface.co/amu-cai/slavlemma-base) | 84.19 | 81.97 | 80.27 |
| [slavlemma-small](https://huggingface.co/amu-cai/slavlemma-small) | 78.85 | 75.86 | 76.18 |

## Citation

If you use the model, please cite the following paper:

TBD

### Framework versions

- Transformers 4.26.0
- PyTorch 1.13.1.post200
- Datasets 2.9.0
- Tokenizers 0.13.2