metadata
language:
- pl
- cs
- ru
tags:
- mT5
- lemmatization
license: apache-2.0
SlavLemma Large
SlavLemma models are intended for lemmatization of named entities and multi-word expressions in Polish, Czech, and Russian.
They were fine-tuned from Google's mT5 models, e.g. google/mt5-large.
Usage
When using the model, prepend one of the language tokens (>>pl<<, >>cs<<, >>ru<<) to the input, based on the language of the phrase you want to lemmatize.
Sample usage:
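from transformers import pipeline

# Load the lemmatization pipeline; the tokenizer comes from the same checkpoint.
pipe = pipeline(task="text2text-generation", model="amu-cai/slavlemma-large", tokenizer="amu-cai/slavlemma-large")
# Prefix the phrase with its language token, then read the lemma generated for the single input.
results = pipe([">>pl<< federalnego urzędu statystycznego"], clean_up_tokenization_spaces=True, num_beams=5)
hyp = results[0]["generated_text"]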
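When lemmatizing many phrases, it can help to wrap the token prefixing in a small helper. The following is a minimal sketch, not part of the released model code: `LANGUAGE_TOKENS`, `add_language_token`, and `lemmatize` are hypothetical names, and `pipe` is assumed to be a pipeline built as in the sample above.

```python
from typing import Callable, List

# Language tokens listed in this model card.
LANGUAGE_TOKENS = {"pl": ">>pl<<", "cs": ">>cs<<", "ru": ">>ru<<"}


def add_language_token(phrase: str, lang: str) -> str:
    """Prefix a phrase with the token for its language (hypothetical helper)."""
    if lang not in LANGUAGE_TOKENS:
        raise ValueError(f"Unsupported language: {lang!r}")
    return f"{LANGUAGE_TOKENS[lang]} {phrase.strip()}"


def lemmatize(pipe: Callable, phrases: List[str], lang: str, num_beams: int = 5) -> List[str]:
    """Lemmatize phrases of one language with an already constructed pipeline."""
    inputs = [add_language_token(p, lang) for p in phrases]
    # Same generation arguments as in the sample usage.
    outputs = pipe(inputs, clean_up_tokenization_spaces=True, num_beams=num_beams)
    return [out["generated_text"] for out in outputs]
```

With the pipeline from the sample, `lemmatize(pipe, ["federalnego urzędu statystycznego"], "pl")` would return a list with one lemmatized phrase. Passing the pipeline in as an argument keeps the helper independent of how the model is loaded.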
Evaluation results
Lemmatization Exact Match was computed on the SlavNER 2021 test sets (COVID-19 and USA 2020 Elections).
COVID-19:
Model | pl | cs | ru |
---|---|---|---|
slavlemma-large | 93.76 | 89.80 | 77.30 |
slavlemma-base | 91.00 | 86.29 | 76.10 |
slavlemma-small | 86.80 | 80.98 | 73.83 |
USA 2020 Elections:
Model | pl | cs | ru |
---|---|---|---|
slavlemma-large | 89.12 | 87.27 | 82.50 |
slavlemma-base | 84.19 | 81.97 | 80.27 |
slavlemma-small | 78.85 | 75.86 | 76.18 |
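For readers who want to score their own predictions in a comparable way, exact match can be sketched as below. This is an illustration under stated assumptions, not the evaluation script behind the tables: the `exact_match` function is hypothetical, and it assumes case-sensitive string equality after stripping surrounding whitespace, which may differ from the normalization used for the reported scores.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions identical to their reference lemma.

    Assumption: comparison is case-sensitive after stripping surrounding
    whitespace; the evaluation behind the tables may normalize differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

The result is a percentage, on the same scale as the values in the tables.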
Citation
If you use the model, please cite the following paper:
TBD
Framework versions
- Transformers 4.26.0
- PyTorch 1.13.1.post200
- Datasets 2.9.0
- Tokenizers 0.13.2