metadata

language:
  - pl
  - cs
  - ru
tags:
  - mT5
  - lemmatization
license: apache-2.0

SlavLemma Large

SlavLemma models are intended for lemmatization of named entities and multi-word expressions in Polish, Czech, and Russian.

They were fine-tuned from google/mT5 checkpoints, for example google/mt5-large.

Usage

When using the model, prepend one of the language tokens (>>pl<<, >>cs<<, >>ru<<) to the input, based on the language of the phrase you want to lemmatize.

Sample usage:

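from transformers import pipeline

# Load the text2text-generation pipeline with the SlavLemma Large checkpoint.
pipe = pipeline(task="text2text-generation", model="amu-cai/slavlemma-large", tokenizer="amu-cai/slavlemma-large")
# The input starts with the language token; beam search returns one hypothesis per input.
results = pipe([">>pl<< federalnego urzędu statystycznego"], clean_up_tokenization_spaces=True, num_beams=5)
hyp = results[0]["generated_text"]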
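
When phrases in more than one language are lemmatized, a small helper can attach the language token. The following is a minimal sketch; the names `LANGUAGE_TOKENS` and `add_language_token` are hypothetical and not part of the model or of transformers, and the single space between token and phrase follows the sample above.

```python
# Hypothetical helper: maps a language code to the token the model expects.
LANGUAGE_TOKENS = {"pl": ">>pl<<", "cs": ">>cs<<", "ru": ">>ru<<"}


def add_language_token(phrase: str, lang: str) -> str:
    """Return `phrase` prefixed with the language token for `lang`."""
    if lang not in LANGUAGE_TOKENS:
        raise ValueError(f"Unsupported language: {lang!r}")
    # Token and phrase are separated by one space, as in the sample usage.
    return f"{LANGUAGE_TOKENS[lang]} {phrase.strip()}"


# Build a batch of model inputs from (phrase, language) pairs.
batch = [("federalnego urzędu statystycznego", "pl")]
model_inputs = [add_language_token(phrase, lang) for phrase, lang in batch]
print(model_inputs)
```

The resulting list can be passed to `pipe` in place of the hand-written input string.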

Evaluation results

Lemmatization Exact Match was computed on the SlavNER 2021 test sets (COVID-19 and USA 2020 Elections).

COVID-19:

Model	pl	cs	ru
slavlemma-large	93.76	89.80	77.30
slavlemma-base	91.00	86.29	76.10
slavlemma-small	86.80	80.98	73.83

USA 2020 Elections:

Model	pl	cs	ru
slavlemma-large	89.12	87.27	82.50
slavlemma-base	84.19	81.97	80.27
slavlemma-small	78.85	75.86	76.18
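
For orientation, exact match can be read as the share of predicted lemmas that are identical to the reference lemma. The sketch below assumes plain string comparison after stripping surrounding whitespace and reports a percentage. The official SlavNER evaluation may normalize differently (for example with respect to case), so this is not a reproduction of the scores above. The function name `exact_match` and the toy data are hypothetical.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that equal their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Assumption: only surrounding whitespace is ignored; case is significant.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Toy data for illustration only; these are not SlavNER examples.
predictions = ["a b", "c d", "e f", "g h"]
references = ["a b", "c d", "e f", "g x"]
score = exact_match(predictions, references)
print(f"{score:.2f}")  # 3 of 4 match, so this prints 75.00
```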
