---
language: 
  - pl
  - cs
  - ru
tags:
  - mT5
  - lemmatization
license: apache-2.0
---


# SlavLemma Large

SlavLemma models lemmatize named entities and multi-word expressions in Polish, Czech, and Russian.

They were fine-tuned from Google's mT5 models, e.g. [google/mt5-large](https://huggingface.co/google/mt5-large).

## Usage

When using the model, prepend one of the language tokens (`>>pl<<`, `>>cs<<`, `>>ru<<`) to the input, based on the language of the phrase you want to lemmatize.

Sample usage:

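```python
from transformers import pipeline

# Load the model and tokenizer from the Hugging Face Hub.
pipe = pipeline(task="text2text-generation", model="amu-cai/slavlemma-large", tokenizer="amu-cai/slavlemma-large")

# Prefix the phrase with its language token; decode with beam search.
results = pipe([">>pl<< federalnego urzędu statystycznego"], clean_up_tokenization_spaces=True, num_beams=5)
hyp = results[0]["generated_text"]  # lemmatized phrase
```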
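For batches of phrases, the language-token convention can be wrapped in a small helper. The sketch below is illustrative only: `add_language_token` and `lemmatize` are hypothetical names, not part of the model or of `transformers`, and `pipe` is assumed to be a `text2text-generation` pipeline created as in the sample above.

```python
# Language codes taken from the tokens listed in this section.
SUPPORTED_LANGUAGES = ("pl", "cs", "ru")


def add_language_token(phrase: str, lang: str) -> str:
    """Return the phrase prefixed with the '>>lang<<' token (hypothetical helper)."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {lang!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    return f">>{lang}<< {phrase.strip()}"


def lemmatize(phrases, lang, pipe, num_beams=5):
    """Lemmatize phrases that share one language using an existing pipeline.

    `pipe` is assumed to return one {"generated_text": ...} dict per input,
    as in the sample usage.
    """
    inputs = [add_language_token(phrase, lang) for phrase in phrases]
    outputs = pipe(inputs, clean_up_tokenization_spaces=True, num_beams=num_beams)
    return [output["generated_text"] for output in outputs]
```

With the pipeline from the sample, `lemmatize(["federalnego urzędu statystycznego"], "pl", pipe)` would send the same prefixed input as above. Rejecting unknown language codes early avoids silently passing a token the model was not trained on.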


## Evaluation results

Lemmatization Exact Match was computed on the SlavNER 2021 test sets (COVID-19 and USA 2020 Elections).
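As a rough illustration of the metric, exact match can be computed as the percentage of predicted lemmas that are identical to the reference lemmas. The sketch below is an assumption about the general form of the metric, not the official SlavNER scorer: the function name `exact_match` is hypothetical, and the comparison here is case-sensitive with only surrounding whitespace stripped, which may differ from the normalization used for the reported scores.

```python
def exact_match(predictions, references):
    """Return the percentage of predictions equal to their reference lemma.

    Assumption: comparison is case-sensitive and ignores only leading and
    trailing whitespace. The official evaluation may normalize differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        pred.strip() == ref.strip() for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)
```

For instance, two predictions of which one matches its reference give a score of 50.0.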


COVID-19:

| Model | pl | cs | ru |
| :------ | ------: | ------: | ------: |
| [slavlemma-large](https://huggingface.co/amu-cai/slavlemma-large) | 93.76 | 89.80 | 77.30 |
| [slavlemma-base](https://huggingface.co/amu-cai/slavlemma-base) | 91.00 | 86.29 | 76.10 |
| [slavlemma-small](https://huggingface.co/amu-cai/slavlemma-small) | 86.80 | 80.98 | 73.83 |

USA 2020 Elections:

| Model | pl | cs | ru |
| :------ | ------: | ------: | ------: |
| [slavlemma-large](https://huggingface.co/amu-cai/slavlemma-large) | 89.12 | 87.27 | 82.50 |
| [slavlemma-base](https://huggingface.co/amu-cai/slavlemma-base) | 84.19 | 81.97 | 80.27 |
| [slavlemma-small](https://huggingface.co/amu-cai/slavlemma-small) | 78.85 | 75.86 | 76.18 |


## Citation

If you use the model, please cite the following paper:

TBD

### Framework versions

- Transformers 4.26.0
- PyTorch 1.13.1.post200
- Datasets 2.9.0
- Tokenizers 0.13.2