---
language: pl
license: mit
tags:
  - ner
datasets:
  - clarin-pl/kpwr-ner
metrics:
  - f1
  - accuracy
  - precision
  - recall
widget:
  - text: "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
    example_title: "Example"
---

# FastPDN

FastPolDeepNer is a model for Named Entity Recognition, designed for easy use, training, and configuration. The forerunner of this project is [PolDeepNer2](https://gitlab.clarin-pl.eu/information-extraction/poldeepner2). The model implements a pipeline covering data processing and training, built with Hydra, PyTorch, PyTorch Lightning, and Transformers.

Source code: https://gitlab.clarin-pl.eu/grupa-wieszcz/ner/fast-pdn

## How to use

Here is how to use this model to extract named entities from text:

```python
from transformers import pipeline
ner = pipeline('ner', model='clarin-pl/FastPDN', aggregation_strategy='simple')

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
ner_results = ner(text)
for output in ner_results:
    print(output)

{'entity_group': 'nam_liv_person', 'score': 0.9996054, 'word': 'Jan Kowalski', 'start': 12, 'end': 24}
{'entity_group': 'nam_loc_gpe_city', 'score': 0.998931, 'word': 'Wrocławiu', 'start': 39, 'end': 48}
```

Here is how to use this model to get the logits for every token in a text:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("clarin-pl/FastPDN")
model = AutoModelForTokenClassification.from_pretrained("clarin-pl/FastPDN")

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
logits = output.logits  # shape: (batch, sequence_length, num_labels)
```
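
To turn the logits into tag predictions, here is a minimal decoding sketch (it assumes the checkpoint's `id2label` mapping is populated, as is standard for Transformers token-classification models):

```python
import torch

# Pick the highest-scoring class for each token and map ids to tag names.
# Note: the token list includes the tokenizer's special tokens.
predictions = torch.argmax(logits, dim=2)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id])
```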

## Training data
The FastPDN model was trained on the 82-class versions of the KPWr and CEN datasets. Annotation guidelines are specified [here](https://clarin-pl.eu/dspace/bitstream/handle/11321/294/WytyczneKPWr-jednostkiidentyfikacyjne.pdf).
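
The KPWr portion is also listed in this card's metadata as `clarin-pl/kpwr-ner`; a minimal loading sketch using the standard `datasets` API (split and column names should be verified against the dataset card):

```python
from datasets import load_dataset

# Load the KPWr NER dataset from the Hugging Face Hub.
kpwr = load_dataset("clarin-pl/kpwr-ner")
print(kpwr)              # available splits
print(kpwr["train"][0])  # one example; see the dataset card for the schema
```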

## Pretraining
FastPDN models were fine-tuned from the following pretrained models:
- [herbert-base-cased](https://huggingface.co/allegro/herbert-base-cased)
- [distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1)

## Evaluation

Results for runs trained on `cen_n82` and `kpwr_n82`:

| name      | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
|-----------|---------|--------------|----------|----------------|-------------|
| distiluse | 0.53    | 0.61         | 0.95     | 0.55           | 0.54        |
| herbert   | 0.68    | 0.78         | 0.97     | 0.70           | 0.69        |
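
For reference, entity-level metrics of this kind can be computed with the `seqeval` library; the snippet below is only an illustrative sketch with made-up tag sequences, not the project's actual evaluation code:

```python
from seqeval.metrics import classification_report, f1_score

# Illustrative only: these IOB2 tag sequences are invented for the example.
y_true = [["B-nam_liv_person", "I-nam_liv_person", "O", "B-nam_loc_gpe_city"]]
y_pred = [["B-nam_liv_person", "I-nam_liv_person", "O", "O"]]

print(f1_score(y_true, y_pred))  # entity-level F1
print(classification_report(y_true, y_pred))
```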


## Authors

- Grupa Wieszcze CLARIN-PL
- Wiktor Walentynowicz

## Contact

- Norbert Ropiak ([email protected])