---
license: mit
language:
- sk
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
base_model: daviddrzik/SK_BPE_BLM
tags:
- ner
datasets:
- NaiveNeuron/wikigoldsk
---

# Fine-Tuned Named Entity Recognition (NER) Model - SK_BPE_BLM (NER Tags)

## Model Overview
This model is a fine-tuned version of the [SK_BPE_BLM model](https://huggingface.co/daviddrzik/SK_BPE_BLM) for token classification, specifically Named Entity Recognition (NER). For this task, we used the manually annotated [WikiGoldSK dataset](https://github.com/NaiveNeuron/WikiGoldSK), which was created from 412 Slovak Wikipedia articles. The dataset contains annotations for four main categories of entities: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC).

## NER Tags
Each token in the dataset is annotated with one of the following NER tags:
- **O (0):** Regular text (not an entity)
- **B-PER (1):** Beginning of a person entity
- **I-PER (2):** Continuation of a person entity
- **B-LOC (3):** Beginning of a location entity
- **I-LOC (4):** Continuation of a location entity
- **B-ORG (5):** Beginning of an organization entity
- **I-ORG (6):** Continuation of an organization entity
- **B-MISC (7):** Beginning of a miscellaneous entity
- **I-MISC (8):** Continuation of a miscellaneous entity
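
As an illustration, the tag list above can be written as a lookup table. This is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, `split_tag`, and `decode_ids` are hypothetical helpers introduced here, and the mapping stored in the model's own configuration should take precedence if it differs.

```python
# Tag ids exactly as listed above (assumption: the model config uses the same order).
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-LOC", 4: "I-LOC",
    5: "B-ORG", 6: "I-ORG",
    7: "B-MISC", 8: "I-MISC",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def split_tag(tag):
    """Split a BIO tag into (prefix, entity_type); "O" has no entity type."""
    if tag == "O":
        return "O", None
    prefix, entity_type = tag.split("-", 1)
    return prefix, entity_type


def decode_ids(tag_ids):
    """Convert a sequence of tag ids into their string labels."""
    return [ID2LABEL[i] for i in tag_ids]


print(decode_ids([1, 2, 0, 3]))  # ['B-PER', 'I-PER', 'O', 'B-LOC']
print(split_tag("I-ORG"))        # ('I', 'ORG')
```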

## Dataset Details
The WikiGoldSK dataset, which contains a total of **6,633** sequences, was adapted for this NER task. The dataset was originally split into training, validation, and test sets, but for our research, we combined all parts and evaluated the model using stratified 10-fold cross-validation. Each token in the text, including words and punctuation, was annotated with the appropriate NER tag.

## Fine-Tuning Hyperparameters

The following hyperparameters were used during the fine-tuning process:

- **Learning Rate:** 3e-05
- **Training Batch Size:** 64
- **Evaluation Batch Size:** 64
- **Seed:** 42
- **Optimizer:** Adam (default)
- **Number of Epochs:** 10

## Model Performance

The model was evaluated using stratified 10-fold cross-validation, achieving a weighted F1-score with a median value of <span style="font-size: 24px;">**0.9565**</span>.
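
For readers unfamiliar with the metric, the sketch below shows one common way to compute a support-weighted F1-score over token tags and to take the median across folds. It assumes token-level scoring over flat tag lists; the exact evaluation code used for this model is not given here, and `weighted_f1` and the toy fold data are hypothetical.

```python
from collections import Counter
from statistics import median


def weighted_f1(true_tags, pred_tags):
    """Mean of per-label F1 scores, weighted by each label's support in true_tags."""
    support = Counter(true_tags)
    total = len(true_tags)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(true_tags, pred_tags) if t == label and p == label)
        n_pred = sum(1 for p in pred_tags if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy (true, predicted) tag lists standing in for three folds; not real results.
folds = [
    (["O", "O", "B-PER", "I-PER"], ["O", "O", "B-PER", "O"]),
    (["O", "B-LOC", "O"], ["O", "B-LOC", "O"]),
    (["B-ORG", "I-ORG", "O"], ["B-ORG", "I-ORG", "O"]),
]
fold_scores = [weighted_f1(t, p) for t, p in folds]
print("median weighted F1:", median(fold_scores))
```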

## Model Usage

This model is intended for NER on Slovak text: it identifies named entities and assigns each one to the PER, LOC, ORG, or MISC category.

### Example Usage

Below is an example of how to use the fine-tuned `SK_BPE_BLM-ner` model in a Python script:

```python
import torch
from transformers import RobertaForTokenClassification, RobertaTokenizerFast
from huggingface_hub import hf_hub_download
import json

class TokenClassifier:
    def __init__(self, model, tokenizer):
        self.model = RobertaForTokenClassification.from_pretrained(model, num_labels=10)
        self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer, max_length=256)
        byte_utf8_mapping_path = hf_hub_download(repo_id=tokenizer, filename="byte_utf8_mapping.json")
        with open(byte_utf8_mapping_path, "r", encoding="utf-8") as f:
            self.byte_utf8_mapping = json.load(f)
            
    def decode(self, tokens):
        decoded_tokens = []
        for token in tokens:
            for k, v in self.byte_utf8_mapping.items():
                if k in token:
                    token = token.replace(k, v)
                token = token.replace("Ġ"," ")
            decoded_tokens.append(token)
        return decoded_tokens

    def tokenize_text(self, text):
        encoded_text = self.tokenizer(text.lower(), max_length=256, padding='max_length', truncation=True, return_tensors='pt')
        return encoded_text

    def classify_tokens(self, text):
        encoded_text = self.tokenize_text(text)
        tokens = self.tokenizer.convert_ids_to_tokens(encoded_text['input_ids'].squeeze().tolist())

        with torch.no_grad():
            output = self.model(**encoded_text)
            logits = output.logits
            predictions = torch.argmax(logits, dim=-1)

            active_loss = encoded_text['attention_mask'].view(-1) == 1
            active_logits = logits.view(-1, self.model.config.num_labels)[active_loss]
            active_predictions = predictions.view(-1)[active_loss]

            probabilities = torch.softmax(active_logits, dim=-1)

            results = []
            for token, pred, prob in zip(self.decode(tokens), active_predictions.tolist(), probabilities.tolist()):
                if token not in ['<s>', '</s>', '<pad>']:
                    result = f"Token: {token: <10}  NER tag: ({self.model.config.id2label[pred]} = {max(prob):.4f})"
                    results.append(result)

        return results

# Instantiate the NER classifier with the specified tokenizer and model
classifier = TokenClassifier(tokenizer="daviddrzik/SK_BPE_BLM", model="daviddrzik/SK_BPE_BLM-ner")

# Input text to classify
text_to_classify = "Dávid Držík je interný doktorand na Fakulte prírodných vied a informatiky UKF v Nitre na Slovensku."

# Predict an NER tag (with its probability) for each token of the text
classification_results = classifier.classify_tokens(text_to_classify)
print("============= NER Token Classification =============")
print("Text to classify:", text_to_classify)
for classification_result in classification_results:
    print(classification_result)
```

### Example Output

Here is the output when running the above example:
```yaml
============= NER Token Classification =============
Text to classify: Dávid Držík je interný doktorand na Fakulte prírodných vied a informatiky UKF v Nitre na Slovensku.
Token: dá          NER tag: (B-PER = 0.9673)
Token: vid         NER tag: (B-PER = 0.9816)
Token:  drží       NER tag: (I-PER = 0.6309)
Token: k           NER tag: (I-PER = 0.6584)
Token:  je         NER tag: (O = 0.9970)
Token:  inter      NER tag: (O = 0.9005)
Token: ný          NER tag: (O = 0.9833)
Token:  doktorand  NER tag: (O = 0.8623)
Token:  na         NER tag: (O = 0.9965)
Token:  fakulte    NER tag: (B-ORG = 0.9886)
Token:  prírodných  NER tag: (I-ORG = 0.8822)
Token:  vied       NER tag: (I-ORG = 0.9970)
Token:  a          NER tag: (I-ORG = 0.9908)
Token:  informatiky  NER tag: (I-ORG = 0.9849)
Token:  ukf        NER tag: (I-ORG = 0.9112)
Token:  v          NER tag: (I-ORG = 0.9969)
Token:  nitre      NER tag: (I-ORG = 0.9790)
Token:  na         NER tag: (O = 0.9744)
Token:  slovensku  NER tag: (B-LOC = 0.9944)
Token: .           NER tag: (O = 0.9767)
```
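
The output above is reported per subword token, so a word such as "Dávid" appears as two pieces. The following sketch shows one possible way to post-process such output into whole-word entities. It is not part of the model's API: `merge_subwords` and `group_entities` are hypothetical helpers, and the rules they apply (a leading space starts a new word, a word takes the tag of its first piece, punctuation-only tokens stand alone) are assumptions chosen for illustration.

```python
def merge_subwords(token_tags):
    """Merge (token, tag) subword pieces into (word, tag) pairs.

    Assumptions: a token starting with a space opens a new word, a token with
    no alphanumeric characters is kept as its own word, and each word keeps
    the tag predicted for its first piece.
    """
    words = []
    for token, tag in token_tags:
        is_punct = not any(ch.isalnum() for ch in token)
        if token.startswith(" ") or not words or is_punct:
            words.append([token.strip(), tag])
        else:
            words[-1][0] += token
    return [(word, tag) for word, tag in words]


def group_entities(word_tags):
    """Group BIO-tagged words into (entity_type, text) spans."""
    entities = []
    current = None  # the span being built: [entity_type, [words]]
    for word, tag in word_tags:
        if tag == "O":
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and current is not None and current[0] == entity_type:
            current[1].append(word)
        else:
            # A B- tag, or an I- tag with no matching open span, starts a new span.
            current = [entity_type, [word]]
            entities.append(current)
    return [(etype, " ".join(ws)) for etype, ws in entities]


# A shortened excerpt of the token/tag pairs from the example output.
sample = [
    ("dá", "B-PER"), ("vid", "B-PER"), (" drží", "I-PER"), ("k", "I-PER"),
    (" je", "O"), (" na", "O"), (" slovensku", "B-LOC"), (".", "O"),
]
print(group_entities(merge_subwords(sample)))
# [('PER', 'dávid držík'), ('LOC', 'slovensku')]
```

Taking the tag of the first piece is only one convention; a majority vote or the highest-probability piece would be equally reasonable choices.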