---
license: mit
language:
- sk
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
base_model: daviddrzik/SK_Morph_BLM
tags:
- pos-tagging
---

# Fine-Tuned POS Tagging Model - SK_Morph_BLM (POS Tags)

## Model Overview
This model is a fine-tuned version of the [SK_Morph_BLM model](https://huggingface.co/daviddrzik/SK_Morph_BLM) for tokenization and POS tagging. For this task, we used the [UD Slovak SNK dataset](https://github.com/UniversalDependencies/UD_Slovak-SNK), which is part of the Universal Dependencies project. This dataset contains Slovak texts annotated with a range of linguistic information, including UPOS tags, morphological features, syntactic relations, and lemmas. We focused on the UPOS tags, which capture the basic part-of-speech categories.

## POS Tags
Each token in the dataset is annotated with one of the following POS tags:
- **NOUN (0):** Nouns
- **PUNCT (1):** Punctuation marks
- **VERB (2):** Verbs
- **ADJ (3):** Adjectives
- **ADP (4):** Adpositions (Prepositions)
- **PRON (5):** Pronouns
- **PROPN (6):** Proper nouns
- **ADV (7):** Adverbs
- **DET (8):** Determiners
- **AUX (9):** Auxiliary verbs
- **CCONJ (10):** Coordinating conjunctions
- **PART (11):** Particles
- **SCONJ (12):** Subordinating conjunctions
- **NUM (13):** Numerals

Unused tags:
- **X**
- **INTJ**
- **SYM**

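The numbering above corresponds to a plain label-index mapping. A minimal sketch of that mapping is below; the `id2label` entry shipped in the model's own `config.json` should be treated as authoritative:

```python
# Index-to-tag mapping mirroring the list above (a sketch; the model's
# config.json id2label is the authoritative source).
id2label = {
    0: "NOUN", 1: "PUNCT", 2: "VERB", 3: "ADJ", 4: "ADP",
    5: "PRON", 6: "PROPN", 7: "ADV", 8: "DET", 9: "AUX",
    10: "CCONJ", 11: "PART", 12: "SCONJ", 13: "NUM",
}
# Inverse mapping, useful when encoding gold labels for training
label2id = {tag: idx for idx, tag in id2label.items()}

print(id2label[4])       # ADP
print(label2id["VERB"])  # 2
```
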
## Dataset Details
The UD Slovak SNK dataset contains annotated Slovak texts that we adapted for this task, fine-tuning the model for POS tagging. The dataset provides a UPOS tag for each token, which allowed us to train the model to accurately recognize and categorize parts of speech in Slovak. The total number of sequences in the dataset we used is **9,847**.

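UD treebanks such as Slovak SNK are distributed in CoNLL-U format, where the UPOS tag is the fourth tab-separated column of each token line. Extracting (token, UPOS) pairs can be sketched as follows; the sample sentence here is illustrative, not taken verbatim from the treebank:

```python
# Minimal CoNLL-U parsing sketch: UPOS is column 4 (index 3).
# The sample lines are illustrative, not copied from the SNK treebank.
sample = (
    "# text = Od učenia nikto nezomrel.\n"
    "1\tOd\tod\tADP\t_\t_\t2\tcase\t_\t_\n"
    "2\tučenia\tučenie\tNOUN\t_\t_\t4\tobl\t_\t_\n"
    "3\tnikto\tnikto\tPRON\t_\t_\t4\tnsubj\t_\t_\n"
    "4\tnezomrel\tzomrieť\tVERB\t_\t_\t0\troot\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_\n"
)

def upos_pairs(conllu: str):
    pairs = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comments and blank sentence separators
        cols = line.split("\t")
        tok_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword-token ranges and empty nodes
        pairs.append((form, upos))
    return pairs

print(upos_pairs(sample))
# [('Od', 'ADP'), ('učenia', 'NOUN'), ('nikto', 'PRON'), ('nezomrel', 'VERB'), ('.', 'PUNCT')]
```
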
## Fine-Tuning Hyperparameters

The following hyperparameters were used during the fine-tuning process:

- **Learning Rate:** 3e-05
- **Training Batch Size:** 64
- **Evaluation Batch Size:** 64
- **Seed:** 42
- **Optimizer:** Adam (default)
- **Number of Epochs:** 10

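As a back-of-the-envelope illustration of the training budget these settings imply (an estimate, not a figure reported with the model):

```python
import math

total_sequences = 9_847
batch_size = 64
epochs = 10

# In 10-fold cross-validation, roughly 9/10 of the data trains each fold.
train_size = total_sequences * 9 // 10           # ~8,862 sequences
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 139 1390
```
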
## Model Performance

The model was evaluated using stratified 10-fold cross-validation, achieving a weighted F1-score with a median value of <span style="font-size: 24px;">**0.982**</span>.

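The weighted F1-score averages the per-class F1 scores weighted by each class's token count (support), so frequent tags like NOUN dominate the aggregate. A sketch of the computation with made-up per-class numbers (illustrative only, not the actual evaluation results):

```python
# Weighted F1 sketch: per-class F1 weighted by class support.
# The numbers below are hypothetical, for illustration only.
per_class = {  # tag: (f1, support)
    "NOUN": (0.99, 5000),
    "VERB": (0.98, 3000),
    "PUNCT": (1.00, 2000),
}

total_support = sum(s for _, s in per_class.values())
weighted_f1 = sum(f1 * s for f1, s in per_class.values()) / total_support

print(round(weighted_f1, 4))
```
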
## Model Usage

This model is suitable for tokenization and POS tagging of Slovak text. It is designed for applications that require accurate part-of-speech categorization across a variety of texts.

### Example Usage

Below is an example of how to use the fine-tuned `SK_Morph_BLM-pos` model in a Python script:

```python
import sys
import json

import torch
from transformers import RobertaForTokenClassification
from huggingface_hub import hf_hub_download, snapshot_download

class TokenClassifier:
    def __init__(self, model, tokenizer):
        self.model = RobertaForTokenClassification.from_pretrained(model, num_labels=14)

        repo_path = snapshot_download(repo_id=tokenizer)
        sys.path.append(repo_path)

        # Import the custom tokenizer from the downloaded repository
        from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer
        self.tokenizer = SKMorfoTokenizer()

        # Download and load the JSON file with the byte-to-UTF-8 mapping
        byte_utf8_mapping_path = hf_hub_download(repo_id=tokenizer, filename="byte_utf8_mapping.json")
        with open(byte_utf8_mapping_path, "r", encoding="utf-8") as f:
            self.byte_utf8_mapping = json.load(f)

    def decode(self, tokens):
        decoded_tokens = []
        for token in tokens:
            for k, v in self.byte_utf8_mapping.items():
                if k in token:
                    token = token.replace(k, v)
            token = token.replace("Ġ", " ")
            decoded_tokens.append(token)
        return decoded_tokens

    def tokenize_text(self, text):
        encoded_text = self.tokenizer.tokenize(text.lower(), max_length=256, return_tensors='pt', return_subword=False)
        return encoded_text

    def classify_tokens(self, text):
        encoded_text = self.tokenize_text(text)
        tokens = self.tokenizer.convert_list_ids_to_tokens(encoded_text['input_ids'].squeeze().tolist())

        with torch.no_grad():
            output = self.model(**encoded_text)
            logits = output.logits
            predictions = torch.argmax(logits, dim=-1)

        # Keep only the positions marked as real tokens by the attention mask
        active_loss = encoded_text['attention_mask'].view(-1) == 1
        active_logits = logits.view(-1, self.model.config.num_labels)[active_loss]
        active_predictions = predictions.view(-1)[active_loss]

        probabilities = torch.softmax(active_logits, dim=-1)

        results = []
        for token, pred, prob in zip(self.decode(tokens), active_predictions.tolist(), probabilities.tolist()):
            if token not in ['<s>', '</s>', '<pad>']:
                result = f"Token: {token: <10} POS tag: ({self.model.config.id2label[pred]} = {max(prob):.4f})"
                results.append(result)

        return results

# Instantiate the POS token classifier with the specified tokenizer and model
classifier = TokenClassifier(tokenizer="daviddrzik/SK_Morph_BLM", model="daviddrzik/SK_Morph_BLM-pos")

# Text to classify
text_to_classify = "Od učenia ešte nikto nezomrel, ale načo riskovať."

# Classify the tokens of the tokenized text
classification_results = classifier.classify_tokens(text_to_classify)
print("============= POS Token Classification =============")
print("Text to classify:", text_to_classify)
for classification_result in classification_results:
    print(classification_result)
```

### Example Output

Here is the output when running the above example:
```yaml
============= POS Token Classification =============
Text to classify: Od učenia ešte nikto nezomrel, ale načo riskovať.
Token:  od        POS tag: (ADP = 0.9976)
Token:  učenia    POS tag: (NOUN = 0.9891)
Token:  ešte      POS tag: (PART = 0.9775)
Token:  nikto     POS tag: (PRON = 0.8371)
Token:  nezomr    POS tag: (VERB = 0.9961)
Token: el         POS tag: (VERB = 0.9917)
Token: ,          POS tag: (PUNCT = 0.9990)
Token:  ale       POS tag: (CCONJ = 0.9914)
Token:  načo      POS tag: (ADV = 0.9188)
Token:  riskovať  POS tag: (VERB = 0.9955)
Token: .          POS tag: (PUNCT = 0.9991)
```
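
As the output shows, some words are split into subword tokens (`nezomr` + `el`), each tagged separately. A common post-processing step merges subword predictions back into word-level tags, e.g. by keeping the tag of the first subword. A minimal sketch, assuming word-initial tokens carry a leading space after decoding (mirroring the `Ġ` convention above); `merge_subword_tags` is a hypothetical helper, not part of the model:

```python
# Merge subword-level POS predictions into word-level tags.
# Assumption: tokens starting with a space begin a new word (the "Ġ"
# marker becomes a leading space after decoding); others continue it.
def merge_subword_tags(tokens, tags):
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token.startswith(" ") or not words:
            words.append(token.strip())
            word_tags.append(tag)  # keep the first subword's tag
        else:
            words[-1] += token     # continuation subword
    return list(zip(words, word_tags))

tokens = [" od", " učenia", " nezomr", "el", " ."]
tags = ["ADP", "NOUN", "VERB", "VERB", "PUNCT"]
print(merge_subword_tags(tokens, tags))
# [('od', 'ADP'), ('učenia', 'NOUN'), ('nezomrel', 'VERB'), ('.', 'PUNCT')]
```
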