daviddrzik committed on
Commit 74e3df1 · verified · 1 Parent(s): 0219222

Update README.md

Files changed (1)
  1. README.md +147 -0
README.md CHANGED
@@ -1,3 +1,150 @@
  ---
  license: mit
+ language:
+ - sk
+ pipeline_tag: token-classification
+ library_name: transformers
+ metrics:
+ - f1
+ base_model: daviddrzik/SK_BPE_BLM
+ tags:
+ - pos-tagging
+ datasets:
+ - universal-dependencies/universal_dependencies
  ---
+
+ # Fine-Tuned POS Tagging Model - SK_BPE_BLM (POS Tags)
+
+ ## Model Overview
+ This model is a fine-tuned version of the [SK_BPE_BLM model](https://huggingface.co/daviddrzik/SK_BPE_BLM) for tokenization and POS tagging. It was fine-tuned on the [UD Slovak SNK dataset](https://github.com/UniversalDependencies/UD_Slovak-SNK), which is part of the Universal Dependencies project. The dataset contains Slovak texts annotated with UPOS tags, morphological features, syntactic relations, and lemmas. We used only the UPOS tags, which give the basic part-of-speech category of each token.
+
+ ## POS Tags
+ Each token in the dataset is annotated with one of the following POS tags:
+ - **NOUN (0):** Nouns
+ - **PUNCT (1):** Punctuation marks
+ - **VERB (2):** Verbs
+ - **ADJ (3):** Adjectives
+ - **ADP (4):** Adpositions (Prepositions)
+ - **PRON (5):** Pronouns
+ - **PROPN (6):** Proper nouns
+ - **ADV (7):** Adverbs
+ - **DET (8):** Determiners
+ - **AUX (9):** Auxiliary verbs
+ - **CCONJ (10):** Coordinating conjunctions
+ - **PART (11):** Particles
+ - **SCONJ (12):** Subordinating conjunctions
+ - **NUM (13):** Numerals
+
+ Unused tags:
+ - **X**
+ - **INTJ**
+ - **SYM**
+
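The index-to-tag list above can be written down as a pair of lookup tables. This is a minimal sketch: the names `UPOS_TAGS`, `ID2LABEL`, and `LABEL2ID` are illustrative and are not taken from the model repository, which may store the mapping differently in its config.

```python
# Tag order copied from the README list above (index 0 = NOUN ... 13 = NUM).
# The variable names are illustrative assumptions, not part of the model repo.
UPOS_TAGS = [
    "NOUN", "PUNCT", "VERB", "ADJ", "ADP", "PRON", "PROPN",
    "ADV", "DET", "AUX", "CCONJ", "PART", "SCONJ", "NUM",
]

# Integer id -> tag name, and the inverse lookup.
ID2LABEL = dict(enumerate(UPOS_TAGS))
LABEL2ID = {tag: idx for idx, tag in ID2LABEL.items()}

print(len(ID2LABEL))    # 14, the same value the usage example passes as num_labels
print(ID2LABEL[4])      # ADP
print(LABEL2ID["NUM"])  # 13
```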
+ ## Dataset Details
+ We adapted the UD Slovak SNK dataset for this task and fine-tuned the model on its token-level UPOS annotations to recognize and categorize parts of speech in Slovak. The dataset we used contains **9,847** sequences in total.
+
+ ## Fine-Tuning Hyperparameters
+
+ The following hyperparameters were used during the fine-tuning process:
+
+ - **Learning Rate:** 3e-05
+ - **Training Batch Size:** 64
+ - **Evaluation Batch Size:** 64
+ - **Seed:** 42
+ - **Optimizer:** Adam (default)
+ - **Number of Epochs:** 10
+
+ ## Model Performance
+
+ The model was evaluated using stratified 10-fold cross-validation and achieved a median weighted F1-score of <span style="font-size: 24px;">**0.979**</span>.
+
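For readers unfamiliar with the metric, a weighted F1-score is the mean of the per-class F1 scores, each weighted by how many gold tokens carry that class. The sketch below illustrates that definition on a toy example. It is not the evaluation code used for this model, and the function name `weighted_f1` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of gold tokens per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total  # weight each class by its support
    return score


# Toy example (not real model output): one NOUN is mistagged as VERB.
gold = ["NOUN", "NOUN", "VERB", "ADJ"]
pred = ["NOUN", "VERB", "VERB", "ADJ"]
print(round(weighted_f1(gold, pred), 4))  # 0.75
```

With cross-validation, one such score is obtained per fold. The README reports the median of those fold scores, which could be taken with `statistics.median`.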
+ ## Model Usage
+
+ This model is suitable for tokenization and POS tagging of Slovak text. It is intended for applications that require accurate part-of-speech tags for a variety of Slovak texts.
+
+ ### Example Usage
+
+ Below is an example of how to use the fine-tuned `SK_BPE_BLM-pos` model in a Python script:
+
+ ```python
+ import torch
+ from transformers import RobertaForTokenClassification, RobertaTokenizerFast
+ from huggingface_hub import hf_hub_download
+ import json
+
+ class TokenClassifier:
+     def __init__(self, model, tokenizer):
+         self.model = RobertaForTokenClassification.from_pretrained(model, num_labels=14)
+         self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer, max_length=256)
+         # Mapping from byte-level BPE symbols back to UTF-8 characters, stored in the tokenizer repo
+         byte_utf8_mapping_path = hf_hub_download(repo_id=tokenizer, filename="byte_utf8_mapping.json")
+         with open(byte_utf8_mapping_path, "r", encoding="utf-8") as f:
+             self.byte_utf8_mapping = json.load(f)
+
+     def decode(self, tokens):
+         # Turn byte-level BPE tokens into readable text pieces
+         decoded_tokens = []
+         for token in tokens:
+             for k, v in self.byte_utf8_mapping.items():
+                 if k in token:
+                     token = token.replace(k, v)
+             token = token.replace("Ġ", " ")
+             decoded_tokens.append(token)
+         return decoded_tokens
+
+     def tokenize_text(self, text):
+         # The example lowercases the input before tokenizing
+         encoded_text = self.tokenizer(text.lower(), max_length=256, padding='max_length', truncation=True, return_tensors='pt')
+         return encoded_text
+
+     def classify_tokens(self, text):
+         encoded_text = self.tokenize_text(text)
+         tokens = self.tokenizer.convert_ids_to_tokens(encoded_text['input_ids'].squeeze().tolist())
+
+         with torch.no_grad():
+             output = self.model(**encoded_text)
+             logits = output.logits
+             predictions = torch.argmax(logits, dim=-1)
+
+         # Keep only the positions that are not padding
+         active_loss = encoded_text['attention_mask'].view(-1) == 1
+         active_logits = logits.view(-1, self.model.config.num_labels)[active_loss]
+         active_predictions = predictions.view(-1)[active_loss]
+
+         probabilities = torch.softmax(active_logits, dim=-1)
+
+         # Padding sits at the end, so zip() pairs each real token with its prediction
+         results = []
+         for token, pred, prob in zip(self.decode(tokens), active_predictions.tolist(), probabilities.tolist()):
+             if token not in ['<s>', '</s>', '<pad>']:
+                 result = f"Token: {token: <10} POS tag: ({self.model.config.id2label[pred]} = {max(prob):.4f})"
+                 results.append(result)
+
+         return results
+
+ # Instantiate the POS token classifier with the specified tokenizer and model
+ classifier = TokenClassifier(tokenizer="daviddrzik/SK_BPE_BLM", model="daviddrzik/SK_BPE_BLM-pos")
+
+ # Input text to classify
+ text_to_classify = "Od učenia ešte nikto nezomrel, ale načo riskovať."
+
+ # Tokenize the text and classify each token
+ classification_results = classifier.classify_tokens(text_to_classify)
+ print("============= POS Token Classification =============")
+ print("Text to classify:", text_to_classify)
+ for classification_result in classification_results:
+     print(classification_result)
+ ```
+
+ ### Example Output
+ Here is the output when running the above example:
+ ```yaml
+ ============= POS Token Classification =============
+ Text to classify: Od učenia ešte nikto nezomrel, ale načo riskovať.
+ Token: od POS tag: (ADP = 0.9984)
+ Token: učenia POS tag: (NOUN = 0.9952)
+ Token: ešte POS tag: (PART = 0.9720)
+ Token: nikto POS tag: (PRON = 0.9947)
+ Token: nezom POS tag: (VERB = 0.9973)
+ Token: rel POS tag: (VERB = 0.9950)
+ Token: , POS tag: (PUNCT = 0.9992)
+ Token: ale POS tag: (CCONJ = 0.9981)
+ Token: načo POS tag: (ADV = 0.9804)
+ Token: riskovať POS tag: (VERB = 0.9948)
+ Token: . POS tag: (PUNCT = 0.9994)
+ ```
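The output is per subword token, so `nezomrel` appears as the two pieces `nezom` and `rel`. If word-level tags are needed, the pieces can be merged in a post-processing step. The sketch below is one possible heuristic, not part of the model or its repository. It assumes the predictions are available as `(token, tag, confidence)` tuples rather than formatted strings, that a decoded piece beginning with a space starts a new word, and that punctuation always stands alone. The helper name `merge_subwords` is hypothetical.

```python
def merge_subwords(pieces):
    """Group (token, tag, confidence) pieces into words (heuristic sketch).

    Assumptions: a piece starts a new word if it is the first piece, begins
    with a space, is tagged PUNCT, or follows a PUNCT piece. A merged word
    keeps the tag of its first piece and the lowest confidence of its pieces.
    """
    words = []
    for token, tag, conf in pieces:
        starts_word = (
            not words
            or token.startswith(" ")
            or tag == "PUNCT"
            or words[-1][1] == "PUNCT"
        )
        if starts_word:
            words.append([token.strip(), tag, conf])
        else:
            words[-1][0] += token  # append the continuation piece
            words[-1][2] = min(words[-1][2], conf)
    return [tuple(w) for w in words]


# Values taken from the example output; the leading spaces are an assumption
# about how the decoded pieces look before they are printed.
pieces = [
    (" nikto", "PRON", 0.9947),
    (" nezom", "VERB", 0.9973),
    ("rel", "VERB", 0.9950),
    (",", "PUNCT", 0.9992),
    (" ale", "CCONJ", 0.9981),
]
for word in merge_subwords(pieces):
    print(word)
```

Taking the first piece's tag is a simple convention. Other choices, such as a majority vote over the pieces, are equally plausible.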