daviddrzik committed
Commit f47cebc · verified · 1 Parent(s): f0bb85c

Update README.md

  1. README.md +153 -0
README.md CHANGED
@@ -1,3 +1,156 @@
  ---
  license: mit
+ language:
+ - sk
+ pipeline_tag: token-classification
+ library_name: transformers
+ metrics:
+ - f1
+ base_model: daviddrzik/SK_Morph_BLM
+ tags:
+ - ner
+ datasets:
+ - NaiveNeuron/wikigoldsk
  ---
+
+ # Fine-Tuned Named Entity Recognition (NER) Model - SK_Morph_BLM (NER Tags)
+
+ ## Model Overview
+ This model is a fine-tuned version of the [SK_Morph_BLM model](https://huggingface.co/daviddrzik/SK_Morph_BLM) for token-level Named Entity Recognition (NER). It was fine-tuned on the manually annotated [WikiGoldSK dataset](https://github.com/NaiveNeuron/WikiGoldSK), which was built from 412 Slovak Wikipedia articles. The dataset annotates four main entity categories: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC).
+
+ ## NER Tags
+ Each token in the dataset is annotated with one of the following NER tags:
+ - **O (0):** Regular text (not an entity)
+ - **B-PER (1):** Beginning of a person entity
+ - **I-PER (2):** Continuation of a person entity
+ - **B-LOC (3):** Beginning of a location entity
+ - **I-LOC (4):** Continuation of a location entity
+ - **B-ORG (5):** Beginning of an organization entity
+ - **I-ORG (6):** Continuation of an organization entity
+ - **B-MISC (7):** Beginning of a miscellaneous entity
+ - **I-MISC (8):** Continuation of a miscellaneous entity
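
The tag indices listed above can be written down as a label mapping. The following is a minimal sketch that assumes the indices are exactly as listed; the names `ID2LABEL`, `LABEL2ID`, and `entity_type` are illustrative and are not taken from the model's configuration. Since the usage example in this README loads the model with `num_labels=10`, the checkpoint's own `id2label` may differ and should be treated as authoritative.

```python
from typing import Optional

# Tag indices as listed in the "NER Tags" section; names are illustrative only.
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-LOC", 4: "I-LOC",
    5: "B-ORG", 6: "I-ORG",
    7: "B-MISC", 8: "I-MISC",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def entity_type(tag_id: int) -> Optional[str]:
    """Return the entity category (PER, LOC, ORG, MISC), or None for the O tag."""
    label = ID2LABEL[tag_id]
    return None if label == "O" else label.split("-", 1)[1]


print(entity_type(LABEL2ID["B-ORG"]))
```

The `B-`/`I-` prefix only marks the position inside an entity, so stripping it gives the category shared by all tokens of the same entity.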
+
+ ## Dataset Details
+ The WikiGoldSK dataset, which contains a total of **6,633** sequences, was adapted for this NER task. The dataset is originally split into training, validation, and test sets; for our research, we merged all three parts and evaluated the model using stratified 10-fold cross-validation. Every token in the text, including words and punctuation, is annotated with the appropriate NER tag.
+
+ ## Fine-Tuning Hyperparameters
+
+ The following hyperparameters were used during the fine-tuning process:
+
+ - **Learning Rate:** 3e-05
+ - **Training Batch Size:** 64
+ - **Evaluation Batch Size:** 64
+ - **Seed:** 42
+ - **Optimizer:** Adam (default)
+ - **Number of Epochs:** 10
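
To give a feel for the training budget these settings imply, the sketch below works out the number of optimizer steps per fold. It is a back-of-the-envelope calculation, not the authors' training script, and it assumes a single device, no gradient accumulation, a kept final partial batch, and folds of near-equal size. The constant and function names are illustrative.

```python
import math

NUM_SEQUENCES = 6633  # total sequences reported in "Dataset Details"
NUM_FOLDS = 10        # 10-fold cross-validation
BATCH_SIZE = 64       # training batch size
NUM_EPOCHS = 10       # number of epochs


def steps_per_epoch(train_size: int, batch_size: int) -> int:
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(train_size / batch_size)


# With near-equal folds, the held-out fold has 663 or 664 sequences.
largest_fold = math.ceil(NUM_SEQUENCES / NUM_FOLDS)
train_size = NUM_SEQUENCES - largest_fold
steps = steps_per_epoch(train_size, BATCH_SIZE)
total_steps = steps * NUM_EPOCHS

print(f"train sequences per fold: {train_size}")
print(f"steps per epoch: {steps}, total steps: {total_steps}")
```

Under these assumptions each fold trains on roughly 5,970 sequences, which is 94 steps per epoch and 940 steps over 10 epochs.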
+
+ ## Model Performance
+
+ The model was evaluated using stratified 10-fold cross-validation and achieved a median weighted F1-score of <span style="font-size: 24px;">**0.9605**</span> across the folds.
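
The README does not spell out how the weighted F1-score is computed. The sketch below assumes the common token-level definition (per-class F1 averaged with weights proportional to each class's support, as in scikit-learn's `average="weighted"`) and then takes the median over folds. The function name and the toy labels and fold scores are made up for illustration and are not the reported results.

```python
from collections import Counter
from statistics import median


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 over token labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is > 0 because n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Toy example: one I-PER token is mislabelled as O.
y_true = ["O", "B-PER", "I-PER", "O"]
y_pred = ["O", "B-PER", "O", "O"]
print(round(weighted_f1(y_true, y_pred), 4))

# The headline number would then be the median of the per-fold scores.
toy_fold_scores = [0.5, 0.7, 0.9]  # placeholder values, not real results
print(median(toy_fold_scores))
```

Reporting the median rather than the mean makes the summary less sensitive to a single unusually good or bad fold.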
+
+ ## Model Usage
+
+ This model is suitable for token-level NER in Slovak text. It is intended for applications that need accurate identification and categorization of named entities in a variety of Slovak texts.
+
+ ### Example Usage
+
+ Below is an example of how to use the fine-tuned `SK_Morph_BLM-ner` model in a Python script:
+
+ ```python
+ import json
+ import sys
+
+ import torch
+ from huggingface_hub import hf_hub_download, snapshot_download
+ from transformers import RobertaForTokenClassification
+
+ class TokenClassifier:
+     def __init__(self, model, tokenizer):
+         self.model = RobertaForTokenClassification.from_pretrained(model, num_labels=10)
+
+         # Download the tokenizer repository and make its package importable
+         repo_path = snapshot_download(repo_id=tokenizer)
+         sys.path.append(repo_path)
+
+         from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer
+         self.tokenizer = SKMorfoTokenizer()
+
+         # Mapping from byte-level symbols back to UTF-8 characters
+         byte_utf8_mapping_path = hf_hub_download(repo_id=tokenizer, filename="byte_utf8_mapping.json")
+         with open(byte_utf8_mapping_path, "r", encoding="utf-8") as f:
+             self.byte_utf8_mapping = json.load(f)
+
+     def decode(self, tokens):
+         decoded_tokens = []
+         for token in tokens:
+             for k, v in self.byte_utf8_mapping.items():
+                 if k in token:
+                     token = token.replace(k, v)
+             token = token.replace("Ġ", " ")
+             decoded_tokens.append(token)
+         return decoded_tokens
+
+     def tokenize_text(self, text):
+         encoded_text = self.tokenizer.tokenize(text.lower(), max_length=256, return_tensors='pt', return_subword=False)
+         return encoded_text
+
+     def classify_tokens(self, text):
+         encoded_text = self.tokenize_text(text)
+         tokens = self.tokenizer.convert_list_ids_to_tokens(encoded_text['input_ids'].squeeze().tolist())
+
+         with torch.no_grad():
+             output = self.model(**encoded_text)
+             logits = output.logits
+             predictions = torch.argmax(logits, dim=-1)
+
+         # Keep only positions selected by the attention mask (drops padding)
+         active_loss = encoded_text['attention_mask'].view(-1) == 1
+         active_logits = logits.view(-1, self.model.config.num_labels)[active_loss]
+         active_predictions = predictions.view(-1)[active_loss]
+
+         probabilities = torch.softmax(active_logits, dim=-1)
+
+         results = []
+         for token, pred, prob in zip(self.decode(tokens), active_predictions.tolist(), probabilities.tolist()):
+             if token not in ['<s>', '</s>', '<pad>']:
+                 result = f"Token: {token: <10} NER tag: ({self.model.config.id2label[pred]} = {max(prob):.4f})"
+                 results.append(result)
+
+         return results
+
+ # Instantiate the NER classifier with the specified tokenizer and model
+ classifier = TokenClassifier(tokenizer="daviddrzik/SK_Morph_BLM", model="daviddrzik/SK_Morph_BLM-ner")
+
+ # Input text to classify
+ text_to_classify = "Dávid Držík je interný doktorand na Fakulte prírodných vied a informatiky UKF v Nitre na Slovensku."
+
+ # Classify the NER tags of the tokenized text
+ classification_results = classifier.classify_tokens(text_to_classify)
+ print("============= NER Token Classification =============")
+ print("Text to classify:", text_to_classify)
+ for classification_result in classification_results:
+     print(classification_result)
+ ```
+
+ ### Example Output
+ Here is the output when running the above example:
+ ```yaml
+ ============= NER Token Classification =============
+ Text to classify: Dávid Držík je interný doktorand na Fakulte prírodných vied a informatiky UKF v Nitre na Slovensku.
+ Token: dávid NER tag: (B-PER = 0.9924)
+ Token: drž NER tag: (I-PER = 0.9040)
+ Token: ík NER tag: (I-PER = 0.7020)
+ Token: je NER tag: (O = 0.9985)
+ Token: intern NER tag: (O = 0.9978)
+ Token: ý NER tag: (O = 0.9976)
+ Token: doktorand NER tag: (O = 0.9986)
+ Token: na NER tag: (O = 0.9989)
+ Token: fakulte NER tag: (B-ORG = 0.9857)
+ Token: prírodných NER tag: (I-ORG = 0.9585)
+ Token: vied NER tag: (I-ORG = 0.9905)
+ Token: a NER tag: (I-ORG = 0.9607)
+ Token: informatiky NER tag: (I-ORG = 0.9773)
+ Token: uk NER tag: (I-ORG = 0.9490)
+ Token: f NER tag: (I-ORG = 0.9946)
+ Token: v NER tag: (I-ORG = 0.9865)
+ Token: nitre NER tag: (B-LOC = 0.6015)
+ Token: na NER tag: (O = 0.9555)
+ Token: slovensku NER tag: (B-LOC = 0.9661)
+ Token: . NER tag: (O = 0.9972)
+ ```
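
The output above is per subword token, so a downstream application will usually want to merge consecutive `B-`/`I-` tokens into entity spans. The following is a minimal post-processing sketch, not part of the model or its tokenizer; `group_entities` is a hypothetical helper, and it keeps the tokens of each span as a list because how subwords should be rejoined into words depends on the tokenizer's whitespace handling.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens into (entity_type, [tokens]) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in tagged_tokens:
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A B- tag, or an I- tag of a different type, starts a new span.
            if current_tokens:
                entities.append((current_type, current_tokens))
            current_type, current_tokens = etype, [token]
        elif prefix == "I":
            # Continuation of the span that is currently open.
            current_tokens.append(token)
        else:
            # An O tag closes any open span.
            if current_tokens:
                entities.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, current_tokens))
    return entities


# A few (token, tag) pairs taken from the example output above.
tagged = [
    ("dávid", "B-PER"), ("drž", "I-PER"), ("ík", "I-PER"), ("je", "O"),
    ("nitre", "B-LOC"), ("na", "O"), ("slovensku", "B-LOC"),
]
for etype, tokens in group_entities(tagged):
    print(etype, tokens)
```

Treating a stray `I-` tag as the start of a new span is one common convention; a stricter variant could discard such tokens instead.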