daviddrzik committed on
Commit bd682bf · verified · 1 Parent(s): 0cf20aa

Update README.md

  1. README.md +138 -3
README.md CHANGED
@@ -1,3 +1,138 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - sk
5
+ pipeline_tag: question-answering
6
+ library_name: transformers
7
+ metrics:
8
+ - f1
9
+ - levenshtein
10
+ - exact match
11
+ base_model: daviddrzik/SK_Morph_BLM
12
+ tags:
13
+ - question-answering
14
+ - sk-quad
15
+ datasets:
16
+ - TUKE-DeutscheTelekom/skquad
17
+ ---
18
+
19
+ # Fine-Tuned Question Answering Model - SK_Morph_BLM (SK-QuAD Dataset)
20
+
21
+ ## Model Overview
22
+ This model is a fine-tuned version of the [SK_Morph_BLM model](https://huggingface.co/daviddrzik/SK_Morph_BLM) for extractive question answering. It was fine-tuned on the [SK-QuAD dataset](https://nlp.kemt.fei.tuke.sk/language/skquad), the first manually annotated question answering dataset for Slovak, which contains over 91,000 questions and answers. The dataset includes clearly answerable questions, unanswerable questions, and plausible but probably incorrect answers.
23
+
24
+ ## Dataset Details
25
+ For fine-tuning, we used only the records with clearly answerable questions. The original dataset is divided into training and test sets; we merged them into a single dataset for our research. Some records had contexts so long that, combined with the question, they exceeded the context window of our model. We therefore excluded all records where the combined length of the context and question exceeded 1,300 characters, which corresponds to approximately 256 tokens. After this filtering, the final dataset contained **54,319** question-answer pairs.
26
+ To ensure robust evaluation, we applied stratified 10-fold cross-validation across the dataset. This allowed us to assess how well the model's performance generalizes across different subsets of the data.
27
+
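The filtering and fold assignment described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the record layout (`context`, `question`, `answer` keys) and the stratification key are assumptions, since the source does not say how the combined length was measured or which variable the folds were stratified on.

```python
from collections import defaultdict

MAX_CHARS = 1300  # threshold from the text (roughly 256 tokens)
N_FOLDS = 10      # stratified 10-fold cross-validation


def filter_records(records, max_chars=MAX_CHARS):
    """Keep records whose context plus question fit within max_chars characters.

    Assumption: the combined length is the plain sum of both character counts.
    """
    return [
        r for r in records
        if len(r["context"]) + len(r["question"]) <= max_chars
    ]


def stratified_folds(records, key, n_folds=N_FOLDS):
    """Spread each stratum (as defined by `key`) evenly over n_folds folds.

    `key` is a hypothetical caller-supplied function; the source does not
    state which property was used for stratification.
    """
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[key(record)].append(record)

    folds = [[] for _ in range(n_folds)]
    position = 0
    # Deal records round-robin, stratum by stratum, so every fold receives
    # a near-equal share of each stratum.
    for stratum in sorted(by_stratum):
        for record in by_stratum[stratum]:
            folds[position % n_folds].append(record)
            position += 1
    return folds
```

Each fold then serves once as the held-out test set while the remaining nine are used for fine-tuning.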
28
+ ## Fine-Tuning Hyperparameters
29
+ The following hyperparameters were used during the fine-tuning process:
30
+
31
+ - **Learning Rate:** 5e-05
32
+ - **Training Batch Size:** 64 sequences
33
+ - **Evaluation Batch Size:** 64 sequences
34
+ - **Seed:** 42
35
+ - **Optimizer:** Adam (default)
36
+ - **Number of Epochs:** 5
37
+
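For readers who want to reproduce a similar setup, the values above map onto keyword arguments of `transformers.TrainingArguments` as sketched below. Whether the authors used the `Trainer` API at all is an assumption, as is treating the batch size as a per-device value; the optimizer is left at the library default, which the list above reports as "Adam (default)".

```python
# Hyperparameters from the list above, keyed by the corresponding
# transformers.TrainingArguments keyword names.
# Assumption: the batch size of 64 is interpreted as per-device.
FINE_TUNING_CONFIG = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "seed": 42,
    "num_train_epochs": 5,
}


def build_training_arguments(output_dir):
    """Build TrainingArguments from the config (requires transformers).

    `output_dir` is a hypothetical path chosen by the caller.
    """
    # Imported lazily so the config dict is usable without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **FINE_TUNING_CONFIG)
```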
38
+ ## Evaluation Metrics
39
+
40
+ The model performance was assessed using both token-level and text-level metrics:
41
+
42
+ - **Token-Level Metrics:**
43
+   - **Precision**
44
+   - **Recall**
45
+   - **F1-Score:** Measures how accurately the model identified the correct answer tokens within the context.
46
+
47
+ - **Text-Level Metrics:**
48
+   - **Levenshtein Distance:** Evaluates the similarity between the predicted and correct answers.
49
+   - **Exact Match:** Measures the percentage of answers where the predicted answer exactly matched the correct one.
50
+
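The metrics listed above could be computed along the following lines. This is a sketch under stated assumptions rather than the evaluation code used for this model: token-level precision, recall, and F1 are computed SQuAD-style from the multiset overlap of tokens, and the Levenshtein score is normalized to a similarity in [0, 1] by dividing the edit distance by the length of the longer string. The source specifies neither detail.

```python
from collections import Counter


def token_prf(pred_tokens, gold_tokens):
    """Token-level precision, recall and F1 from multiset token overlap."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_match(pred, gold):
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(pred.strip() == gold.strip())
```

As an illustration with a hypothetical reference answer `1921`, the prediction `v roku 1921` has token-level precision 1/3, recall 1, and F1 0.5, while its exact match score is 0.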
51
+ ## Model Performance
52
+
53
+ The model achieved the following median performance across the 10 cross-validation folds:
54
+
55
+ - **F1-Score:** 0.6768
56
+ - **Levenshtein Distance:** 0.6535
57
+ - **Exact Match:** 0.3791
58
+
59
+ ## Model Usage
60
+
61
+ This model is suitable for extractive question answering over Slovak text, particularly for applications that need to extract a precise answer span from a given context.
62
+
63
+ ### Example Usage
64
+
65
+ Below is an example of how to use the fine-tuned `SK_Morph_BLM-qa` model in a Python script:
66
+
67
+ ```python
+ import sys
+
+ import torch
+ from huggingface_hub import snapshot_download
+ from torch.nn.functional import softmax
+ from transformers import RobertaForQuestionAnswering
+
+
+ class QuestionAnsweringModel:
+     def __init__(self, model, tokenizer):
+         self.model = RobertaForQuestionAnswering.from_pretrained(model)
+         self.model.eval()  # inference only: disable dropout
+
+         # The custom tokenizer is imported from the downloaded repository,
+         # so fetch the repo and put it on the import path.
+         repo_path = snapshot_download(repo_id=tokenizer)
+         sys.path.append(repo_path)
+
+         from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer
+         self.tokenizer = SKMorfoTokenizer()
+
+     def decode(self, tensor):
+         # Join the subword tokens and turn the "Ġ" marker back into spaces.
+         result = "".join(self.tokenizer.convert_list_ids_to_tokens(tensor.tolist()))
+         return result.replace("Ġ", " ").strip()
+
+     def predict(self, context, question):
+         inputs = self.tokenizer.tokenizeQA(context, question, max_length=256, return_tensors="pt", return_subword=False)
+         input_ids = inputs["input_ids"][0]
+
+         with torch.no_grad():  # no gradients are needed for prediction
+             outputs = self.model(**inputs)
+
+         # Probabilities over token positions for the answer start and end.
+         start_probs = softmax(outputs.start_logits, dim=1)
+         end_probs = softmax(outputs.end_logits, dim=1)
+
+         # Most likely start and (exclusive) end position, chosen independently.
+         answer_start = torch.argmax(start_probs)
+         answer_end = torch.argmax(end_probs) + 1
+
+         answer = self.decode(input_ids[answer_start:answer_end])
+
+         start_prob = start_probs[0, answer_start].item()
+         end_prob = end_probs[0, answer_end - 1].item()
+
+         return answer, start_prob, end_prob
+
+
+ # Instantiate the QA model with the specified tokenizer and model
+ qa_model = QuestionAnsweringModel(tokenizer="daviddrzik/SK_Morph_BLM", model="daviddrzik/SK_Morph_BLM-qa")
+
+ context = "Albert Einstein, narodený v roku 1879, je jedným z najvplyvnejších fyzikov všetkých čias. Vyvinul teóriu relativity, ktorá zmenila naše chápanie priestoru, času a gravitácie. Jeho slávna rovnica E = mc², ktorá vyjadruje vzťah medzi energiou a hmotou, je považovaná za jednu z najvýznamnejších rovníc vo fyzike. Einstein získal Nobelovu cenu za fyziku v roku 1921 za jeho prácu na fotoelektrickom jave, ktorý bol kľúčový pre rozvoj kvantovej mechaniky."
+ question = "V ktorom roku získal Albert Einstein Nobelovu cenu za fyziku?"
+
+ print("\nContext: " + context + "\n")
+ print("Question: " + question + "\n")
+
+ # Predict the answer; the result is a tuple (answer text, start probability, end probability)
+ answer = qa_model.predict(context, question)
+ print(f"Predicted answer: {answer}")
+ ```
124
+
125
+ ### Example Output
126
+ Here is the output when running the above example:
127
+ ```yaml
128
+ Context: Albert Einstein, narodený v roku 1879, je jedným z najvplyvnejších fyzikov všetkých čias.
129
+ Vyvinul teóriu relativity, ktorá zmenila naše chápanie priestoru, času a gravitácie.
130
+ Jeho slávna rovnica E = mc², ktorá vyjadruje vzťah medzi energiou a hmotou,
131
+ je považovaná za jednu z najvýznamnejších rovníc vo fyzike.
132
+ Einstein získal Nobelovu cenu za fyziku v roku 1921 za jeho prácu na fotoelektrickom jave,
133
+ ktorý bol kľúčový pre rozvoj kvantovej mechaniky.
134
+
135
+ Question: V ktorom roku získal Albert Einstein Nobelovu cenu za fyziku?
136
+
137
+ Predicted answer: ('v roku 1921', 0.7977392673492432, 0.9985119700431824)
138
+ ```
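The usage example picks the start and end positions independently with `argmax`, which can occasionally yield an end position before the start and therefore an empty answer. One common alternative is to score start and end jointly under the constraint that the span is valid. The sketch below illustrates this on plain Python lists of scores (for example logits or log-probabilities); it is not part of the published model code, and `max_answer_len=30` is a hypothetical default.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) for the best valid span, or None if empty.

    `end` is inclusive. A span is valid when start <= end and it covers
    at most max_answer_len tokens.
    """
    best = None
    for start, start_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Independent argmax would pick start=1 and end=0 here, an invalid span;
# the joint search returns the best span with start <= end instead.
span = best_span([0.1, 2.0, 0.3, 0.0], [3.0, 0.2, 1.5, 0.1])
```

Because `end` is inclusive here, the answer tokens would be `input_ids[start:end + 1]`, matching the `+ 1` applied to `answer_end` in the example.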