daviddrzik
committed on
Update README.md
README.md
CHANGED
@@ -1,3 +1,138 @@
---
license: mit
language:
- sk
pipeline_tag: question-answering
library_name: transformers
metrics:
- f1
- levenshtein
- exact match
base_model: daviddrzik/SK_Morph_BLM
tags:
- question-answering
- sk-quad
datasets:
- TUKE-DeutscheTelekom/skquad
---

# Fine-Tuned Question Answering Model - SK_Morph_BLM (SK-QuAD Dataset)

## Model Overview
This model is a fine-tuned version of the [SK_Morph_BLM model](https://huggingface.co/daviddrzik/SK_Morph_BLM) for extractive question answering. It was fine-tuned on the [SK-QuAD dataset](https://nlp.kemt.fei.tuke.sk/language/skquad), the first manually annotated question answering dataset for Slovak, which contains over 91,000 questions and answers. The dataset includes clearly answerable questions, unanswerable questions, and plausible but probably incorrect answers.

## Dataset Details
For fine-tuning, we used only the records with clearly answerable questions. The original dataset is divided into training and test sets; we combined them into a single dataset for our research. Some records have long contexts that, together with the question, exceed the context window of our model. We therefore excluded all records in which the combined length of the context and question exceeded 1,300 characters, which corresponds to approximately 256 tokens. After this filtering, the final dataset contains **54,319** question-answer pairs.
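
The length filter described above can be sketched as follows. This is a minimal illustration and not the authors' preprocessing code: the field names `context` and `question` and the sample records are assumptions made for the example.

```python
MAX_CHARS = 1300  # threshold stated above; roughly 256 tokens


def within_limit(record, max_chars=MAX_CHARS):
    """Return True if context and question together fit the character budget."""
    return len(record["context"]) + len(record["question"]) <= max_chars


# Hypothetical records; the field names are assumptions, not the SK-QuAD schema.
records = [
    {"context": "a" * 1200, "question": "b" * 50},  # 1250 characters, kept
    {"context": "a" * 1290, "question": "b" * 50},  # 1340 characters, dropped
]

kept = [r for r in records if within_limit(r)]
print(len(kept))  # 1
```

Filtering on characters rather than tokens keeps the step independent of the tokenizer, at the cost of the character-to-token ratio being only approximate.
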

To ensure robust evaluation, we applied stratified 10-fold cross-validation across the dataset. This allowed us to rigorously assess the model's performance and how well it generalizes across different subsets of the data.
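
The README does not say which variable the folds were stratified on, so the sketch below takes an arbitrary list of stratum labels. It is a plain standard-library illustration of stratified k-fold splitting, not the authors' code; the `strata` labels (answer-length buckets) and the reuse of seed 42 are assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(strata, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs with a similar stratum mix in every fold."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        # Deal the members of each stratum round-robin across the folds.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in range(k) if f != i for j in folds[f])
        yield train_idx, test_idx


# Hypothetical stratum labels, for example answer-length buckets.
strata = ["short"] * 60 + ["long"] * 40
splits = list(stratified_kfold(strata))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

Each record lands in exactly one test fold, so every record is used for evaluation once and for training nine times.
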

## Fine-Tuning Hyperparameters
The following hyperparameters were used during the fine-tuning process:

- **Learning Rate:** 5e-05
- **Training Batch Size:** 64 sequences
- **Evaluation Batch Size:** 64 sequences
- **Seed:** 42
- **Optimizer:** Adam (default)
- **Number of Epochs:** 5
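
As a rough worked example, these settings imply the following number of optimizer steps per cross-validation fold. The calculation assumes one optimizer step per batch, no gradient accumulation, a single device, and that the last partial batch is kept; none of this is stated in the README.

```python
import math

n_examples = 54_319  # filtered dataset size reported above
k_folds = 10
batch_size = 64
epochs = 5

# In each fold, one tenth of the data is held out and the rest is used for training.
largest_test_fold = math.ceil(n_examples / k_folds)   # 5432
train_size = n_examples - largest_test_fold           # 48887
steps_per_epoch = math.ceil(train_size / batch_size)  # 764
total_steps = steps_per_epoch * epochs                # 3820

print(train_size, steps_per_epoch, total_steps)
```
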

## Evaluation Metrics

The model performance was assessed using both token-level and text-level metrics:

- **Token-Level Metrics:**
  - **Precision**
  - **Recall**
  - **F1-Score:** Measures how accurately the model identified the correct answer tokens within the context.

- **Text-Level Metrics:**
  - **Levenshtein Distance:** Evaluates the similarity between the predicted and correct answers.
  - **Exact Match:** Measures the percentage of answers where the predicted answer exactly matched the correct one.
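
The metrics listed above could be computed along the following lines. This is a sketch under assumptions, not the authors' evaluation code: token-level scores are taken over token positions of the predicted and reference spans, and the Levenshtein score is normalized to a similarity in [0, 1]. The README does not confirm either choice.

```python
def token_prf(pred_span, gold_span):
    """Precision, recall and F1 over token positions; spans are (start, end), end exclusive."""
    pred = set(range(*pred_span))
    gold = set(range(*gold_span))
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_match(pred, gold):
    return float(pred.strip() == gold.strip())


# Hypothetical case: the prediction covers the reference plus one extra token.
precision, recall, f1 = token_prf((3, 6), (4, 6))
similarity = levenshtein_similarity("v roku 1921", "1921")
match = exact_match("v roku 1921", "1921")
```

In this hypothetical case the prediction still earns partial token-level credit (F1 of 0.8) while exact match is 0, which illustrates how exact match can sit well below F1.
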

## Model Performance

The model achieved the following median performance metrics:

- **F1-Score:** 0.6768
- **Levenshtein Distance:** 0.6535
- **Exact Match:** 0.3791

## Model Usage

This model is suitable for extractive question answering tasks in Slovak text, particularly for applications that require the identification of precise answers from a given context.

### Example Usage

Below is an example of how to use the fine-tuned `SK_Morph_BLM-qa` model in a Python script:

```python
import sys

import torch
from torch.nn.functional import softmax
from transformers import RobertaForQuestionAnswering
from huggingface_hub import snapshot_download


class QuestionAnsweringModel:
    def __init__(self, model, tokenizer):
        self.model = RobertaForQuestionAnswering.from_pretrained(model)

        # The custom tokenizer is loaded from the tokenizer repository,
        # so download the repository and make it importable.
        repo_path = snapshot_download(repo_id=tokenizer)
        sys.path.append(repo_path)

        from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer
        self.tokenizer = SKMorfoTokenizer()

    def decode(self, tensor):
        # Join the subword tokens and turn the "Ġ" word-boundary marker back into spaces.
        result = "".join(self.tokenizer.convert_list_ids_to_tokens(tensor.tolist()))
        result = result.replace("Ġ", " ").strip()
        return result

    def predict(self, context, question):
        inputs = self.tokenizer.tokenizeQA(context, question, max_length=256, return_tensors="pt", return_subword=False)
        input_ids = inputs["input_ids"][0]

        # Inference only, so gradients are not needed.
        with torch.no_grad():
            outputs = self.model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

        start_probs = softmax(start_logits, dim=1)
        end_probs = softmax(end_logits, dim=1)

        # Most likely start and end positions; the end index is made exclusive for slicing.
        answer_start = torch.argmax(start_probs)
        answer_end = torch.argmax(end_probs) + 1

        answer = self.decode(input_ids[answer_start:answer_end])

        start_prob = start_probs[0, answer_start].item()
        end_prob = end_probs[0, answer_end - 1].item()

        return answer, start_prob, end_prob

# Instantiate the QA model with the specified tokenizer and model
qa_model = QuestionAnsweringModel(tokenizer="daviddrzik/SK_Morph_BLM", model="daviddrzik/SK_Morph_BLM-qa")

context = "Albert Einstein, narodený v roku 1879, je jedným z najvplyvnejších fyzikov všetkých čias. Vyvinul teóriu relativity, ktorá zmenila naše chápanie priestoru, času a gravitácie. Jeho slávna rovnica E = mc², ktorá vyjadruje vzťah medzi energiou a hmotou, je považovaná za jednu z najvýznamnejších rovníc vo fyzike. Einstein získal Nobelovu cenu za fyziku v roku 1921 za jeho prácu na fotoelektrickom jave, ktorý bol kľúčový pre rozvoj kvantovej mechaniky."
question = "V ktorom roku získal Albert Einstein Nobelovu cenu za fyziku?"

print("\nContext: " + context + "\n")
print("Question: " + question + "\n")

# Predict the answer; the result is a tuple (answer text, start probability, end probability)
answer = qa_model.predict(context, question)
print(f"Predicted answer: {answer}")
```

### Example Output

Here is the output when running the above example:

```text
Context: Albert Einstein, narodený v roku 1879, je jedným z najvplyvnejších fyzikov všetkých čias.
Vyvinul teóriu relativity, ktorá zmenila naše chápanie priestoru, času a gravitácie.
Jeho slávna rovnica E = mc², ktorá vyjadruje vzťah medzi energiou a hmotou,
je považovaná za jednu z najvýznamnejších rovníc vo fyzike.
Einstein získal Nobelovu cenu za fyziku v roku 1921 za jeho prácu na fotoelektrickom jave,
ktorý bol kľúčový pre rozvoj kvantovej mechaniky.

Question: V ktorom roku získal Albert Einstein Nobelovu cenu za fyziku?

Predicted answer: ('v roku 1921', 0.7977392673492432, 0.9985119700431824)
```
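
The example picks the most likely start and end positions independently, so the end can occasionally fall before the start and produce an empty answer. A common alternative is to search for the best pair with the start not after the end. The sketch below shows that idea on plain Python lists; `best_span` and `max_answer_len` are hypothetical names and are not part of the model or tokenizer API.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start + end score with start <= end.

    The end index is inclusive; max_answer_len caps the span length in tokens.
    """
    best = None
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy scores where independent argmax would give start=2 and end=1 (an empty span).
start_scores = [0.1, 0.2, 5.0, 0.3]
end_scores = [0.2, 4.0, 0.1, 3.5]

span = best_span(start_scores, end_scores)
print(span[:2])  # (2, 3)
```

To apply this to the model outputs, the logits for a single example could be converted with `start_logits[0].tolist()` and `end_logits[0].tolist()` before calling the function.
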