|
# Turkish Question Answering model based on mt0-large |
|
I fine-tuned the *mt0-large* model with the following Turkish QA dataset, whose training set contains around 11K question-answer pairs.
|
|
|
* https://huggingface.co/bigscience/mt0-large |
|
* https://github.com/okanvk/Turkish-Reading-Comprehension-Question-Answering-Dataset |
|
|
|
The model was tuned with LoRA, one of the parameter-efficient fine-tuning (PEFT) techniques, so the peft library must be installed to use it; a sketch of a typical LoRA setup is shown after the link below. Please check
|
* https://github.com/huggingface/peft |
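
For reference, a typical LoRA setup for a seq2seq model such as mt0-large looks like the sketch below. The exact hyperparameters used for this checkpoint are not published here, so the `r`, `lora_alpha`, and `lora_dropout` values are illustrative assumptions.

```
# Minimal LoRA fine-tuning sketch; r, lora_alpha, and lora_dropout are
# illustrative assumptions, not the exact values used for this checkpoint.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # encoder-decoder (seq2seq) task
    r=8,               # rank of the LoRA update matrices (assumption)
    lora_alpha=32,     # scaling factor (assumption)
    lora_dropout=0.1,  # dropout applied to the LoRA layers (assumption)
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```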
|
|
|
|
|
# INFERENCE
|
|
|
We first install the requirements:
|
```
!pip install peft transformers  # in a Jupyter notebook
```
|
|
|
## SINGLE INFERENCE |
|
|
|
```
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_path = "savasy/mt0-large-Turkish-qa"

# Load the adapter config and the base model it was trained on (mt0-large)
config = PeftConfig.from_pretrained(peft_model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)

# Tokenizer of the base model
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Wrap the base model with the LoRA adapter
inference_model = PeftModel.from_pretrained(model, peft_model_path)
inference_model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
inference_model.to(device)

# The input is the context followed by the question
test_input = '''
Mustafa adını babası Ali Rıza Efendi kendi dedesinin adı olduğundan dolayı vermiştir. Çünkü Ali Rıza Efendi'nin babasının adı olan
Ahmed adı ağabeylerinden birisine verilmişti. Mustafa'ya neden Kemal isminin verildiğine yönelik ise çeşitli iddialar vardır.
Afet İnan, bu ismi ona matematik öğretmeni Üsküplü Mustafa Efendi'nin Kemal adının anlamında olduğu gibi onun "mükemmel ve olgun"
olduğunu göstermek için verdiğini söylemiştir. (source: wikipedia)
Mustafa'nın dedesinin ismi nedir ?
'''

with torch.no_grad():
    inputs = tokenizer(test_input, return_tensors="pt", padding=True).to(device)
    generated_ids = inference_model.generate(**inputs)
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(outputs)
# -> ['Ahmed']
```
|
|
|
## BATCH INFERENCE |
|
|
|
The usage for batch mode is as follows: |
|
|
|
```
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_path = "savasy/mt0-large-Turkish-qa"

# Load the adapter config and the base model it was trained on (mt0-large)
config = PeftConfig.from_pretrained(peft_model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)

# Tokenizer of the base model
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Wrap the base model with the LoRA adapter
inference_model = PeftModel.from_pretrained(model, peft_model_path)
inference_model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
inference_model.to(device)

test_inputs = ["", "", "", ""]  # a list of texts; each text must be the context followed by the question

preds = []  # predictions
data_loader = DataLoader(test_inputs, batch_size=8)

with torch.no_grad():
    for batch in tqdm(data_loader):
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        generated_ids = inference_model.generate(**inputs)
        outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        preds += outputs
```
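
Each string in *test_inputs* must contain the context followed by the question, exactly as in the single-inference example above. A trivial helper can build such inputs; `make_input`, `contexts`, and `questions` below are hypothetical names used for illustration.

```
# Hypothetical helper: joins a context and a question into one input string,
# mirroring the context-then-question layout of the single-inference example.
def make_input(context: str, question: str) -> str:
    return f"{context}\n{question}"

contexts = ["Mustafa Kemal Atatürk 1881 yılında Selanik'te doğdu."]
questions = ["Atatürk nerede doğdu ?"]
test_inputs = [make_input(c, q) for c, q in zip(contexts, questions)]
```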
|
In the end, you can compare *preds* (the predictions) against your ground-truth answers, using metrics such as BLEU, ROUGE, Exact Match, or Partial Match.
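
A minimal sketch of Exact Match and a token-overlap Partial Match is shown below; `golds` is a hypothetical list of ground-truth answers aligned with *preds*, and libraries such as `evaluate` can supply BLEU and ROUGE.

```
# Minimal evaluation sketch; `golds` is a hypothetical list of ground-truth
# answers aligned one-to-one with `preds`.
def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def partial_match(pred: str, gold: str) -> float:
    # fraction of the gold answer's tokens that appear in the prediction
    pred_tokens = set(pred.lower().split())
    gold_tokens = gold.lower().split()
    if not gold_tokens:
        return 0.0
    return sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)

golds = ["..."]  # your ground-truth answers
em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
pm = sum(partial_match(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"Exact Match: {em:.3f}  Partial Match: {pm:.3f}")
```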
|
|
|
|