SaraPiscitelli/roberta-base-qa-v1

@@ -22,38 +22,107 @@ You can access the training code [here](https://github.com/sarapiscitelli/nlp-ta
 - **Language(s) (NLP):** English
 - **License:** Apache 2.0
 - **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)
 ### Model Sources
 - **training code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py)
 - **evaluation code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py).
 ## Training Details
 ### Training Data
-Train Dataset({
-    features: ['id', 'title', 'context', 'question', 'answers'],
-    num_rows: 8207
-})
-Eval dataset:
-Dataset({
-    features: ['id', 'title', 'context', 'question', 'answers'],
-    num_rows: 637
-})
-Dataset:
 squad = load_dataset("squad")
 squad['train'] = squad['train'].select(range(30000))
 squad['test'] = squad['validation']
 squad['validation'] = squad['validation'].select(range(2000))
-### Training Procedure
 #### Preprocessing
-max-tokens-length = 512
 #### Training Hyperparameters
@@ -88,34 +157,40 @@ max-tokens-length = 512
     bf16=False
 )
-## Evaluation
-Evaluation Dataset:
-Dataset({
-    features: ['id', 'title', 'context', 'question', 'answers'],
-    num_rows: 10570
-})
-Max Tokens Length:
-512
-Evaluation Metrics:
-{'exact': 66.00660066006601, 'f1': 78.28040573606134, 'total': 909, 'HasAns_exact': 66.00660066006601, 'HasAns_f1': 78.28040573606134, 'HasAns_total': 909, 'best_exact': 66.00660066006601, 'best_exact_thresh': 0.0, 'best_f1': 78.28040573606134, 'best_f1_thresh': 0.0}
 ### Testing Data, Factors & Metrics
 #### Testing Data
 squad = load_dataset("squad")
-squad['test'] = squad['validation']
-Dataset({
-    features: ['id', 'title', 'context', 'question', 'answers'],
-    num_rows: 10570
 })
 #### Metrics
 metric_eval = evaluate.load("squad_v2")
 ### Results
-{'exact': 66.00660066006601, 'f1': 78.28040573606134, 'total': 909, 'HasAns_exact': 66.00660066006601, 'HasAns_f1': 78.28040573606134, 'HasAns_total': 909, 'best_exact': 66.00660066006601, 'best_exact_thresh': 0.0, 'best_f1': 78.28040573606134, 'best_f1_thresh': 0.0}

 - **Language(s) (NLP):** English
 - **License:** Apache 2.0
 - **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)
+- **Maximum input tokens:** 512
 ### Model Sources
 - **training code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py)
 - **evaluation code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py).
+## Uses
+The model can be used for extractive question answering, where both the context and the question are provided.
+### Recommendations
+This is a basic baseline model, so some answers may be inaccurate.
+Refer to the evaluation metrics below for a better understanding of its performance.
+## How to Get Started with the Model
+You can use the Hugging Face pipeline:
+```
+# Use a pipeline as a high-level helper
+from transformers import pipeline
+qa_model = pipeline("question-answering", model="SaraPiscitelli/roberta-base-qa-v1")
+question = "Which name is also used to describe the Amazon rainforest in English?"
+context = "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain \"Amazonas\" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
+print(qa_model(question=question, context=context))
+```
+or load it directly:
+```
+import torch
+from typing import List, Optional
+from transformers import AutoModelForQuestionAnswering, AutoTokenizer
+class InferenceModel:
+    def __init__(self, model_name_or_checkpoint_path: str,
+                 tokenizer_name: Optional[str] = None,
+                 device_type: Optional[str] = None) -> None:
+        # Fall back to the model's own tokenizer when none is given.
+        if tokenizer_name is None:
+            tokenizer_name = model_name_or_checkpoint_path
+        # Prefer CUDA, then Apple MPS, then CPU.
+        if device_type is None:
+            device_type = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+        self.model = AutoModelForQuestionAnswering.from_pretrained(model_name_or_checkpoint_path, device_map=device_type)
+        self.model.eval()
+        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+    def inference(self, questions: List[str], contexts: List[str]) -> List[str]:
+        inputs = self.tokenizer(questions, contexts,
+                                padding="longest",
+                                return_tensors="pt").to(self.model.device)
+        with torch.no_grad():
+            logits = self.model(**inputs)
+        # logits.start_logits.shape == (batch_size, input_length) = inputs['input_ids'].shape
+        # logits.end_logits.shape == (batch_size, input_length) = inputs['input_ids'].shape
+        answer_start_index: List[int] = logits.start_logits.argmax(dim=-1).tolist()
+        answer_end_index: List[int] = logits.end_logits.argmax(dim=-1).tolist()
+        answer_tokens: List[str] = [self.tokenizer.decode(inputs.input_ids[i, answer_start_index[i] : answer_end_index[i] + 1])
+                                    for i in range(len(questions))]
+        return answer_tokens
+model = InferenceModel("SaraPiscitelli/roberta-base-qa-v1")
+question = "Which name is also used to describe the Amazon rainforest in English?"
+context = "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain \"Amazonas\" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
+print(model.inference(questions=[question], contexts=[context])[0])
+```
+In both cases, the printed output contains the answer: "Amazonia or the Amazon Jungle"
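The direct-loading example takes the argmax of the start and end logits independently, which can occasionally produce an end index that comes before the start index. A minimal sketch of one common alternative is shown here: pick the pair with the highest combined score subject to `start <= end`. The helper name `best_span` and the parameters `max_answer_len` and `top_k` are assumptions for illustration, not part of this model's code, and the sketch does not exclude question tokens from the candidates.

```python
from typing import List, Tuple

def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30, top_k: int = 20) -> Tuple[int, int]:
    """Return the (start, end) pair with the highest summed logit and start <= end.

    Hypothetical helper: falls back to (0, 0) if no valid pair is found
    among the top_k candidates.
    """
    # Restrict the search to the top_k highest-scoring positions on each side.
    starts = sorted(range(len(start_logits)),
                    key=lambda i: start_logits[i], reverse=True)[:top_k]
    ends = sorted(range(len(end_logits)),
                  key=lambda i: end_logits[i], reverse=True)[:top_k]
    best, best_score = (0, 0), float("-inf")
    for s in starts:
        for e in ends:
            # Skip spans that end before they start or are too long.
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Independent argmax would give start=1, end=0 here, which is not a valid span.
start = [0.1, 5.0, 0.2, 0.3]
end = [4.0, 0.1, 3.0, 0.2]
print(best_span(start, end))
```

Inside `InferenceModel.inference`, such a helper could be applied per example to `logits.start_logits[i].tolist()` and `logits.end_logits[i].tolist()` before decoding.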
 ## Training Details
 ### Training Data
+- [SQuAD dataset](https://huggingface.co/datasets/squad).
+To retrieve the dataset, use the following code:
+```
+from datasets import load_dataset
 squad = load_dataset("squad")
 squad['train'] = squad['train'].select(range(30000))
 squad['test'] = squad['validation']
 squad['validation'] = squad['validation'].select(range(2000))
+```
+The splits used after preprocessing are listed below:
+- Train Dataset({
+      features: ['id', 'title', 'context', 'question', 'answers'],
+      num_rows: 8207
+  })
+- Validation dataset({
+      features: ['id', 'title', 'context', 'question', 'answers'],
+      num_rows: 637
+  })
 #### Preprocessing
+All samples with **more than 512 tokens have been removed**.
+This was necessary due to the maximum input token limit accepted by the RoBERTa-base model.
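As a rough illustration of this kind of length filter, the sketch below drops samples whose question plus context exceed a token budget. It is not the author's preprocessing code (see the training script linked under Model Sources for that). The names `keep_short_samples` and `whitespace_count` are hypothetical, and the whitespace counter is only a stand-in: in practice the count would come from the model tokenizer, for example `len(tokenizer(question, context)["input_ids"])`.

```python
from typing import Callable, Dict, List

MAX_TOKENS = 512  # limit stated in this model card

def keep_short_samples(samples: List[Dict[str, str]],
                       count_tokens: Callable[[str, str], int],
                       max_tokens: int = MAX_TOKENS) -> List[Dict[str, str]]:
    """Keep only samples whose question + context fit within max_tokens."""
    return [s for s in samples
            if count_tokens(s["question"], s["context"]) <= max_tokens]

def whitespace_count(question: str, context: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(question.split()) + len(context.split())

samples = [
    {"question": "Who?", "context": "short context"},
    {"question": "What?", "context": "word " * 600},
]
kept = keep_short_samples(samples, whitespace_count)
print(len(kept))  # 1
```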
 #### Training Hyperparameters
     bf16=False
 )
 ### Testing Data, Factors & Metrics
 #### Testing Data
+To retrieve the dataset, use the following code:
+```
+from datasets import load_dataset
 squad = load_dataset("squad")
+squad['test'] = squad['validation']
+```
+Test Dataset({
+    features: ['id', 'title', 'context', 'question', 'answers'],
+    num_rows: 10570
 })
 #### Metrics
+The model was evaluated with the standard SQuAD metric, loaded here in its `squad_v2` variant:
+```
+import evaluate
 metric_eval = evaluate.load("squad_v2")
+```
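For intuition about the `exact` and `f1` values reported under Results, here is a simplified sketch of SQuAD-style scoring for a single prediction against a single reference. It follows the usual normalization (lowercasing, removing punctuation and the articles a/an/the, collapsing whitespace) and is an approximation for illustration only. The `evaluate` implementation is the authoritative one; it also handles multiple reference answers and averages over the dataset on a 0 to 100 scale.

```python
import collections
import re
import string
from typing import List

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def get_tokens(text: str) -> List[str]:
    return normalize_answer(text).split()

def exact_match(gold: str, pred: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(gold) == normalize_answer(pred))

def f1_score(gold: str, pred: str) -> float:
    """Token-level F1 between the normalized gold and predicted answers."""
    gold_toks, pred_toks = get_tokens(gold), get_tokens(pred)
    if not gold_toks or not pred_toks:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Amazon Jungle", "amazon jungle!"))
print(f1_score("Amazonia or the Amazon Jungle", "the Amazon Jungle"))
```

In the second call the prediction covers two of the four normalized gold tokens, so precision is 1.0 and recall is 0.5.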
 ### Results
+{'exact-match': 66.00660066006601,
+'f1': 78.28040573606134,
+'total': 909,
+'HasAns_exact': 66.00660066006601,
+'HasAns_f1': 78.28040573606134,
+'HasAns_total': 909,
+'best_exact': 66.00660066006601,
+'best_exact_thresh': 0.0,
+'best_f1': 78.28040573606134,
+'best_f1_thresh': 0.0}