SaraPiscitelli committed on
Commit
47c836f
1 Parent(s): ac472cf

Update README.md

  1. README.md +105 -30
README.md CHANGED
@@ -22,38 +22,107 @@ You can access the training code [here](https://github.com/sarapiscitelli/nlp-ta
22
  - **Language(s) (NLP):** English
23
  - **License:** Apache 2.0
24
  - **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)
 
25
 
26
  ### Model Sources
27
 
28
  - **training code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py)
29
  - **evaluation code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py).
30
 
31
  ## Training Details
32
 
33
  ### Training Data
 
 
 
 
34
 
35
- Train Dataset({
36
- features: ['id', 'title', 'context', 'question', 'answers'],
37
- num_rows: 8207
38
- })
39
-
40
- Eval dataset:
41
- Dataset({
42
- features: ['id', 'title', 'context', 'question', 'answers'],
43
- num_rows: 637
44
- })
45
-
46
- Dataset:
47
  squad = load_dataset("squad")
48
  squad['train'] = squad['train'].select(range(30000))
49
  squad['test'] = squad['validation']
50
  squad['validation'] = squad['validation'].select(range(2000))
 
51
 
52
- ### Training Procedure
53
 
54
  #### Preprocessing
55
 
56
- max-tokens-length = 512
 
57
 
58
  #### Training Hyperparameters
59
 
@@ -88,34 +157,40 @@ max-tokens-length = 512
88
  bf16=False
89
  )
90
 
91
- ## Evaluation
92
-
93
- Evaluation Dataset:
94
- Dataset({
95
- features: ['id', 'title', 'context', 'question', 'answers'],
96
- num_rows: 10570
97
- })
98
- Max Tokens Length:
99
- 512
100
- Evaluation Metrics:
101
- {'exact': 66.00660066006601, 'f1': 78.28040573606134, 'total': 909, 'HasAns_exact': 66.00660066006601, 'HasAns_f1': 78.28040573606134, 'HasAns_total': 909, 'best_exact': 66.00660066006601, 'best_exact_thresh': 0.0, 'best_f1': 78.28040573606134, 'best_f1_thresh': 0.0}
102
 
103
  ### Testing Data, Factors & Metrics
104
 
105
  #### Testing Data
 
 
 
106
 
107
  squad = load_dataset("squad")
108
- squad['test'] = squad['validation']
 
109
 
110
- Dataset({
111
- features: ['id', 'title', 'context', 'question', 'answers'],
112
- num_rows: 10570
113
  })
114
 
115
  #### Metrics
116
 
 
 
 
117
  metric_eval = evaluate.load("squad_v2")
 
118
 
119
  ### Results
120
 
121
- {'exact': 66.00660066006601, 'f1': 78.28040573606134, 'total': 909, 'HasAns_exact': 66.00660066006601, 'HasAns_f1': 78.28040573606134, 'HasAns_total': 909, 'best_exact': 66.00660066006601, 'best_exact_thresh': 0.0, 'best_f1': 78.28040573606134, 'best_f1_thresh': 0.0}
 
22
  - **Language(s) (NLP):** English
23
  - **License:** Apache 2.0
24
  - **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)
25
+ - **Maximum input tokens:** 512
26
 
27
  ### Model Sources
28
 
29
  - **training code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py)
30
  - **evaluation code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py).
31
 
32
+ ## Uses
33
+ The model can be used for extractive question answering, where both the context and the question are provided.
34
+
35
+ ### Recommendations
36
+ This is a baseline model; some answers may be inaccurate.
37
+ Refer to the evaluation metrics for a better understanding of its performance.
38
+
39
+ ## How to Get Started with the Model
40
+
41
+ You can use the Hugging Face pipeline:
+ ```
+ # Use a pipeline as a high-level helper
+ from transformers import pipeline
+
+ qa_model = pipeline("question-answering", model="SaraPiscitelli/roberta-base-qa-v1")
+
+ question = "Which name is also used to describe the Amazon rainforest in English?"
+ context = "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain 'Amazonas' in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
+ print(qa_model(question=question, context=context))
+ ```
52
+ or load it directly:
+ ```
+ import torch
+
+ from typing import List, Optional
+ from transformers import AutoModelForQuestionAnswering, AutoTokenizer
+
+ class InferenceModel:
+
+     def __init__(self, model_name_or_checkpoint_path: str,
+                  tokenizer_name: Optional[str] = None,
+                  device_type: Optional[str] = None) -> None:
+         # Default to the tokenizer stored with the model checkpoint.
+         if tokenizer_name is None:
+             tokenizer_name = model_name_or_checkpoint_path
+         # Prefer CUDA, then Apple MPS, then CPU.
+         if device_type is None:
+             device_type = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+
+         self.model = AutoModelForQuestionAnswering.from_pretrained(model_name_or_checkpoint_path, device_map=device_type)
+         self.model.eval()
+         self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+
+     def inference(self, questions: List[str], contexts: List[str]) -> List[str]:
+         inputs = self.tokenizer(questions, contexts,
+                                 padding="longest",
+                                 return_tensors="pt").to(self.model.device)
+         with torch.no_grad():
+             outputs = self.model(**inputs)
+         # outputs.start_logits and outputs.end_logits both have shape
+         # (batch_size, input_length), the same as inputs['input_ids'].
+         answer_start_index: List[int] = outputs.start_logits.argmax(dim=-1).tolist()
+         answer_end_index: List[int] = outputs.end_logits.argmax(dim=-1).tolist()
+         # Decode the tokens between the predicted start and end positions (inclusive).
+         answer_tokens: List[str] = [self.tokenizer.decode(inputs.input_ids[i, answer_start_index[i] : answer_end_index[i] + 1])
+                                     for i in range(len(questions))]
+         return answer_tokens
+
+
+ model = InferenceModel("SaraPiscitelli/roberta-base-qa-v1")
+ question = "Which name is also used to describe the Amazon rainforest in English?"
+ context = "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain 'Amazonas' in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
+ print(model.inference(questions=[question], contexts=[context])[0])
+ ```
94
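+ In both cases the predicted answer is "Amazonia or the Amazon Jungle"; the pipeline returns it in the `answer` field of a dictionary that also contains `score`, `start`, and `end`.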
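
The direct-loading example takes the argmax of the start and end logits independently, so the predicted end can fall before the predicted start, which decodes to an empty string. The sketch below shows one common alternative: score every span with `start <= end` and keep the best one. The names `best_span` and `max_answer_len` are hypothetical and not part of this model's repository. For brevity the sketch does not mask question or padding tokens.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined
    score, considering only spans where start <= end and the span covers
    at most max_answer_len tokens."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_logits):
        # Only look at end positions at or after the start position.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Independent argmax would pick start=2 and end=0 here, an invalid span.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]
print(best_span(start_logits, end_logits))
```

With per-example logits converted to Python lists (for instance via `.tolist()`), the returned indices can be used in place of the two independent argmax values when slicing `input_ids`.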
95
+
96
  ## Training Details
97
 
98
  ### Training Data
99
+ - [squad dataset](https://huggingface.co/datasets/squad).
100
+ To retrieve the dataset, use the following code:
101
+ ```
102
+ from datasets import load_dataset
103
 
104
  squad = load_dataset("squad")
105
  squad['train'] = squad['train'].select(range(30000))
106
  squad['test'] = squad['validation']
107
  squad['validation'] = squad['validation'].select(range(2000))
108
+ ```
109
 
110
+ The dataset used after preprocessing is listed below:
111
 
112
+ - Train Dataset({
113
+ features: ['id', 'title', 'context', 'question', 'answers'],
114
+ num_rows: 8207
115
+ })
116
+
117
+ - Validation dataset({
118
+ features: ['id', 'title', 'context', 'question', 'answers'],
119
+ num_rows: 637
120
+ })
121
+
122
  #### Preprocessing
123
 
124
+ All samples with **more than 512 tokens have been removed**.
125
+ This was necessary due to the maximum input token limit accepted by the RoBERTa-base model.
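
The length filter described above can be sketched as follows. This is a minimal illustration, not the preprocessing from the linked training script: it assumes the limit applies to the combined question and context, and it uses a whitespace token counter as a stand-in for the RoBERTa tokenizer so that it runs without downloads. The names `keep_short_samples` and `whitespace_token_count` are hypothetical.

```python
from typing import Callable, Dict, List


def whitespace_token_count(question: str, context: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words.
    A real run would count the tokens produced by the model's tokenizer."""
    return len(question.split()) + len(context.split())


def keep_short_samples(samples: List[Dict[str, str]],
                       count_tokens: Callable[[str, str], int],
                       max_tokens: int = 512) -> List[Dict[str, str]]:
    """Drop every sample whose question + context exceeds max_tokens."""
    return [s for s in samples
            if count_tokens(s["question"], s["context"]) <= max_tokens]


samples = [
    {"question": "Who wrote it?", "context": "Alice wrote the report."},
    {"question": "Who wrote it?", "context": "word " * 600},
]
kept = keep_short_samples(samples, whitespace_token_count, max_tokens=512)
print(len(kept))
```

To apply the same idea with the actual tokenizer, `count_tokens` could be replaced by a function that returns the length of the `input_ids` the tokenizer produces for the question and context pair.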
126
 
127
  #### Training Hyperparameters
128
 
 
157
  bf16=False
158
  )
159
 
160
 
161
  ### Testing Data, Factors & Metrics
162
 
163
  #### Testing Data
164
+ To retrieve the dataset, use the following code:
165
+ ```
166
+ from datasets import load_dataset
167
 
168
  squad = load_dataset("squad")
169
+ squad['test'] = squad['validation']
170
+ ```
171
 
172
+ Test Dataset({
173
+ features: ['id', 'title', 'context', 'question', 'answers'],
174
+ num_rows: 10570
175
  })
176
 
177
  #### Metrics
178
 
179
+ The model was evaluated with the standard SQuAD metrics (exact match and F1):
180
+ ```
181
+ import evaluate
182
  metric_eval = evaluate.load("squad_v2")
183
+ ```
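
For intuition about the two headline numbers, the sketch below reimplements exact match and token-level F1 for a single prediction and a single reference, following the usual SQuAD answer normalization (lowercasing, removing punctuation and the articles "a", "an", "the", collapsing whitespace). It is an illustration only, not the `evaluate` implementation: the real metric also takes the maximum over several reference answers and reports percentages. The function names are hypothetical.

```python
import collections
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, F1 is 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "the Amazon Jungle"
reference = "Amazonia or the Amazon Jungle"
print(exact_match(prediction, reference), f1_score(prediction, reference))
```

A partially correct answer like the one above scores 0 on exact match but still earns partial F1 credit, which is why the reported F1 is higher than the exact-match score.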
184
 
185
  ### Results
186
 
187
+ {'exact-match': 66.00660066006601,
188
+ 'f1': 78.28040573606134,
189
+ 'total': 909,
190
+ 'HasAns_exact': 66.00660066006601,
191
+ 'HasAns_f1': 78.28040573606134,
192
+ 'HasAns_total': 909,
193
+ 'best_exact': 66.00660066006601,
194
+ 'best_exact_thresh': 0.0,
195
+ 'best_f1': 78.28040573606134,
196
+ 'best_f1_thresh': 0.0}