DeDeckerThomas committed
Commit e2808d7 · Parent(s): 23eccc5
Update README.md

README.md CHANGED
You can find more information here: https://huggingface.co/datasets/midas/inspec

## 👷♂️ Training procedure
For more detailed information, you can take a look at the training notebook (link incoming).

### Training parameters

| Parameter | Value |
| --------- | ----- |

The documents in the dataset are already preprocessed into lists of words with the corresponding labels. The only thing that must be done is tokenization and the realignment of the labels so that they correspond with the right subword tokens.

```python
def preprocess_fuction(all_samples_per_split):
    # ... (function body elided in this diff)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples
```
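
Since the body of `preprocess_fuction` is elided in this diff, here is a minimal sketch of the usual realignment step, assuming a fast 🤗 tokenizer whose `word_ids()` maps each subword token back to its word index; the checkpoint and the function name `tokenize_and_align_labels` are illustrative, not taken from the original notebook.

```python
# Minimal sketch, not the author's exact notebook code: realign word-level
# BIO labels with the subword tokens produced by a fast tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # illustrative checkpoint

def tokenize_and_align_labels(words, word_labels):
    tokenized = tokenizer(words, is_split_into_words=True, truncation=True)
    # Special tokens get -100 (ignored by the loss); every subword inherits
    # the label of the word it belongs to.
    tokenized["labels"] = [
        -100 if word_id is None else word_labels[word_id]
        for word_id in tokenized.word_ids()
    ]
    return tokenized

# A word that splits into several subwords repeats its label for each piece.
print(tokenize_and_align_labels(["keyphrase", "extraction"], [0, 1])["labels"])
```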

### Postprocessing
For the post-processing, you will need to filter out the B- and I-labeled tokens and concatenate the consecutive Bs and Is. Finally, you strip each keyphrase to ensure all leading and trailing spaces are removed.

```python
import numpy as np

# Define post_process functions.
# `idx2label` maps a predicted label id to its tag ("B", "I" or "O") and is
# assumed to be defined earlier.
def concat_tokens_by_tag(keyphrases):
    # A "B" token id starts a new keyphrase; an "I" token id extends the last one.
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the token ids whose predicted tag is "B" or "I".
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    # Concatenate consecutive B/I tokens, decode them back to text, strip the
    # surrounding whitespace, and deduplicate with np.unique.
    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
```
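
As a quick illustration, here is a sketch of running these functions end to end; the checkpoint name and the `idx2label` mapping below are placeholders for the example, not values taken from this README.

```python
# Usage sketch; the checkpoint and the id-to-tag mapping are assumptions.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "my-org/my-keyphrase-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
idx2label = {0: "B", 1: "I", 2: "O"}  # assumed mapping; check model.config.id2label

text = "Keyphrase extraction distills the most important phrases from a document."
example = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**example).logits
predictions = logits.argmax(dim=-1).tolist()  # one list of label ids per sequence

print(extract_keyphrases(example, predictions, tokenizer))
```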

## 📝 Evaluation results

One of the traditional evaluation methods is the precision, recall, and F1-score @k,m, where k is the cutoff for the first k predicted keyphrases and m is the average number of predicted keyphrases.
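
To make the @k cutoff concrete, here is a minimal sketch of precision, recall, and F1 at k for a single document; it assumes the standard set-overlap definition with exact matching, since the exact scoring script is not shown in this README.

```python
# Minimal sketch of P/R/F1 @k for one document, assuming the standard
# definition: truncate predictions to the top k and compare as sets.
def scores_at_k(predicted, gold, k):
    top_k = predicted[:k]                  # keep only the first k predictions
    matches = len(set(top_k) & set(gold))  # exact-match overlap with the gold keyphrases
    precision = matches / len(top_k) if top_k else 0.0
    recall = matches / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall > 0
        else 0.0
    )
    return precision, recall, f1

# Example: 2 of the first 3 predictions hit the 4 gold keyphrases
# -> P@3 = 0.67, R@3 = 0.50, F1@3 = 0.57.
print(scores_at_k(
    ["neural network", "keyphrase extraction", "dataset"],
    ["keyphrase extraction", "neural network", "bio tagging", "transformers"],
    k=3,
))
```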