Commit e2808d7 by DeDeckerThomas (1 parent: 23eccc5)

Update README.md

Files changed (1): README.md (+39 −1)
You can find more information here: https://huggingface.co/datasets/midas/inspec
## 👷‍♂️ Training procedure

For more detailed information, you can take a look at the training notebook (link incoming).
### Training parameters

| Parameter | Value |
| --------- | ----- |
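The table above is a placeholder. As a minimal, hypothetical sketch of how such parameters are typically wired into the 🤗 `Trainer` (every value and the base checkpoint below are illustrative placeholders, not this model's actual settings):

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder base checkpoint and label count (B, I, O); not this model's actual settings.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)

training_args = TrainingArguments(
    output_dir="keyphrase-extraction-model",
    learning_rate=1e-4,               # placeholder value
    num_train_epochs=50,              # placeholder value
    per_device_train_batch_size=16,   # placeholder value
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],       # assumed output of the preprocessing step below
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```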
### Preprocessing

The documents in the dataset are already preprocessed into lists of words with the corresponding labels. The only thing that must be done is tokenization and realignment of the labels so that they correspond with the right subword tokens.

```python
def preprocess_fuction(all_samples_per_split):
    # ... tokenization and label realignment (body elided in this diff) ...
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples
```
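Since the body of `preprocess_fuction` is elided in this diff, here is a minimal sketch of the general tokenize-and-realign pattern it describes, using a fast tokenizer's `word_ids()`; the checkpoint and helper name are assumptions, not taken from the notebook:

```python
from transformers import AutoTokenizer

# Assumed checkpoint for illustration; the notebook may use a different one.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def align_labels_with_tokens(words, word_labels):
    # Tokenize a document that is already split into words.
    tokenized = tokenizer(words, is_split_into_words=True, truncation=True)
    labels = []
    for word_idx in tokenized.word_ids():
        if word_idx is None:
            labels.append(-100)  # special tokens: ignored by the loss
        else:
            labels.append(word_labels[word_idx])  # each subword inherits its word's label
    tokenized["labels"] = labels
    return tokenized

# Example: each word-level label is copied onto all of that word's subword tokens.
example = align_labels_with_tokens(["keyphrase", "extraction", "is", "fun"], [1, 2, 0, 0])
```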

### Postprocessing
For post-processing, you need to filter out the B- and I-labeled tokens and concatenate each B token with the consecutive I tokens that follow it. Finally, each keyphrase is stripped to ensure all surrounding spaces are removed.
```python
import numpy as np

# Define post-processing functions.
# `idx2label` is assumed to be defined during training and maps label ids to "B"/"I"/"O" tags.
def concat_tokens_by_tag(keyphrases):
    # Group token ids into keyphrases: a "B" starts a new phrase,
    # an "I" extends the most recent one.
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[-1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the token ids whose predicted label is "B" or "I".
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    # Concatenate consecutive B/I tokens, decode them back to text,
    # then strip whitespace and deduplicate.
    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
```
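A hypothetical end-to-end usage sketch for these helpers; the checkpoint id and the `id2label` mapping are assumptions, not specified by the README here:

```python
import numpy as np
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "ml6team/keyphrase-extraction-distilbert-inspec"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
idx2label = model.config.id2label  # assumed to map label ids to "B"/"I"/"O"

text = "Keyphrase extraction automatically selects the most important phrases from a document."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    predictions = model(**inputs).logits.argmax(-1).numpy()

print(extract_keyphrases(inputs, predictions, tokenizer))
```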
## 📝 Evaluation results

One of the traditional evaluation methods is precision, recall, and F1-score @k,m, where k stands for the first k predicted keyphrases and m for the average number of predicted keyphrases.
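To make the @k part of that notation concrete, here is a small self-contained sketch (not the README's own evaluation code) that computes precision, recall, and F1 over the top-k predictions for one document:

```python
def f1_at_k(predicted, gold, k):
    """Precision/recall/F1 over the top-k predicted keyphrases.

    `predicted` is an ordered list of predicted keyphrases,
    `gold` the set of ground-truth keyphrases.
    """
    top_k = predicted[:k]
    matches = len(set(top_k) & set(gold))
    precision = matches / len(top_k) if top_k else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example with hypothetical values: 2 of the top-5 predictions match 3 gold keyphrases,
# so precision@5 = 0.4, recall@5 ≈ 0.67, F1@5 = 0.5.
preds = ["neural networks", "keyphrase extraction", "tokenization", "bert", "inspec"]
gold = {"keyphrase extraction", "neural networks", "transformers"}
print(f1_at_k(preds, gold, k=5))
```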