import datasets
import evaluate
from blanc import BlancHelp, BlancTune
_DESCRIPTION = """
BLANC is an automatic method for estimating the quality of document summaries without the need for human-written reference summaries. It works by measuring how much a summary helps a pre-trained language model (like BERT) perform a language understanding task, such as filling in masked words in a document. The two main variations are:
1. BLANC-help: The summary is concatenated with document sentences during the inference task. The BLANC-help score is defined as the difference in accuracy between unmasking tokens with the summary and with filler text (same length as the summary but consisting of period symbols). It measures how much the summary boosts the model's performance on masked tokens.
2. BLANC-tune: The model is fine-tuned on the summary text before it processes the entire document. The BLANC-tune score is calculated by comparing the performance of the fine-tuned model with that of the original model, both tasked with unmasking tokens in the document text. This method reflects how much the model's ability to understand the document improves after learning from the summary.
These BLANC measures show good correlation with human evaluations, similar to ROUGE scores, but do not require reference summaries.
See the BLANC paper for more details: https://aclanthology.org/2020.eval4nlp-1.2/
"""
_KWARGS_DESCRIPTION = """
Args:
    documents (list of str): Source documents whose summaries are being evaluated.
    summaries (list of str): Generated summaries, one per document.
model_name (str, optional): BERT model type to use for evaluation. Default is "bert-base-uncased".
measure (str, optional): Measure type, either "improve" or "relative", as defined in the BLANC paper. Default is "relative".
blanc_score (str, optional): BLANC score type, either "help" or "tune". Default is "help".
gap (int, optional): Distance between words to mask during inference. Default is 2.
gap_mask (int, optional): Number of tokens to mask at each designated position during inference. Default is 1.
gap_tune (int, optional): Distance between words to mask during fine-tuning. Default is 2.
gap_mask_tune (int, optional): Number of tokens to mask at each designated position during fine-tuning. Default is 1.
min_token_length_normal (int, optional): Minimum number of characters in normal tokens to mask (whole words) during inference. Default is 4.
min_token_length_lead (int, optional): Minimum number of characters in lead tokens (first part of words) to mask during inference. Default is 2.
min_token_length_followup (int, optional): Minimum number of characters in follow-up tokens (continuations of words) to mask during inference. Default is 100.
min_token_length_normal_tune (int, optional): Minimum number of characters in normal tokens to mask during fine-tuning. Default is -1.
min_token_length_lead_tune (int, optional): Minimum number of characters in lead tokens to mask during fine-tuning. Default is -1.
min_token_length_followup_tune (int, optional): Minimum number of characters in follow-up tokens to mask during fine-tuning. Default is -1.
    device (str, optional): Device to run the model on, either "cpu" or "cuda". Default is "cpu".
    random_seed (int, optional): Random seed for Python and PyTorch; only used when fine-tuning (BLANC-tune). Default is 0.
inference_batch_size (int, optional): Batch size for inference. Default is 1.
inference_mask_evenly (bool, optional): Whether to mask every `gap` tokens during inference (True) or mask randomly with a probability of 0.15 (False). Default is True.
show_progress_bar (bool, optional): Whether to show progress bars during evaluation. Default is True.
BLANC-help specific arguments:
filler_token (str, optional): Token to use as filler in lieu of the summary. Default is ".".
help_sep (str, optional): Token used to separate the summary (or filler) from the sentence, or '' for no separator. Default is "".
BLANC-tune specific arguments:
finetune_batch_size (int, optional): Batch size to use when fine-tuning on the summary. Default is 1.
finetune_epochs (int, optional): Number of epochs for fine-tuning on the summary. Default is 10.
finetune_mask_evenly (bool, optional): Whether to mask every `gap` tokens during fine-tuning (True) or mask randomly with a probability of 0.15 (False). Default is True.
finetune_chunk_size (int, optional): Number of summary tokens to use at a time during fine-tuning. Default is 64.
finetune_chunk_stride (int, optional): Number of tokens between summary chunks for fine-tuning. Default is 32.
learning_rate (float, optional): Learning rate for fine-tuning on the summary. Default is 5e-05.
warmup_steps (int, optional): Number of warmup steps for fine-tuning. Default is 0.
Returns:
    dict: A single-entry dictionary whose key is "blanc_help" or "blanc_tune" (matching `blanc_score`) and whose value is the list of BLANC scores, one per document-summary pair.
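Examples:
    A minimal, illustrative call. The load path depends on where this metric is hosted, and actual score values vary with the model and hardware:

    >>> blanc = evaluate.load("blanc")
    >>> results = blanc.compute(
    ...     documents=["Jack drove his minivan to the bazaar to buy milk for his family."],
    ...     summaries=["Jack bought milk."],
    ... )
    >>> list(results.keys())
    ['blanc_help']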
"""
_CITATION = """
@inproceedings{vasilyev-etal-2020-fill,
title = "Fill in the {BLANC}: Human-free quality estimation of document summaries",
author = "Vasilyev, Oleg and
Dharnidharka, Vedant and
Bohannon, John",
editor = "Eger, Steffen and
Gao, Yang and
Peyrard, Maxime and
Zhao, Wei and
Hovy, Eduard",
booktitle = "Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.eval4nlp-1.2",
doi = "10.18653/v1/2020.eval4nlp-1.2",
pages = "11--20",
abstract = "We present BLANC, a new approach to the automatic estimation of document summary quality. Our goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. Our approach achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document{'}s text. We present evidence that BLANC scores have as good correlation with human evaluations as do the ROUGE family of summary quality measurements. And unlike ROUGE, the BLANC method does not require human-written reference summaries, allowing for fully human-free summary quality estimation.",
}
"""
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class BLANC(evaluate.Metric):
def _info(self):
return evaluate.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
homepage="https://github.com/PrimerAI/blanc",
inputs_description=_KWARGS_DESCRIPTION,
features=[
datasets.Features(
{
"documents": datasets.Value("string", id="sequence"),
"summaries": datasets.Value("string", id="sequence"),
}
),
],
codebase_urls=["https://github.com/PrimerAI/blanc"],
reference_urls=[
"https://github.com/PrimerAI/blanc",
"https://aclanthology.org/2020.eval4nlp-1.2/",
],
)
def _download_and_prepare(self, dl_manager):
import nltk
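        # BLANC uses NLTK to split documents into sentences, so fetch the
        # Punkt sentence-tokenizer data once when the metric is prepared.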
nltk.download("punkt_tab")
def _compute(
self,
documents,
summaries,
model_name="bert-base-uncased",
blanc_score="help",
measure="relative",
gap=2,
gap_mask=1,
gap_tune=2,
gap_mask_tune=1,
min_token_length_normal=4,
min_token_length_lead=2,
min_token_length_followup=100,
min_token_length_normal_tune=-1,
min_token_length_lead_tune=-1,
min_token_length_followup_tune=-1,
device="cpu",
random_seed=0,
inference_batch_size=1,
inference_mask_evenly=True,
filler_token=".",
help_sep="",
finetune_batch_size=1,
finetune_epochs=10,
finetune_mask_evenly=True,
finetune_chunk_size=64,
finetune_chunk_stride=32,
learning_rate=5e-05,
warmup_steps=0,
show_progress_bar=True,
):
        # Choose between BLANC-help and BLANC-tune based on the blanc_score argument.
if blanc_score == "help":
blanc_instance = BlancHelp(
model_name=model_name,
measure=measure,
gap=gap,
gap_mask=gap_mask,
gap_tune=gap_tune,
gap_mask_tune=gap_mask_tune,
min_token_length_normal=min_token_length_normal,
min_token_length_lead=min_token_length_lead,
min_token_length_followup=min_token_length_followup,
                min_token_length_normal_tune=min_token_length_normal_tune,
                min_token_length_lead_tune=min_token_length_lead_tune,
                min_token_length_followup_tune=min_token_length_followup_tune,
                device=device,
inference_batch_size=inference_batch_size,
inference_mask_evenly=inference_mask_evenly,
filler_token=filler_token,
help_sep=help_sep,
show_progress_bar=show_progress_bar,
)
elif blanc_score == "tune":
blanc_instance = BlancTune(
model_name=model_name,
measure=measure,
gap=gap,
gap_mask=gap_mask,
gap_tune=gap_tune,
gap_mask_tune=gap_mask_tune,
                min_token_length_normal=min_token_length_normal,
                min_token_length_lead=min_token_length_lead,
                min_token_length_followup=min_token_length_followup,
                min_token_length_normal_tune=min_token_length_normal_tune,
                min_token_length_lead_tune=min_token_length_lead_tune,
                min_token_length_followup_tune=min_token_length_followup_tune,
device=device,
random_seed=random_seed,
inference_batch_size=inference_batch_size,
inference_mask_evenly=inference_mask_evenly,
finetune_batch_size=finetune_batch_size,
finetune_epochs=finetune_epochs,
finetune_mask_evenly=finetune_mask_evenly,
finetune_chunk_size=finetune_chunk_size,
finetune_chunk_stride=finetune_chunk_stride,
learning_rate=learning_rate,
warmup_steps=warmup_steps,
show_progress_bar=show_progress_bar,
)
else:
raise ValueError(f"Invalid measure argument: {measure}. Choose 'help' or 'tune'.")
score = blanc_instance.eval_pairs(documents, summaries)
        output_dict = {f"blanc_{blanc_score}": score}
return output_dict
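

if __name__ == "__main__":
    # A minimal smoke test (a sketch with made-up inputs): run BLANC-help on
    # CPU for one hypothetical document-summary pair and print the raw output.
    metric = BLANC()
    result = metric.compute(
        documents=["Jack drove his minivan to the bazaar to buy milk for his family."],
        summaries=["Jack bought milk."],
    )
    print(result)  # e.g. {'blanc_help': [<score>]}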