TIGER-Lab/TIGERScore-7B

@@ -10,12 +10,14 @@ license: mit
 ## Introduction
-We present TIGERScore, a **T**rained metric that follows **I**nstruction **G**uidance to perform **E**xplainable, and **R**eference-free evaluation over a wide spectrum of text generation tasks. TIGERScore is guided by natural language instruction to provide error analysis to pinpoint the mistakes in the generated text. Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct which covers 6 text generation tasks and 23 text generation datasets. As a reference-free metric, its correlation can even surpass the best existing reference-based metrics. To further qualitatively assess the rationale generated by our metric, we conduct human evaluation on the generated explanations and found that the explanations are 70.8% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.
-Existing automatic metrics are lagging and suffer from issues like 1) Dependency on references, 2) Limited to specific domains, 3) Lack of attribution. Contrary to them, TIGERScore is designed to be driven by natural language instruction and provide detailed error analysis to pinpoint the mistakes in the generated text.
 Specifically, TIGERScore takes an instruction, an associated input context, and a hypothesis output that might contain errors. It then evaluates the hypothesis output and lists the errors it finds, each with a location, an aspect, an explanation, and a penalty score (the amount deducted, with the score starting from 0). The sum of these penalty scores is the overall rating of the output.
 ## Training Data
 The models are trained on the 🤗 [MetricInstruct Dataset](https://huggingface.co/datasets/TIGER-Lab/MetricInstruct), which covers 6 text generation tasks and 22 text generation datasets. Check out the dataset card for more details.

 ## Introduction
+We present TIGERScore, a **T**rained metric that follows **I**nstruction **G**uidance to perform **E**xplainable, and **R**eference-free evaluation over a wide spectrum of text generation tasks. Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset [MetricInstruct](https://huggingface.co/datasets/TIGER-Lab/MetricInstruct) which covers 6 text generation tasks and 23 text generation datasets.
+Existing automatic metrics lag behind and suffer from issues such as 1) **dependency on references**, 2) **restriction to specific domains**, and 3) **lack of attribution**. In contrast, TIGERScore is driven by natural language instructions and provides detailed error analysis that pinpoints mistakes in the generated text.
 Specifically, TIGERScore takes an instruction, an associated input context, and a hypothesis output that might contain errors. It then evaluates the hypothesis output and lists the errors it finds, each with a location, an aspect, an explanation, and a penalty score (the amount deducted, with the score starting from 0). The sum of these penalty scores is the overall rating of the output.
+As a reference-free metric, its correlation can even surpass the best existing reference-based metrics. We believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.
 ## Training Data
 The models are trained on the 🤗 [MetricInstruct Dataset](https://huggingface.co/datasets/TIGER-Lab/MetricInstruct), which covers 6 text generation tasks and 22 text generation datasets. Check out the dataset card for more details.
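
The scoring scheme described in the Introduction (each listed error carries a penalty, and the penalties sum to the overall rating) can be sketched roughly as follows. This is only an illustration, not the model's actual output format or API: the `ErrorItem` class, its field names, and the `overall_rating` helper are hypothetical, and the sketch assumes each penalty is stored as a non-positive number so that an output with no errors keeps the starting score of 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorItem:
    """One error reported for a hypothesis output (hypothetical structure)."""
    location: str      # where in the output the error occurs
    aspect: str        # which quality aspect the error violates
    explanation: str   # natural language explanation of the error
    penalty: float     # score reduction; assumed to be <= 0 in this sketch


def overall_rating(errors: List[ErrorItem]) -> float:
    """Sum the penalties; with no errors the rating stays at 0."""
    return sum(e.penalty for e in errors)


# Made-up example values, for illustration only.
errors = [
    ErrorItem("sentence 1", "accuracy", "States a fact not in the input.", -4.0),
    ErrorItem("sentence 3", "fluency", "Ungrammatical phrasing.", -1.5),
]
print(overall_rating(errors))
```

Under these assumptions, a more negative rating indicates a worse output, and 0 is the best possible score.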