## Introduction

We present TIGERScore, a **T**rained metric that follows **I**nstruction **G**uidance to perform **E**xplainable and **R**eference-free evaluation over a wide spectrum of text generation tasks. The metric is based on LLaMA-2 and trained on our meticulously curated instruction-tuning dataset [MetricInstruct](https://huggingface.co/datasets/TIGER-Lab/MetricInstruct), which covers 6 text generation tasks and 23 text generation datasets.
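
Because TIGERScore is a fine-tuned causal language model, it can be loaded with 🤗 Transformers. Below is a minimal loading sketch; the checkpoint ID `TIGER-Lab/TIGERScore-7B` is an assumption for illustration, so substitute the repo ID of this model card:

```python
# Minimal loading sketch, not an official usage snippet.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TIGER-Lab/TIGERScore-7B"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```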

Existing automatic metrics lag behind and suffer from issues such as 1) **Dependency on references**, 2) **Limitation to specific domains**, and 3) **Lack of attribution**. In contrast, TIGERScore is designed to be driven by natural language instructions and to provide detailed error analysis that pinpoints the mistakes in the generated text.

Specifically, TIGERScore takes an instruction, an associated input context, and a hypothesis output that might contain errors. TIGERScore then evaluates the hypothesis output and lists the errors it finds, each consisting of the error location, aspect, explanation, and a penalty score (points deducted, starting from 0). The sum of the deducted points is taken as the overall rating of the output.
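
Continuing the loading sketch above, the snippet below makes this protocol concrete. The prompt layout and the `Score reduction:` output format are illustrative assumptions, not the exact template the model was trained with; consult the official TIGERScore repository for that template.

```python
import re

# Hypothetical prompt layout: instruction + input context + hypothesis output.
prompt = (
    "Instruction: Translate the following German sentence into English.\n"
    "Input context: Der Hund schläft auf dem Sofa.\n"
    "Hypothesis output: The dog sleep on a sofa.\n"
    "List all errors in the hypothesis, each with its location, aspect, "
    "explanation, and score reduction."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (skip the echoed prompt).
analysis = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# The overall rating starts at 0 and drops with each identified error, so
# sum the per-error penalties and negate. The regex assumes lines such as
# "Score reduction: 2" (hypothetical output format).
penalties = re.findall(r"[Ss]core reduction[:\s]*(\d+(?:\.\d+)?)", analysis)
overall_rating = -sum(float(p) for p in penalties)
print(f"Overall rating: {overall_rating}")
```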

As a reference-free metric, TIGERScore achieves correlations with human ratings that can even surpass those of the best existing reference-based metrics. We believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.
## Training Data

The models are trained on the 🤗 [MetricInstruct Dataset](https://huggingface.co/datasets/TIGER-Lab/MetricInstruct), which covers 6 text generation tasks and 23 text generation datasets. Check out the dataset card for more details.
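
For a quick look at the training data, the 🤗 Datasets library can load it directly. A small sketch, assuming the default configuration (inspect the returned object for the available splits):

```python
from datasets import load_dataset

# Repo ID taken from the link above; configuration and split names
# are left to the library's defaults.
metric_instruct = load_dataset("TIGER-Lab/MetricInstruct")
print(metric_instruct)
```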