---
license: cc-by-nc-4.0
language:
- en
datasets:
- google/trueteacher
- anli
- cnn_dailymail
tags:
- natural-language-inference
- news-articles-summarization
---

# **TrueTeacher**

This is a **Factual Consistency Evaluation** model, introduced in the [TrueTeacher paper (Gekhman et al., 2023)](https://arxiv.org/pdf/2305.11171.pdf).

## Model Details

The model is optimized for evaluating factual consistency in **summarization**.

It is the main model from the paper (see "T5-11B w. ANLI + TrueTeacher full" in Table 1), which is based on **T5-11B** [(Raffel et al., 2020)](https://jmlr.org/papers/volume21/20-074/20-074.pdf) fine-tuned on a mixture of the following datasets:
- [TrueTeacher](https://huggingface.co/datasets/google/trueteacher) ([Gekhman et al., 2023](https://arxiv.org/pdf/2305.11171.pdf))
- [ANLI](https://huggingface.co/datasets/anli) ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))

The TrueTeacher dataset contains model-generated summaries of articles from the train split of the **CNN/DailyMail** dataset [(Hermann et al., 2015)](https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf),
which are annotated for factual consistency using **FLAN-PaLM 540B** [(Chung et al., 2022)](https://arxiv.org/pdf/2210.11416.pdf).
The summaries were generated using summarization models trained on the **XSum** dataset [(Narayan et al., 2018)](https://aclanthology.org/D18-1206.pdf).

The input format for the model is: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY".
To accommodate the input length of common summarization datasets, we recommend setting **max_length** to **2048**.

The model predicts a binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).

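
The input template and the label convention described above can be wrapped in two small helpers. This is a minimal sketch and not part of the released code: the names `format_input` and `parse_label` are hypothetical, and stripping whitespace and raising on unexpected output are assumptions of the sketch.

```python
def format_input(premise: str, hypothesis: str) -> str:
    """Build the input string in the documented 'premise: ... hypothesis: ...' format."""
    return f'premise: {premise} hypothesis: {hypothesis}'


def parse_label(generated_text: str) -> bool:
    """Map the decoded model output to a boolean.

    '1' means factually consistent, '0' means factually inconsistent.
    Raising on any other output is an assumption of this sketch.
    """
    label = generated_text.strip()
    if label not in ('0', '1'):
        raise ValueError(f'unexpected model output: {generated_text!r}')
    return label == '1'


# Example with a toy premise and hypothesis.
text = format_input('the sun is shining', 'the sun is out in the sky')
print(text)
print(parse_label('1'))
```

The string returned by `format_input` is what would be passed to the tokenizer, and `parse_label` would be applied to the decoded generation.
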

## Evaluation results

This model achieves the following ROC AUC results on the summarization subset of the [TRUE benchmark (Honovich et al., 2022)](https://arxiv.org/pdf/2204.04991.pdf):

| **MNBM** | **QAGS-X** | **FRANK** | **SummEval** | **QAGS-C** | **Average** |
|----------|-----------|-----------|--------------|-----------|-------------|
| 78.1     | 89.4      | 93.6      | 88.5         | 89.4      | 87.8        |

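
For readers unfamiliar with the metric: ROC AUC is the probability that a randomly chosen consistent example receives a higher score than a randomly chosen inconsistent one, with ties counted as one half. The sketch below computes it in pure Python on made-up toy scores (not benchmark data) and checks that the Average column equals the mean of the five per-dataset values. The function name `roc_auc` is hypothetical, and this is not the evaluation code used in the paper.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)

# Per-dataset values copied from the table above.
reported = [78.1, 89.4, 93.6, 88.5, 89.4]
average = round(sum(reported) / len(reported), 1)
print(average)
```

The quadratic loop is fine for small inputs; for full benchmark sizes a rank-based implementation would be the more usual choice.
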

## Intended Use

This model is intended for research use (**non-commercial**) in English.

The recommended use case is evaluating factual consistency in summarization.

## Out-of-scope use

Any use case that violates the **cc-by-nc-4.0** license.

Usage in languages other than English.

## Usage examples

#### classification
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '1'),
                             ('the cat is shiny', '0')]:
    # Encode the pair in the "premise: ... hypothesis: ..." input format.
    input_ids = tokenizer(
        f'premise: {premise} hypothesis: {hypothesis}',
        return_tensors='pt',
        truncation=True,
        max_length=2048).input_ids
    # The model generates the label as text: '1' or '0'.
    outputs = model.generate(input_ids)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f'premise: {premise}')
    print(f'hypothesis: {hypothesis}')
    print(f'result: {result} (expected: {expected})\n')
```

#### scoring
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
import torch

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'),
                             ('the cat is shiny', '<< 0.5')]:
    input_ids = tokenizer(
        f'premise: {premise} hypothesis: {hypothesis}',
        return_tensors='pt',
        truncation=True,
        max_length=2048).input_ids
    # Start decoding from the pad token to get logits for the first output token.
    decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    logits = outputs.logits
    probs = torch.softmax(logits[0], dim=-1)
    # The score is the probability assigned to the '1' token.
    one_token_id = tokenizer('1').input_ids[0]
    entailment_prob = probs[0, one_token_id].item()
    print(f'premise: {premise}')
    print(f'hypothesis: {hypothesis}')
    print(f'score: {entailment_prob:.3f} (expected: {expected})\n')
```
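
If binary decisions are needed rather than raw scores, one option is to threshold the probability of the '1' token. The sketch below assumes a threshold of 0.5, which is consistent with the expected values noted in the scoring example but is an assumption of this sketch rather than a recommendation from the paper. The function name `scores_to_labels` and the sample scores are hypothetical.

```python
def scores_to_labels(scores, threshold=0.5):
    """Convert consistency scores in [0, 1] into the model's '1' / '0' labels.

    The default threshold of 0.5 is an assumption of this sketch.
    """
    return ['1' if score >= threshold else '0' for score in scores]


# Made-up scores for illustration; real scores come from the model.
example_scores = [0.97, 0.03]
example_labels = scores_to_labels(example_scores)
print(example_labels)
```

A different threshold could be chosen on held-out data if a particular precision or recall trade-off matters for the application.
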

## Citation

If you use this model for a research publication, please cite the TrueTeacher paper (using the bibtex entry below), as well as the ANLI, CNN/DailyMail, XSum, T5 and FLAN papers mentioned above.

```
@misc{gekhman2023trueteacher,
      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models},
      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
      year={2023},
      eprint={2305.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```