	---
	license: cc-by-nc-4.0
	language:
	- en
	datasets:
	- google/trueteacher
	- anli
	- cnn_dailymail
	tags:
	- natural-language-inference
	- news-articles-summarization
	---

	# TrueTeacher

	This is a Factual Consistency Evaluation model, introduced in the [TrueTeacher paper (Gekhman et al., 2023)](https://arxiv.org/pdf/2305.11171.pdf).

	## Model Details

	The model is optimized for evaluating factual consistency in summarization.

	It is the main model from the paper (see "T5-11B w. ANLI + TrueTeacher full" in Table 1). It is based on T5-11B [(Raffel et al., 2020)](https://jmlr.org/papers/volume21/20-074/20-074.pdf), fine-tuned on a mixture of the following datasets:
	- [TrueTeacher](https://huggingface.co/datasets/google/trueteacher) ([Gekhman et al., 2023](https://arxiv.org/pdf/2305.11171.pdf))
	- [ANLI](https://huggingface.co/datasets/anli) ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))

	The TrueTeacher dataset contains model-generated summaries of articles from the train split of the CNN/DailyMail dataset [(Hermann et al., 2015)](https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf).
	The summaries are annotated for factual consistency using FLAN-PaLM 540B [(Chung et al., 2022)](https://arxiv.org/pdf/2210.11416.pdf).
	They were generated by summarization models trained on the XSum dataset [(Narayan et al., 2018)](https://aclanthology.org/D18-1206.pdf).

	The input format for the model is: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY".
	To accommodate the input length of common summarization datasets, we recommend setting `max_length` to 2048.
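
As a minimal sketch of the input format above, the helper below assembles the prompt string from a grounding document and a candidate summary. The function name `build_input` and the whitespace stripping are assumptions made for illustration; they are not part of the released model or of the `transformers` API.

```python
def build_input(document: str, summary: str) -> str:
    """Format a (document, summary) pair as 'premise: ... hypothesis: ...'."""
    # Stripping surrounding whitespace is an assumption of this sketch,
    # not a documented requirement of the model.
    return f"premise: {document.strip()} hypothesis: {summary.strip()}"


example = build_input("the sun is shining", "the sun is out in the sky")
print(example)
```

The resulting string can be passed to the tokenizer with `truncation=True` and `max_length=2048`, as in the usage examples.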

	The model predicts a binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).
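
Because the label is produced as generated text, downstream code usually needs to map it to a boolean. The sketch below shows one way to do that; `parse_label` is a hypothetical helper, and raising an error on any other output is a design choice of this example rather than documented model behavior.

```python
def parse_label(generated_text: str) -> bool:
    """Map the decoded model output to True (consistent) or False (inconsistent)."""
    text = generated_text.strip()
    if text == "1":
        return True
    if text == "0":
        return False
    # Anything else is unexpected; fail loudly instead of guessing a label.
    raise ValueError(f"unexpected model output: {generated_text!r}")


print(parse_label("1"), parse_label("0"))
```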

	## Evaluation results

	This model achieves the following ROC AUC results on the summarization subset of the [TRUE benchmark (Honovich et al., 2022)](https://arxiv.org/pdf/2204.04991.pdf):

	| MNBM | QAGS-X | FRANK | SummEval | QAGS-C | Average |
	|------|--------|-------|----------|--------|---------|
	| 78.1 | 89.4   | 93.6  | 88.5     | 89.4   | 87.8    |
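
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen consistent summary receives a higher score than a randomly chosen inconsistent one. The sketch below computes it by pairwise comparison on made-up toy data (the labels and scores are illustrative, not benchmark data); the table values appear to be this quantity scaled by 100. It also checks that the Average column equals the unweighted mean of the five per-dataset values.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data for illustration only: 1 = consistent, 0 = inconsistent.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)

# Values copied from the table above.
reported = {"MNBM": 78.1, "QAGS-X": 89.4, "FRANK": 93.6,
            "SummEval": 88.5, "QAGS-C": 89.4}
average = round(sum(reported.values()) / len(reported), 1)
print(toy_auc, average)
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based implementation is preferable for large evaluation sets.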


	## Intended Use

	This model is intended for research (non-commercial) use in English.

	The recommended use case is evaluating factual consistency in summarization.

	## Out-of-scope use
	Any use cases which violate the cc-by-nc-4.0 license.

	Usage in languages other than English.

	## Usage examples

	#### classification
	```python
	from transformers import T5ForConditionalGeneration
	from transformers import T5Tokenizer

	model_path = 'google/t5_11b_trueteacher_and_anli'
	tokenizer = T5Tokenizer.from_pretrained(model_path)
	model = T5ForConditionalGeneration.from_pretrained(model_path)

	premise = 'the sun is shining'
	# Each pair is (hypothesis, expected label): '1' = consistent, '0' = inconsistent.
	for hypothesis, expected in [('the sun is out in the sky', '1'),
	                             ('the cat is shiny', '0')]:
	    input_ids = tokenizer(
	        f'premise: {premise} hypothesis: {hypothesis}',
	        return_tensors='pt',
	        truncation=True,
	        max_length=2048).input_ids
	    # The model generates the label as text.
	    outputs = model.generate(input_ids)
	    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	    print(f'premise: {premise}')
	    print(f'hypothesis: {hypothesis}')
	    print(f'result: {result} (expected: {expected})\n')
	```

	#### scoring
	```python
	from transformers import T5ForConditionalGeneration
	from transformers import T5Tokenizer
	import torch

	model_path = 'google/t5_11b_trueteacher_and_anli'
	tokenizer = T5Tokenizer.from_pretrained(model_path)
	model = T5ForConditionalGeneration.from_pretrained(model_path)

	premise = 'the sun is shining'
	# Each pair is (hypothesis, expected score range).
	for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'),
	                             ('the cat is shiny', '<< 0.5')]:
	    input_ids = tokenizer(
	        f'premise: {premise} hypothesis: {hypothesis}',
	        return_tensors='pt',
	        truncation=True,
	        max_length=2048).input_ids
	    # T5 starts decoding from the pad token, so a single forward pass
	    # yields the logits for the first generated token.
	    decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
	    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
	    logits = outputs.logits
	    # Softmax over the vocabulary, then read off the probability of the '1' token.
	    probs = torch.softmax(logits[0], dim=-1)
	    one_token_id = tokenizer('1').input_ids[0]
	    entailment_prob = probs[0, one_token_id].item()
	    print(f'premise: {premise}')
	    print(f'hypothesis: {hypothesis}')
	    print(f'score: {entailment_prob:.3f} (expected: {expected})\n')
	```
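
To make the score computation above concrete without loading the 11B model, the following sketch reproduces the same arithmetic on a toy logit vector: a softmax over the vocabulary followed by reading off the probability of the '1' token. The four-entry vocabulary, the token index, and the 0.5 decision threshold are assumptions chosen for illustration; in practice a threshold would be tuned on held-out data.

```python
import math
from typing import Sequence


def token_probability(first_step_logits: Sequence[float], token_id: int) -> float:
    """Softmax over the vocabulary logits, returning the probability of one token."""
    max_logit = max(first_step_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in first_step_logits]
    return exps[token_id] / sum(exps)


# Toy first-step logits over a hypothetical 4-token vocabulary;
# index 2 plays the role of the '1' token in this example.
toy_logits = [0.0, 1.0, 3.0, -1.0]
score = token_probability(toy_logits, token_id=2)

# Assumed threshold of 0.5 for turning the score into a binary label.
predicted_label = 1 if score > 0.5 else 0
print(f"score: {score:.3f}, label: {predicted_label}")
```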

	## Citation

	If you use this model in a research publication, please cite the TrueTeacher paper (using the BibTeX entry below), as well as the ANLI, CNN/DailyMail, XSum, T5 and FLAN papers mentioned above.

	```
	@misc{gekhman2023trueteacher,
	      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models},
	      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
	      year={2023},
	      eprint={2305.11171},
	      archivePrefix={arXiv},
	      primaryClass={cs.CL}
	}
	```