metadata
license: apache-2.0
language:
  - en
tags:
  - controlled-text-reduction
  - news-articles-summarization

Controlled Text Reduction model

This is a Controlled Text Reduction model, introduced in the Don't Add, don't Miss paper (Slobodkin et al., 2023).

Model Details

The model is optimized for performing controlled text reduction in summarization.

It is the best-performing model from the paper (see "Distilled Flan-T5 + RL" in Table 1): a Flan-T5-large model (Chung et al., 2022) fine-tuned on the distilled Controlled Text Reduction dataset.

The input format for the model is: '_TREE_TOKEN_00000 Instruction: In this task, you are presented with a passage, where some parts are "highlighted" (namely, there are <extra_id_1> and <extra_id_2> tokens before and after each such span). Your job is to generate a summary that covers all and only the "highlighted" spans. Passage: PASSAGE_WITH_HIGHLIGHTS'

where PASSAGE_WITH_HIGHLIGHTS is the source text, with <extra_id_1> placed before and <extra_id_2> placed after each consecutive highlighted span. To accommodate the input length of common summarization datasets, we recommend setting max_length to 2048.
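The input format described above can be assembled programmatically. The following is a minimal sketch, not code from the paper: the helper names `mark_highlights` and `build_input` are hypothetical, highlights are assumed to be given as character offsets, and the spacing around the markers (a space after <extra_id_1>, none before <extra_id_2>) simply mirrors the usage example in this card.

```python
# Instruction prefix copied from the input format in this model card.
INSTRUCTION = (
    "_TREE_TOKEN_00000 Instruction: In this task, you are presented with a passage, "
    'where some parts are "highlighted" (namely, there are <extra_id_1> and '
    "<extra_id_2> tokens before and after each such span). Your job is to generate "
    'a summary that covers all and only the "highlighted" spans. Passage: '
)


def mark_highlights(passage, spans):
    """Wrap (start, end) character spans in <extra_id_1> ... <extra_id_2>.

    Overlapping or touching spans are merged, so each consecutive
    highlight receives exactly one pair of markers.
    """
    merged = []
    for start, end in sorted(spans):
        if not 0 <= start < end <= len(passage):
            raise ValueError(f"invalid span: ({start}, {end})")
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    pieces, cursor = [], 0
    for start, end in merged:
        pieces.append(passage[cursor:start])  # unhighlighted text
        pieces.append(f"<extra_id_1> {passage[start:end]}<extra_id_2>")
        cursor = end
    pieces.append(passage[cursor:])  # trailing unhighlighted text
    return "".join(pieces)


def build_input(passage, spans):
    """Prepend the instruction to the marked passage."""
    return INSTRUCTION + mark_highlights(passage, spans)


passage = "Thomas won the bronze medal."
start = passage.index("won the bronze")
spans = [(0, len("Thomas")), (start, start + len("won the bronze"))]
print(mark_highlights(passage, spans))
```

Character offsets are only one possible representation; if highlights are stored as token or sentence indices, they would need to be converted to offsets first.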

The model generates the highlight-oriented reduction of the text.

Evaluation results

This model achieves the following scores (compared to the concatenated highlights):

| R-1 | R-2 | R-L | METEOR | BERTScore |
|------|------|------|--------|-----------|
| 85.0 | 74.3 | 82.1 | 84.3 | 81.9 |
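Since the scores are computed against the concatenated highlights, the reference for a given input can be recovered from the marked passage itself. The sketch below illustrates the idea under stated assumptions: `concatenate_highlights` and `unigram_f1` are hypothetical helpers, and the unigram-overlap F1 is a simplified stand-in for illustration only, not the ROUGE, METEOR, or BERTScore implementations behind the reported numbers.

```python
import re
from collections import Counter

# Matches the text between each <extra_id_1> ... <extra_id_2> marker pair.
HIGHLIGHT_RE = re.compile(r"<extra_id_1>(.*?)<extra_id_2>", flags=re.DOTALL)


def concatenate_highlights(marked_passage):
    """Join all highlighted spans, in order, with single spaces."""
    spans = [span.strip() for span in HIGHLIGHT_RE.findall(marked_passage)]
    return " ".join(span for span in spans if span)


def unigram_f1(candidate, reference):
    """Simplified unigram-overlap F1 (illustrative, not official ROUGE-1)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped matching unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


marked = "said <extra_id_1> Thomas<extra_id_2>, who <extra_id_1> won the bronze<extra_id_2> medal"
reference = concatenate_highlights(marked)
print(reference)
print(unigram_f1("Thomas won the bronze.", reference))
```

For scores comparable to the table, an established metric implementation should be used in place of `unigram_f1`.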

Intended Use

This model is intended for non-commercial research use in English.

The recommended use case is performing controlled text reduction on news-related texts.

Out-of-scope use

Any use cases which violate the model's license.

Usage in languages other than English.

Usage examples

Controlled text reduction

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_path = 'biu-nlp/distil-Flan-T5-large-controlled-text-reduction-RL'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

highlighted_text = """<extra_id_1> Debi Thomas\' dream of Olympic gold turned into disappointment Saturday as East Germany\'s Katarina Witt won her second straight Olympic championship and Canadian Elizabeth Manley took home the silver<extra_id_2> before a crowd of cheering countrymen. "It\'s over. Back to school," said <extra_id_1> Thomas<extra_id_2>, who <extra_id_1> won the bronze<extra_id_2> medal <extra_id_1> despite three faulty landings<extra_id_2>. "I\'m not going to make any excuses. I was really skating well this week. It wasn\'t supposed to happen, I guess. But I tried." While the top two skaters in the world staged a shootout to music from Bizet\'s "Carmen," Manley was so sensational in the freestyle that she finished first with seven judges. Combined with a fourth in the compulsory figures and a third-place finish in the short program earlier in the week, the performance put Manley in second place. Witt, a three-time world champion from East Germany, became the first repeat singles champion since Dick Button took Olympic gold in 1948 and \'52. Sonja Henie of Norway was the only woman to do it before Witt, winning in 1928, 1932 and 1936. <extra_id_1> Thomas, of San Jose, Calif., the first black to win a U.S. figure skating crown and the 1986 world champion, skated poorly Saturday after doing well earlier in the Games<extra_id_2>. By contrast, Manley had the sellout crowd at the Olympic Saddledome enraptured. They cheered, hooted and stamped their feet when she finished hitting every element of her program. Jill Trenary of Minnetonka, Minn., finished fourth. She was fifth heading into the long program, worth 50 percent of the overall score. <extra_id_1> Thomas<extra_id_2>\'bronze <extra_id_1> was<extra_id_2> the third figure skating medal here <extra_id_1> for the United States<extra_id_2>. <extra_id_1> Brian Boitano won the men\'s crown<extra_id_2>, and <extra_id_1> a bronze in pairs went to<extra_id_2> Jill Watson and Peter Oppegard. 
<extra_id_1> In addition<extra_id_2> to the three <extra_id_1> figure skating<extra_id_2> medals, <extra_id_1> the U.S. team had three speed-skating medals: one each gold, silver and bronze<extra_id_2>. Speed skater Bonnie Blair, America\'s only double medalist, tried again Saturday in the 1,500 meters but finished fourth, well off the pace. She won the gold in the 500 and the bronze in the 1,000 meters. As the Olympics winded up its next-to-last day, the Soviet Union had 27 medals, including 11 golds, while East Germany in second place had 22, including nine golds."""

input_text = f"""_TREE_TOKEN_00000 Instruction: In this task, you are presented with a passage, where some parts are "highlighted" (namely, there are <extra_id_1> and <extra_id_2> tokens before and after each such span). Your job is to generate a summary that covers all and only the "highlighted" spans. Passage: {highlighted_text}"""

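# Tokenize the prompt, explicitly truncating it to at most 2048 tokens
inputs = tokenizer(input_text, max_length=2048, truncation=True, return_tensors="pt")

# Generation settings: 2-beam search combined with nucleus (top-p) sampling
model_kwargs = {
    "max_length": 512,
    "num_beams": 2,
    "no_repeat_ngram_size": 3,
    "length_penalty": 2.0,
    "top_p": 0.9,
    "do_sample": True,
}

model.eval()
# Generate and decode the first (and only) sequence in the batch
outputs = model.generate(**inputs, **model_kwargs).tolist()[0]
result = tokenizer.decode(outputs, skip_special_tokens=True)
print(f"The reduction is:\n {result}")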
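Because the tokenizer call caps the input at 2048 tokens, a longer prompt may lose highlights near the end of the passage. A minimal sketch of a length check follows; `tokens_over_limit` is a hypothetical helper, and the whitespace tokenizer is only a stand-in so the snippet runs without downloading the model.

```python
def tokens_over_limit(tokenizer, text, max_length=2048):
    """Return by how many tokens `text` exceeds `max_length` (0 if it fits).

    `tokenizer` is any callable that returns a mapping with an "input_ids"
    list, which is what Hugging Face tokenizers return for a single string.
    """
    n_tokens = len(tokenizer(text)["input_ids"])
    return max(0, n_tokens - max_length)


# Stand-in tokenizer (assumption for illustration): one token per word.
def whitespace_tokenizer(text):
    return {"input_ids": text.split()}


print(tokens_over_limit(whitespace_tokenizer, "a b c d e", max_length=3))  # prints 2
```

With the real tokenizer loaded in the example above, the call would be `tokens_over_limit(tokenizer, input_text)`; a non-zero result suggests shortening the passage before generation.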

Citation

If you use this model in a research publication, please cite the Don't Add, don't Miss paper (using the BibTeX entry below), as well as the original Controlled Text Reduction paper (Slobodkin et al., 2022).

@misc{slobodkin2023dont,
      title={Don't Add, don't Miss: Effective Content Preserving Generation from Pre-Selected Text Spans},
      author={Aviv Slobodkin and Avi Caciularu and Eran Hirsch and Ido Dagan},
      year={2023},
      eprint={2310.09017},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}