metadata
language:
  - en
tags:
  - text aggregation
  - summarization
license: apache-2.0
datasets:
  - toloka/CrowdSpeech
metrics:
  - wer

T5 Large for Text Aggregation

Model description

This is a T5 Large model fine-tuned for crowdsourced text aggregation tasks. The model takes multiple performers' responses and yields a single aggregated response. This approach was first introduced during the VLDB 2021 Crowd Science Challenge and was originally implemented in the second-place competitor's GitHub repository. The paper describing this model was presented at the 2nd Crowd Science Workshop.

How to use

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Performers' responses are joined with " | " into a single input string
input_text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
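
In practice the responses usually arrive as a list rather than as a ready-made string. The sketch below shows one way to build the model input from such a list. The " | " separator is taken from the sample input above; the helper name `build_aggregation_input` and the whitespace cleanup are illustrative assumptions, not part of the released code.

```python
from typing import Iterable

# Separator assumed from the sample input in this model card.
SEPARATOR = " | "


def build_aggregation_input(responses: Iterable[str]) -> str:
    """Join several performers' responses into one model input string.

    Hypothetical helper: collapses repeated whitespace and skips empty
    responses before joining with the separator.
    """
    cleaned = [" ".join(response.split()) for response in responses]
    cleaned = [response for response in cleaned if response]
    if not cleaned:
        raise ValueError("at least one non-empty response is required")
    return SEPARATOR.join(cleaned)


if __name__ == "__main__":
    print(build_aggregation_input(["samplee text", "sampl text", "sample textt"]))
```

The resulting string can be passed to `tokenizer.encode` exactly like the hard-coded sample.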

Training data

Pretrained weights were taken from the original T5 Large model by Google. For more details on the T5 architecture and training procedure, see https://arxiv.org/abs/1910.10683

The model was fine-tuned on the train-clean, dev-clean and dev-other parts of the CrowdSpeech dataset, which was introduced in our paper.

Training procedure

The model was fine-tuned for eight epochs directly following the HuggingFace summarization training example.
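
The Hugging Face summarization example can read JSON lines files whose text and summary columns are selected by name. A rough sketch of preparing such records is given below, assuming each task is a dict with hypothetical `transcriptions` and `ground_truth` keys and that sources are joined with " | " as in the usage sample. The exact preprocessing used for this model is not described here, so treat the field names and layout as assumptions.

```python
import json
from typing import Dict, Iterable, Iterator, List


def to_seq2seq_record(transcriptions: List[str], ground_truth: str) -> Dict[str, str]:
    """Build one source/target pair in a summarization-style layout.

    The "text" and "summary" field names are an assumption chosen for
    illustration; they must match whatever column names the training
    script is configured to read.
    """
    return {"text": " | ".join(transcriptions), "summary": ground_truth}


def to_json_lines(tasks: Iterable[Dict]) -> Iterator[str]:
    """Yield one JSON document per task, suitable for a .json lines file."""
    for task in tasks:
        record = to_seq2seq_record(task["transcriptions"], task["ground_truth"])
        yield json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical task used only to show the output shape.
    example_tasks = [
        {
            "transcriptions": ["samplee text", "sampl text", "sample textt"],
            "ground_truth": "sample text",
        }
    ]
    for line in to_json_lines(example_tasks):
        print(line)
```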

Eval results

Dataset      Split       WER
CrowdSpeech  test-clean  4.99
CrowdSpeech  test-other  10.61
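
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the model output and the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that standard definition, reported as a percentage and pooled over the whole corpus. It is not the evaluation script behind the numbers above, and whether those numbers were pooled or averaged per sentence is not stated here.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: total word errors / total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(wer(["sample text here now"], ["sample test here"]))
```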

BibTeX entry and citation info

@inproceedings{Pletenev:21,
  author    = {Pletenev, Sergey},
  title     = {{Noisy Text Sequences Aggregation as a Summarization Subtask}},
  year      = {2021},
  booktitle = {Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale},
  pages     = {15--20},
  address   = {Copenhagen, Denmark},
  issn      = {1613-0073},
  url       = {http://ceur-ws.org/Vol-2932/short2.pdf},
  language  = {english},
}
@misc{pavlichenko2021vox,
  title         = {Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription},
  author        = {Nikita Pavlichenko and Ivan Stelmakh and Dmitry Ustalov},
  year          = {2021},
  eprint        = {2107.01091},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
}