Nikita Pavlichenko committed on
Commit 167a3c7 · 1 Parent(s): dd150ca

Create README.md

Files changed (1)
  1. README.md +66 -0
README.md ADDED
@@ -0,0 +1,66 @@
+ ---
+ language:
+ - en
+ tags:
+ - text aggregation
+ - summarization
+ license: apache-2.0
+ datasets:
+ - toloka/CrowdSpeech
+ metrics:
+ - wer
+ ---
+
+ # T5 Large for Text Aggregation
+
+ ## Model description
+
+ This is a T5 Large model fine-tuned for crowdsourced text aggregation. It takes several performers' responses to the same task and produces a single aggregated response. The approach was first introduced during the [VLDB'21 Crowd Science Challenge](https://crowdscience.ai/challenges/vldb21) and was originally implemented in the second-place competitor's [GitHub repository](https://github.com/A1exRey/VLDB2021_workshop_t5).
+
+ ## How to use
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ mname = "toloka/t5-large-for-text-aggregation"
+ tokenizer = AutoTokenizer.from_pretrained(mname)
+ model = AutoModelForSeq2SeqLM.from_pretrained(mname)
+
+ text = "samplee text | sampl text | sample textt"  # responses separated by " | "
+ input_ids = tokenizer.encode(text, return_tensors="pt")
+ outputs = model.generate(input_ids)
+ decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(decoded) # sample text
+ ```
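When the responses arrive as a list rather than as one pre-joined string, a small wrapper can build the input in the same shape as the example above. The following is a minimal sketch, not part of the original README: the helper names `join_responses` and `aggregate` are hypothetical, the `" | "` separator is copied from the example input, and `max_new_tokens=64` is an arbitrary illustrative limit.

```python
def join_responses(responses, separator=" | "):
    """Join individual responses into one input string.

    The separator mirrors the example input in this README; treating it as
    the required format is an assumption.
    """
    # Drop empty or whitespace-only responses and trim the rest.
    cleaned = [r.strip() for r in responses if r and r.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty response is required")
    return separator.join(cleaned)


def aggregate(responses, model, tokenizer, max_new_tokens=64):
    """Generate one aggregated response from several noisy ones.

    `model` and `tokenizer` are expected to be loaded as in the example
    above; `max_new_tokens` is an illustrative value, not a tuned setting.
    """
    text = join_responses(responses)
    input_ids = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the model and tokenizer from the example, `aggregate(["samplee text", "sampl text", "sample textt"], model, tokenizer)` feeds the model the same string as the snippet above.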
+
+
+ ## Training data
+
+ Pretrained weights were taken from the [original](https://huggingface.co/t5-large) T5 Large model by Google. For more details on the T5 architecture and training procedure, see [the T5 paper](https://arxiv.org/abs/1910.10683).
+
+ The model was fine-tuned on the `train-clean`, `dev-clean` and `dev-other` parts of the [CrowdSpeech](https://huggingface.co/datasets/toloka/CrowdSpeech) dataset, which was introduced in [our paper](https://openreview.net/forum?id=3_hgF1NAXU7).
+
+
+ ## Training procedure
+
+ The model was fine-tuned for eight epochs, directly following the Hugging Face summarization training [example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization).
+
+ ## Eval results
+
+ Dataset     | Split      | WER
+ ------------|------------|------
+ CrowdSpeech | test-clean | 4.99
+ CrowdSpeech | test-other | 10.61
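For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a hypothesis and a reference, divided by the number of reference words. The sketch below is an illustrative pure-Python version with a hypothetical function name, `word_error_rate`; it is not the evaluation script behind the table. The README does not state how the figures were normalized or averaged, or whether they are percentages.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (one fewer than in the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("sample text", "sampl text")` returns 0.5, since one of the two reference words is wrong.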
+
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{pavlichenko2021vox,
+       title={Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription},
+       author={Nikita Pavlichenko and Ivan Stelmakh and Dmitry Ustalov},
+       year={2021},
+       eprint={2107.01091},
+       archivePrefix={arXiv},
+       primaryClass={cs.SD}
+ }
+ ```