Nikita Pavlichenko
committed on
Commit
·
167a3c7
1
Parent(s):
dd150ca
Create README.md
README.md
ADDED
@@ -0,0 +1,66 @@
---
language:
- en
tags:
- text aggregation
- summarization
license: apache-2.0
datasets:
- toloka/CrowdSpeech
metrics:
- wer
---

# T5 Large for Text Aggregation

## Model description

This is a T5 Large model fine-tuned for crowdsourced text aggregation. The model takes multiple performers' responses and produces a single aggregated response. This approach was first introduced during the [VLDB'21 Crowd Science Challenge](https://crowdscience.ai/challenges/vldb21) and was originally implemented by the second-place competitor, whose code is available on [GitHub](https://github.com/A1exRey/VLDB2021_workshop_t5).

## How to use

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Responses from several performers for one task, separated by " | "
text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
```
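
The example input above packs several responses into one string separated by ` | `. Below is a minimal sketch of a helper that builds such a string from a list of responses. It assumes the separator is exactly the one shown in the example; `build_aggregation_input` and `SEPARATOR` are hypothetical names used for illustration and are not part of the model's API.

```python
from typing import Iterable

SEPARATOR = " | "  # assumption: copied from the example input string


def build_aggregation_input(responses: Iterable[str]) -> str:
    """Join performers' responses for one task into a single input string."""
    # Strip surrounding whitespace and drop empty responses before joining.
    cleaned = [r.strip() for r in responses if r and r.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty response is required")
    return SEPARATOR.join(cleaned)


text = build_aggregation_input(["samplee text", "sampl text", "sample textt"])
print(text)
```

The resulting string can be passed to `tokenizer.encode` in place of the hard-coded example.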

## Training data

Pretrained weights were taken from the [original](https://huggingface.co/t5-large) T5 Large model by Google. For more details on the T5 architecture and training procedure, see https://arxiv.org/abs/1910.10683.

The model was fine-tuned on the `train-clean`, `dev-clean`, and `dev-other` parts of the [CrowdSpeech](https://huggingface.co/datasets/toloka/CrowdSpeech) dataset, which was introduced in [our paper](https://openreview.net/forum?id=3_hgF1NAXU7).

## Training procedure

The model was fine-tuned for eight epochs, directly following the Hugging Face summarization training [example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization).
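
The summarization example can read custom training data from JSON lines files that have one text column and one summary column. The sketch below shows one way crowdsourced transcriptions might be converted to that layout. The field names `text` and `summary`, the ` | ` separator (borrowed from the usage example), the task identifiers, and both helper names are assumptions made for illustration; this is not the preprocessing actually used for this model.

```python
import json
from typing import Dict, List


def make_records(
    transcriptions: Dict[str, List[str]],
    ground_truth: Dict[str, str],
) -> List[dict]:
    """Pair the joined crowd responses of each task with its reference text."""
    records = []
    for task_id, responses in transcriptions.items():
        if task_id not in ground_truth:
            continue  # skip tasks that have no reference transcription
        records.append(
            {"text": " | ".join(responses), "summary": ground_truth[task_id]}
        )
    return records


def to_jsonl(records: List[dict]) -> str:
    """Serialize records as JSON lines, one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


records = make_records(
    {"audio_1": ["samplee text", "sampl text", "sample textt"]},
    {"audio_1": "sample text"},
)
print(to_jsonl(records), end="")
```

The serialized string can be written to a file and supplied to the training script as the training file, with the text and summary column names set to match.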

## Eval results

Dataset     | Split      | WER
------------|------------|-------
CrowdSpeech | test-clean | 4.99
CrowdSpeech | test-other | 10.61
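
WER is the word error rate: the word-level edit distance between the model output and the reference, divided by the number of reference words. The pure-Python sketch below illustrates the metric. It is not the evaluation code behind the table; details such as text normalization, corpus-level versus per-sentence averaging, and whether the table values are percentages are not stated here and are left as assumptions.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from the first i reference words to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction; multiply by 100 for a percentage."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


print(wer("sample text", "sample text"))   # 0.0
print(wer("sample text", "samplee text"))  # 0.5
```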

### BibTeX entry and citation info

```bibtex
@misc{pavlichenko2021vox,
      title={Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription},
      author={Nikita Pavlichenko and Ivan Stelmakh and Dmitry Ustalov},
      year={2021},
      eprint={2107.01091},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```