---
license: unknown
---

# Seamless Expressive

SeamlessExpressive is a speech-to-speech translation model that captures certain underexplored aspects of prosody, such as speech rate and pauses, while preserving the style of one's voice and high content translation quality.

The SeamlessExpressive model consists of two main modules:

1. Prosody UnitY2, a prosody-aware speech-to-unit translation model based on the UnitY2 architecture;
2. PRETSSEL, a unit-to-speech model featuring cross-lingual expressivity preservation.

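The two-stage flow can be sketched as a simple composition. This is purely illustrative: all names and shapes below are hypothetical stand-ins, not the real `seamless_communication` API.

```python
# Illustrative sketch of the two-stage SeamlessExpressive pipeline.
# All function names and data types are hypothetical placeholders.

def prosody_unity2(src_speech, expressivity_emb):
    """Stub for prosody-aware speech-to-unit translation (stage 1)."""
    # In the real model this emits discrete acoustic units conditioned on
    # the expressivity embedding; here we just return one token per sample.
    return ["unit"] * len(src_speech)

def pretssel(units, expressivity_emb):
    """Stub for expressive unit-to-speech generation (stage 2)."""
    # The real PRETSSEL vocodes units into a waveform; here, dummy samples.
    return [0.0] * len(units)

def translate(src_speech, expressivity_emb):
    """Stage 1 feeds stage 2; the same expressivity embedding conditions both."""
    units = prosody_unity2(src_speech, expressivity_emb)
    return pretssel(units, expressivity_emb)
```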
## Prosody UnitY2

Prosody UnitY2 is an expressive speech-to-unit translation model that injects an expressivity embedding from PRETSSEL into the unit generation. It can transfer phrase-level prosody such as speech rate or pauses.

## PRETSSEL

**P**aralinguistic **RE**presentation-based **T**extle**SS** acoustic mod**EL** (PRETSSEL) is an expressive unit-to-speech generator that can efficiently disentangle the semantic and expressivity components of speech. It transfers utterance-level expressivity such as the style of one's voice.

# Benchmark Datasets

## mExpresso (Multilingual Expresso)

mExpresso is an expressive S2ST dataset that includes seven styles of read speech (i.e., default, happy, sad, confused, enunciated, whisper and laughing) between English and five other languages: French, German, Italian, Mandarin and Spanish. We created the dataset by expanding a subset of the read speech in the [Expresso Dataset](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset). We first translated the English transcriptions into the other languages, keeping the emphasis markers in the transcriptions, and then gender-matched bilingual speakers read the translations in the style suggested by the markers.

We are currently open-sourcing the text translations into the other languages to enable evaluation of the English-to-X directions. We will open-source the audio files in the near future.

The text translations in the other languages can be [downloaded here](https://dl.fbaipublicfiles.com/seamless/datasets/mexpresso_text/mexpresso_text.tar).

### Statistics of mExpresso

| language pair | subset | # items | English duration (hr) | # speakers |
|---------------|--------|---------|-----------------------|------------|
| eng-cmn | dev  | 2369 | 2.1 | 1 |
| eng-cmn | test | 5003 | 4.8 | 2 |
| eng-deu | dev  | 4420 | 3.9 | 2 |
| eng-deu | test | 5733 | 5.6 | 2 |
| eng-fra | dev  | 4770 | 4.2 | 2 |
| eng-fra | test | 5742 | 5.6 | 2 |
| eng-ita | dev  | 4413 | 3.9 | 2 |
| eng-ita | test | 5756 | 5.7 | 2 |
| eng-spa | dev  | 4758 | 4.2 | 2 |
| eng-spa | test | 5693 | 5.5 | 2 |

### Combine with English Expresso to create mExpresso S2T dataset

To create the English-to-other-languages speech-to-text dataset, run the following command. It will first download the English Expresso dataset, downsample the audio to 16 kHz, and join it with the text translations to form the manifests.

```bash
python3 -m seamless_communication.cli.expressivity.data.prepare_mexpresso \
  <OUTPUT_FOLDER>
```

The output manifests will be located at `<OUTPUT_FOLDER>/{dev,test}_mexpresso_eng_{spa,fra,deu,ita,cmn}.tsv`.

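As a quick sanity check, the brace pattern above expands to ten manifest filenames, one per split and target language:

```python
from itertools import product

# Expected manifest filenames produced by prepare_mexpresso,
# relative to <OUTPUT_FOLDER>.
splits = ["dev", "test"]
tgt_langs = ["spa", "fra", "deu", "ita", "cmn"]
manifests = [
    f"{split}_mexpresso_eng_{lang}.tsv"
    for split, lang in product(splits, tgt_langs)
]
```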
## Automatic evaluation

Python package dependencies (on top of seamless_communication; these come from the stopes pipelines):

* Unidecode
* scipy
* phonemizer
* s3prl
* syllables
* ipapy
* pkuseg
* nltk
* fire

```bash
pip install Unidecode scipy phonemizer s3prl syllables ipapy pkuseg nltk fire
```

As described in Section 4.3, we use the following automatic metrics:

1. **ASR-BLEU**: refer to `/src/seamless_communication/cli/eval_utils` to see how the OpenAI Whisper ASR model is used to extract transcriptions from the generated audio.
2. **Vocal Style Similarity**: refer to *TBD stopes public link* for implementation details.
3. **AutoPCP**: refer to *TBD stopes public link* for implementation details.
4. **Pause and Rate scores**: refer to *TBD stopes public link* for implementation details. The Rate score is the Spearman correlation of syllable-level speech rate between the source and predicted speech. The Pause score is the weighted mean joint score produced by the `stopes/eval/local_prosody/compare_utterances.py` script from the stopes repo.
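For intuition, the Spearman correlation behind the Rate score can be sketched in plain Python. This is a minimal tie-aware implementation for illustration only; the actual pipeline computes it inside the stopes tooling.

```python
def rankdata(values):
    """Assign 1-based average ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice one would pass the per-utterance syllable rates of the source and predicted speech as `xs` and `ys`.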
## Evaluation results: mExpresso

Please see *TBD public data link* for how to download the evaluation data.

*Important notes*:

* We used empirically chosen duration factors per target language to obtain the best perceptual quality: 1.0 (default) for cmn, spa and ita; 1.1 for deu; 1.2 for fra. The same settings were used to report results in the "Seamless: Multilingual Expressive and Streaming Speech Translation" paper.
* The results here differ slightly from the ones shown in the paper due to several discrepancies in the pipeline: the results reported here use a pipeline with the fairseq2 backend for model inference, and the pipeline includes watermarking.

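The per-language duration factors above can be captured in a small lookup; `duration_factor` is a hypothetical helper name for illustration, not part of the codebase.

```python
# Empirically chosen duration factors per target language
# (values taken from the notes above).
DURATION_FACTOR = {"cmn": 1.0, "spa": 1.0, "ita": 1.0, "deu": 1.1, "fra": 1.2}

def duration_factor(tgt_lang: str) -> float:
    """Return the duration factor for a target language, defaulting to 1.0."""
    return DURATION_FACTOR.get(tgt_lang, 1.0)
```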
| Language | Partition | ASR-BLEU | Vocal Style Sim | AutoPCP | Pause | Rate |
|----------|-----------|----------|-----------------|---------|-------|------|
| eng_cmn | dev | 26.080 | 0.207 | 3.168 | 0.236 | 0.538 |
| eng_deu | dev | 36.940 | 0.261 | 3.298 | 0.319 | 0.717 |
| eng_fra | dev | 37.780 | 0.231 | 3.285 | 0.331 | 0.682 |
| eng_ita | dev | 40.170 | 0.226 | 3.322 | 0.388 | 0.734 |
| eng_spa | dev | 42.400 | 0.228 | 3.379 | 0.332 | 0.702 |
| eng_cmn | test | 23.320 | 0.249 | 2.984 | 0.385 | 0.522 |
| eng_deu | test | 27.780 | 0.290 | 3.117 | 0.483 | 0.717 |
| eng_fra | test | 38.360 | 0.270 | 3.117 | 0.506 | 0.663 |
| eng_ita | test | 38.020 | 0.274 | 3.130 | 0.523 | 0.686 |
| eng_spa | test | 42.920 | 0.274 | 3.183 | 0.508 | 0.675 |

### Step-by-step evaluation

Pre-requisite: all steps described here assume that generation/inference has been completed following the steps from *TBD inference link*.

For stopes installation, please refer to *TBD stopes installation link*.

Set the following variables, pointing at the directory of generated outputs:

```bash
export SPLIT="dev_mexpresso_eng_spa"  # example, change for your split
export TGT_LANG="spa"
export SRC_LANG="eng"
export GENERATED_DIR="path_to_generated_output_for_given_data_split"
export STOPES_ROOT="path_to_stopes_code_repo"
export SC_ROOT="path_to_this_repo"
```

**ASR-BLEU evaluation**

```bash
python ${SC_ROOT}/src/seamless_communication/cli/expressivity/evaluate/run_asr_bleu.py \
    --generation_dir_path=${GENERATED_DIR} \
    --generate_tsv_filename=generate-${SPLIT}.tsv \
    --tgt_lang=${TGT_LANG}
```

* `generate-${SPLIT}.tsv` is the expected output of the inference step described in the pre-requisite.
* `run_asr_bleu.py` creates an additional manifest called `output_manifest.tsv` inside `--generation_dir_path`, which includes all relevant columns needed for this evaluation.

After completion, the resulting ASR-BLEU score is written to `${GENERATED_DIR}/s2st_asr_bleu_normalized.json`.

**Vocal Style Similarity**

Download the finetuned WavLM checkpoint and set its path (`${SPKSIM_MODEL_PATH}`) as described in the [stopes README](https://github.com/fairinternal/seamless_common/tree/main/stopes/eval/spkr_similarity#pre-requisites) (FIX PRIVATE LINK) to reproduce our vocal style similarity evaluation.

```bash
python -m stopes.modules +speaker_similarity=base \
    launcher.cluster=local \
    speaker_similarity.model_type=valle \
    +speaker_similarity.model_path=${SPKSIM_MODEL_PATH} \
    +speaker_similarity.input_file=${GENERATED_DIR}/output_manifest.tsv \
    +speaker_similarity.output_file=${GENERATED_DIR}/spksim_result.txt \
    speaker_similarity.named_columns=true \
    speaker_similarity.src_audio_column=audio \
    speaker_similarity.tgt_audio_column=hypo_audio
```

* We report the average of all utterance-level scores written to `${GENERATED_DIR}/spksim_result.txt`.
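The reported number is a plain mean over the per-utterance scores. A minimal sketch, assuming one numeric score per line (the actual `spksim_result.txt` layout may differ):

```python
def average_score(lines):
    """Mean of per-utterance scores, one numeric score per non-empty line.
    The real result-file format is an assumption here."""
    scores = [float(line) for line in lines if line.strip()]
    return sum(scores) / len(scores)
```

The same aggregation applies to the AutoPCP results below.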
**AutoPCP**

```bash
python -m stopes.modules +compare_audios=AutoPCP_multilingual_v2 \
    launcher.cluster=local \
    +compare_audios.input_file=${GENERATED_DIR}/output_manifest.tsv \
    compare_audios.src_audio_column=audio \
    compare_audios.tgt_audio_column=hypo_audio \
    +compare_audios.named_columns=true \
    +compare_audios.output_file=${GENERATED_DIR}/autopcp_result.txt
```

* We report the average of all utterance-level scores written to `${GENERATED_DIR}/autopcp_result.txt`.

**Pause and Rate**

This stage includes three steps: (1) source-language annotation, (2) target-language annotation, and (3) pairwise comparison.

```bash
# src lang pause & rate annotation
python ${STOPES_ROOT}/stopes/eval/local_prosody/annotate_utterances.py \
    +data_path=${GENERATED_DIR}/output_manifest.tsv \
    +result_path=${GENERATED_DIR}/${SRC_LANG}_speech_rate_pause_annotation.tsv \
    +audio_column=audio \
    +text_column=raw_src_text \
    +speech_units=[syllable] \
    +vad=true \
    +net=true \
    +lang=$SRC_LANG \
    +forced_aligner=fairseq2_nar_t2u_aligner

# tgt lang pause & rate annotation
python ${STOPES_ROOT}/stopes/eval/local_prosody/annotate_utterances.py \
    +data_path=${GENERATED_DIR}/output_manifest.tsv \
    +result_path=${GENERATED_DIR}/${TGT_LANG}_speech_rate_pause_annotation.tsv \
    +audio_column=hypo_audio \
    +text_column=s2t_out \
    +speech_units=[syllable] \
    +vad=true \
    +net=true \
    +lang=$TGT_LANG \
    +forced_aligner=fairseq2_nar_t2u_aligner

# pairwise comparison
python ${STOPES_ROOT}/stopes/eval/local_prosody/compare_utterances.py \
    +src_path=${GENERATED_DIR}/${SRC_LANG}_speech_rate_pause_annotation.tsv \
    +tgt_path=${GENERATED_DIR}/${TGT_LANG}_speech_rate_pause_annotation.tsv \
    +result_path=${GENERATED_DIR}/${SRC_LANG}_${TGT_LANG}_pause_scores.tsv \
    +pause_min_duration=0.1
```

* For Rate reporting, please see the aggregation function `get_rate` in `${SC_ROOT}/src/seamless_communication/cli/expressivity/evaluate/post_process_pauserate.py`.
* For Pause reporting, please see the aggregation function `get_pause` in `${SC_ROOT}/src/seamless_communication/cli/expressivity/evaluate/post_process_pauserate.py`.
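Since the Pause score is described as a weighted mean joint score, its aggregation shape is a standard weighted mean. This sketch only illustrates that formula; the actual weighting used by `get_pause` is defined in `post_process_pauserate.py`.

```python
def weighted_mean(scores, weights):
    """Weighted mean: sum(w_i * s_i) / sum(w_i).
    Illustrative only; the real per-utterance weights come from get_pause."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight
```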