reach-vb HF staff committed on
Commit
502d6e7
·
1 Parent(s): 091083a

Create README.md

Files changed (1)
  1. README.md +131 -0
README.md ADDED
@@ -0,0 +1,131 @@
+ # SeamlessM4T
+ SeamlessM4T is designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
+
+ SeamlessM4T covers:
+ - 📥 101 languages for speech input
+ - ⌨️ 96 languages for text input/output
+ - 🗣️ 35 languages for speech output
+
+ This unified model supports multiple tasks without relying on separate models:
+ - Speech-to-speech translation (S2ST)
+ - Speech-to-text translation (S2TT)
+ - Text-to-speech translation (T2ST)
+ - Text-to-text translation (T2TT)
+ - Automatic speech recognition (ASR)
+
+ ## SeamlessM4T models
+ | Model Name | #params | checkpoint | metrics |
+ | - | - | - | - |
+ | SeamlessM4T-Large | 2.3B | [model]() | [metrics]() |
+ | SeamlessM4T-Medium | 1.2B | [model]() | [metrics]() |
+
+ Extensive evaluation results for SeamlessM4T-Large and SeamlessM4T-Medium, as reported in the paper (as averages), are provided in the `metrics` files above.
+
+ ## Instructions to run inference with SeamlessM4T models
+
+ Inference requires a `Translator` object instantiated with a multitask UnitY model, selected from the options:
+ - `multitask_unity_large`
+ - `multitask_unity_medium`
+
+ and a vocoder, `vocoder_36langs`.
+
+ ```python
+ import torch
+ import torchaudio
+ from seamless_communication.models.inference import Translator
+
+
+ # Initialize a Translator object with a multitask model and a vocoder on the GPU.
+ translator = Translator("multitask_unity_large", "vocoder_36langs", torch.device("cuda:0"))
+ ```
+
+ Now `predict()` can be used to run inference any number of times on any of the supported tasks.
+
+ Given an input audio file at `<path_to_input_audio>` or an input text `<input_text>` in `<src_lang>`,
+ we can translate into `<tgt_lang>` as follows:
+
+ ### S2ST and T2ST:
+
+ ```python
+ # S2ST
+ translated_text, wav, sr = translator.predict(<path_to_input_audio>, "s2st", <tgt_lang>)
+
+ # T2ST
+ translated_text, wav, sr = translator.predict(<input_text>, "t2st", <tgt_lang>, src_lang=<src_lang>)
+ ```
+ Note that `<src_lang>` must be specified for T2ST.
+
+ The generated units are synthesized into a waveform, and the output audio file is saved, with:
+
+ ```python
+ wav, sr = translator.synthesize_speech(<speech_units>, <tgt_lang>)
+
+ # Save the translated audio generation.
+ torchaudio.save(
+     <path_to_save_audio>,
+     wav[0].cpu(),
+     sample_rate=sr,
+ )
+ ```
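
Since one `Translator` can serve repeated `predict()` calls, a small loop is enough to process several inputs without reloading the model. The following is a minimal sketch, not part of `seamless_communication`: the helper name `translate_many` is hypothetical, and it assumes only the call shape shown above (positional input, task, and target language, an optional `src_lang` keyword, and a three-element return value).

```python
from typing import Any, Iterable, List, Optional, Tuple


def translate_many(
    translator: Any,
    inputs: Iterable[str],
    task: str,
    tgt_lang: str,
    src_lang: Optional[str] = None,
) -> List[Tuple[Any, Any, Any]]:
    """Run one task over several inputs with a single, already-loaded translator.

    `translator` is assumed to expose predict(input, task, tgt_lang, src_lang=...)
    returning a 3-tuple, as in the README examples. This helper is hypothetical.
    """
    results = []
    for item in inputs:
        if src_lang is None:
            # Audio-input tasks in the examples omit src_lang entirely.
            results.append(translator.predict(item, task, tgt_lang))
        else:
            # Text-input tasks pass src_lang as a keyword argument.
            results.append(translator.predict(item, task, tgt_lang, src_lang=src_lang))
    return results
```

With a real `Translator`, each element of the returned list is the tuple that `predict()` produced for the corresponding input, in input order.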
+
+ ### S2TT, T2TT and ASR:
+
+ ```python
+ # S2TT
+ translated_text, _, _ = translator.predict(<path_to_input_audio>, "s2tt", <tgt_lang>)
+
+ # ASR
+ # This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
+ transcribed_text, _, _ = translator.predict(<path_to_input_audio>, "asr", <src_lang>)
+
+ # T2TT
+ translated_text, _, _ = translator.predict(<input_text>, "t2tt", <tgt_lang>, src_lang=<src_lang>)
+ ```
+ Note that `<src_lang>` must be specified for T2TT.
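
The source-language requirement for the two text-input tasks can be checked before calling `predict()`. The sketch below is a hypothetical helper, not part of `seamless_communication`; it relies only on the five task names used in the examples above and on the notes that T2ST and T2TT need `<src_lang>`.

```python
from typing import Optional

# Task names as they appear in the README examples.
AUDIO_INPUT_TASKS = {"s2st", "s2tt", "asr"}
TEXT_INPUT_TASKS = {"t2st", "t2tt"}


def check_task_args(task: str, src_lang: Optional[str] = None) -> None:
    """Raise ValueError if the task is unknown or a text-input task lacks src_lang.

    Hypothetical pre-flight check; the library may validate differently.
    """
    if task not in AUDIO_INPUT_TASKS | TEXT_INPUT_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if task in TEXT_INPUT_TASKS and not src_lang:
        raise ValueError(f"src_lang must be specified for task {task!r}")
```

Calling `check_task_args("t2tt")` raises, while `check_task_args("s2tt")` returns silently, mirroring the notes above.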
+
+ ### Inference using the CLI, from the root directory of the repository:
+
+ The model can be specified with, e.g., `--model_name multitask_unity_large`:
+
+ S2ST:
+ ```
+ python scripts/m4t/predict/predict.py <path_to_input_audio> s2st <tgt_lang> --output_path <path_to_save_audio> --model_name multitask_unity_large
+ ```
+
+ S2TT:
+ ```
+ python scripts/m4t/predict/predict.py <path_to_input_audio> s2tt <tgt_lang>
+ ```
+
+ T2TT:
+ ```
+ python scripts/m4t/predict/predict.py <input_text> t2tt <tgt_lang> --src_lang <src_lang>
+ ```
+
+ T2ST:
+ ```
+ python scripts/m4t/predict/predict.py <input_text> t2st <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio>
+ ```
+
+ ASR:
+ ```
+ python scripts/m4t/predict/predict.py <path_to_input_audio> asr <tgt_lang>
+ ```
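
When these commands are launched from another Python program, building the argument list programmatically avoids shell-quoting mistakes with input text that contains spaces. This is a rough sketch under stated assumptions: the function name `build_predict_command` is hypothetical, and it uses only the script path, positional order, and the `--src_lang`, `--output_path`, and `--model_name` flags shown in the commands above.

```python
import shlex
from typing import List, Optional

# Script path as given in the CLI examples, relative to the repository root.
PREDICT_SCRIPT = "scripts/m4t/predict/predict.py"


def build_predict_command(
    input_value: str,
    task: str,
    tgt_lang: str,
    src_lang: Optional[str] = None,
    output_path: Optional[str] = None,
    model_name: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for predict.py (hypothetical helper)."""
    cmd = ["python", PREDICT_SCRIPT, input_value, task, tgt_lang]
    # Optional flags are appended only when a value is supplied.
    if src_lang is not None:
        cmd += ["--src_lang", src_lang]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    if model_name is not None:
        cmd += ["--model_name", model_name]
    return cmd


def as_shell_string(cmd: List[str]) -> str:
    """Render the argument list as a safely quoted shell command for logging."""
    return shlex.join(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` with the working directory set to the repository root.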
+
+ ## Citation
+ If you use SeamlessM4T in your work, or any models/datasets/artifacts published in SeamlessM4T, please cite:
+
+ ```bibtex
+ @article{seamlessm4t2023,
+   title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
+   author={{Seamless Communication} and Lo\"{i}c Barrault and Yu-An Chung and Mariano Cora Meglioli and David Dale and Ning Dong and Paul-Ambroise Duquenne and Hady Elsahar and Hongyu Gong and Kevin Heffernan and John Hoffman and Christopher Klaiber and Pengwei Li and Daniel Licht and Jean Maillard and Alice Rakotoarison and Kaushik Ram Sadagopan and Guillaume Wenzek and Ethan Ye and Bapi Akula and Peng-Jen Chen and Naji El Hachem and Brian Ellis and Gabriel Mejia Gonzalez and Justin Haaheim and Prangthip Hansanti and Russ Howes and Bernie Huang and Min-Jae Hwang and Hirofumi Inaguma and Somya Jain and Elahe Kalbassi and Amanda Kallet and Ilia Kulikov and Janice Lam and Daniel Li and Xutai Ma and Ruslan Mavlyutov and Benjamin Peloquin and Mohamed Ramadan and Abinesh Ramakrishnan and Anna Sun and Kevin Tran and Tuan Tran and Igor Tufanov and Vish Vogeti and Carleigh Wood and Yilin Yang and Bokai Yu and Pierre Andrews and Can Balioglu and Marta R. Costa-juss\`{a} and Onur \c{C}elebi and Maha Elbayad and Cynthia Gao and Francisco Guzm\'an and Justine Kao and Ann Lee and Alexandre Mourachko and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang},
+   journal={ArXiv},
+   year={2023}
+ }
+ ```
+ ## License
+
+ seamless_communication is CC-BY-NC 4.0 licensed.