Update README.md

SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative large language models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B), supporting text, audio and video as input and only text as output. The SpeechLMM 1.0 models are optimized for various X-to-text generation tasks, namely:

- Machine Translation
- Automatic Speech Recognition
- Speech Translation
- Speech Summarization
- Spoken Question Answering
- Spoken Language Understanding (beta)
- Visual Speech Recognition (beta)

**Model Developer:** Meetween consortium

SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone.

| Model | Params | Input modalities | Output modalities | Context length |
|---|---|---|---|---|
| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |

#### Audio and video encoders

#### LLM backbone

| Model | LLM backbone |
|---|---|
| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

## How to use

Important: before you can use this model, you must follow these steps:

3. Download the Auto-AVSR video encoder weights from [here](https://drive.google.com/file/d/1shcWXUK2iauRhW9NbwCc25FjU1CoMm8i/view?usp=sharing) and put them in `path/to/some_directory_2`
4. Open `config.json` and set `video_encoder._name_or_path` to `path/to/some_directory_2/vsr_trlrs3vox2_base.pth` (a small script for this is sketched below)
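
Step 4 amounts to a one-line JSON edit. Below is a minimal sketch in Python (not an official snippet from this card), assuming `config.json` sits in the current directory and `video_encoder` is a nested object inside it:

```python
import json

CONFIG_PATH = "config.json"
# Directory from step 3; replace with your actual download location.
WEIGHTS = "path/to/some_directory_2/vsr_trlrs3vox2_base.pth"

with open(CONFIG_PATH) as f:
    config = json.load(f)

# Point the video encoder at the downloaded Auto-AVSR checkpoint
# (the `video_encoder._name_or_path` field from step 4).
config["video_encoder"]["_name_or_path"] = WEIGHTS

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```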

## Training data

### Monolingual

| Task | Description | Dataset | License | Metrics |
|---|---|---|---|---|
| | | **Spoken SQUAD** | CC-BY-SA-4.0 | |
| | | **Speech-Massive** | CC-BY-NC-SA-4.0 | |
| **VSR** | Visual Speech Recognition | **LRS2-BBC** | Custom | WER |
| **SSUM** | Speech Summarization | **AMI** | CC-BY-4.0 | Rouge-1, Rouge-2, Rouge-L |
| | | **ICSI** | CC-BY-4.0 | |
| **SQA** | Spoken Question Answering | **Spoken SQUAD** | CC-BY-SA-4.0 | Accuracy, Exact Match, F1 |
### Multilingual

| Task | Description | Dataset | License | Metrics |
|---|---|---|---|---|
| **SLU** | Spoken Language Understanding | **Speech-Massive** | CC-BY-NC-SA-4.0 | Intent Accuracy |
| | | **SLURP** | CC BY 4.0 (text) <br> CC BY-NC 4.0 (audio) | |

## Evaluation data

coming soon...

## Framework versions
- Transformers 4.45.0
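
As a usage note, here is a hypothetical loading sketch pinned to the Transformers version above. It assumes the setup steps from "How to use" are complete and that the checkpoint ships custom modeling code loadable via `trust_remote_code`; the local path is a placeholder, since this card does not show a full inference snippet:

```python
# Hypothetical sketch, not an official snippet from this card.
from transformers import AutoModel  # the card pins Transformers 4.45.0

model = AutoModel.from_pretrained(
    "path/to/local/checkpoint",  # placeholder: your downloaded model directory
    trust_remote_code=True,      # assumption: the repo ships custom modeling code
)
```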