stp99 committed · Commit 469e9fd · verified · 1 Parent(s): 7a1e92f

Update README.md

Files changed (1): README.md (+8 -11)
README.md CHANGED
@@ -11,15 +11,13 @@ model-index:
 
 The SpeechLMM 1.0 collection of multimodal and multilingual large language models is a collection of instruction-tuned generative models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B), supporting text, audio and video as input and only text as output. The SpeechLMM 1.0 models are optimized for various X-to-text generation tasks, namely:
 
-- Text-to-text Instruction Following
 - Machine Translation
-- Text Summarization
 - Automatic Speech Recognition
 - Speech Translation
 - Speech Summarization
 - Spoken Question Answering
-- Spoken Language Understanding
-- Visual Speech Recognition
+- Spoken Language Understanding (beta)
+- Visual Speech Recognition (beta)
 
 **Model Developer:** Meetween consortium
 
@@ -40,7 +38,7 @@ SpeechLMM 1.0 an auto-regressive multimodal language model based on a Llama 3.X
 | SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual Text | 128k |
 | SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual Text | 128k |
 | SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual Text | 128k |
-| SpeechLMM 1.0 XL | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |
+| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |
 
 #### Audio and video encoders
 
@@ -61,7 +59,7 @@ For all the 4 sizes of SpeechLMM 1.0, the audio and video adapters are:
 | SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
 | SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
 | SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
-| SpeechLMM 1.0 XL | Llama 3.3 70B Instruct |
+| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |
 
 ## How to use
 
@@ -82,7 +80,7 @@ Important: before you can use this model, you must follow these steps:
 3. Download the Auto-AVSR video encoder weights from [here](https://drive.google.com/file/d/1shcWXUK2iauRhW9NbwCc25FjU1CoMm8i/view?usp=sharing) and put them in `path/to/some_directory_2`
 4. Go to `config.json` and change the `video_encoder._name_or_path` to `path/to/some_directory_2/vsr_trlrs3vox2_base.pth`
 
-## Training and evaluation data
+## Training data
 
 ### Monolingual
 
@@ -94,12 +92,8 @@ Important: before you can use this model, you must follow these steps:
 | | | **Spoken SQUAD** | CC-BY-SA-4.0 | |
 | | | **Speech-Massive** | CC-BY-NC-SA-4.0 | |
 | **VSR** | Visual Speech Recognition | **LRS2-BBC** | Custom | WER |
-| **TSUM** | Text Summarization | **AMI** | CC-BY-4.0 | Rouge-1, Rouge-2, Rouge-L |
-| | | **ICSI** | CC-BY-SA-4.0 | |
 | **SSUM** | Speech Summarization | **AMI** | CC-BY-4.0 | Rouge-1, Rouge-2, Rouge-L |
 | | | **ICSI** | CC-BY-4.0 | |
-| **TTS** | Text-to-Speech Synthesis | **LibriTTS** | CC-BY-4.0 | WER, CER, SSIM, UTMOS |
-| | | **LJSpeech** | Public domain | |
 | **SQA** | Spoken Question Answering | **Spoken SQUAD** | CC-BY-SA-4.0 | Accuracy, Exact Match, F1 |
 
 ### Multilingual
@@ -117,6 +111,9 @@ Important: before you can use this model, you must follow these steps:
 | **SLU** | Spoken Language Understanding | **Speech-Massive** | CC-BY-NC-SA-4.0 | Intent Accuracy |
 | | | **SLURP** | CC BY 4.0 (text) <br> CC BY-NC 4.0 (audio) | |
 
+## Evaluation data
+coming soon...
+
 ## Framework versions
 
 - Transformers 4.45.0
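
Step 4 of the "How to use" instructions quoted above amounts to a single nested-key edit in `config.json`. Below is a minimal Python sketch of that edit; it assumes `config.json` contains a top-level `video_encoder` object (as the dotted key `video_encoder._name_or_path` suggests), and the paths are the placeholders from the README, not real locations.

```python
import json

config_path = "config.json"  # the model's config file (step 4)

with open(config_path) as f:
    config = json.load(f)

# Point the video encoder at the Auto-AVSR checkpoint downloaded in step 3.
# "path/to/some_directory_2" is the README's placeholder; substitute the
# directory where you actually saved the weights.
config["video_encoder"]["_name_or_path"] = (
    "path/to/some_directory_2/vsr_trlrs3vox2_base.pth"
)

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```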