Update README.md

SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative large language models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B), supporting text, audio and video as input and only text as output. The SpeechLMM 1.0 models are optimized for various X-to-text generation tasks, namely:

- Machine Translation
- Automatic Speech Recognition
- Speech Translation
- Speech Summarization
- Spoken Question Answering
- Spoken Language Understanding (beta)
- Visual Speech Recognition (beta)

**Model Developer:** Meetween consortium

SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone.

| Model | Params | Input modalities | Output modalities | Context length |
|---|---|---|---|---|
| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |

#### Audio and video encoders

#### LLM backbone

| Model | LLM backbone |
|---|---|
| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

## How to use

Important: before you can use this model, you must follow these steps:

3. Download the Auto-AVSR video encoder weights from [here](https://drive.google.com/file/d/1shcWXUK2iauRhW9NbwCc25FjU1CoMm8i/view?usp=sharing) and put them in `path/to/some_directory_2`
4. Open `config.json` and set `video_encoder._name_or_path` to `path/to/some_directory_2/vsr_trlrs3vox2_base.pth` (a small script for this is sketched below)
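
Step 4 amounts to a one-line JSON edit. Below is a minimal sketch in Python (not an official snippet from this card), assuming `config.json` sits in the current directory and `video_encoder` is a nested object inside it:

```python
import json

CONFIG_PATH = "config.json"
# Directory from step 3; replace with your actual download location.
WEIGHTS = "path/to/some_directory_2/vsr_trlrs3vox2_base.pth"

with open(CONFIG_PATH) as f:
    config = json.load(f)

# Point the video encoder at the downloaded Auto-AVSR checkpoint
# (the `video_encoder._name_or_path` field from step 4).
config["video_encoder"]["_name_or_path"] = WEIGHTS

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```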

## Training data

### Monolingual

| Task | Description | Dataset | License | Metrics |
|---|---|---|---|---|
| | | **Spoken SQUAD** | CC-BY-SA-4.0 | |
| | | **Speech-Massive** | CC-BY-NC-SA-4.0 | |
| **VSR** | Visual Speech Recognition | **LRS2-BBC** | Custom | WER |
| **SSUM** | Speech Summarization | **AMI** | CC-BY-4.0 | Rouge-1, Rouge-2, Rouge-L |
| | | **ICSI** | CC-BY-4.0 | |
| **SQA** | Spoken Question Answering | **Spoken SQUAD** | CC-BY-SA-4.0 | Accuracy, Exact Match, F1 |
### Multilingual

| Task | Description | Dataset | License | Metrics |
|---|---|---|---|---|
| **SLU** | Spoken Language Understanding | **Speech-Massive** | CC-BY-NC-SA-4.0 | Intent Accuracy |
| | | **SLURP** | CC BY 4.0 (text) <br> CC BY-NC 4.0 (audio) | |

## Evaluation data

coming soon...

## Framework versions
- Transformers 4.45.0
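
As a usage note, here is a hypothetical loading sketch pinned to the Transformers version above. It assumes the setup steps from "How to use" are complete and that the checkpoint ships custom modeling code loadable via `trust_remote_code`; the local path is a placeholder, since this card does not show a full inference snippet:

```python
# Hypothetical sketch, not an official snippet from this card.
from transformers import AutoModel  # the card pins Transformers 4.45.0

model = AutoModel.from_pretrained(
    "path/to/local/checkpoint",  # placeholder: your downloaded model directory
    trust_remote_code=True,      # assumption: the repo ships custom modeling code
)
```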