---
library_name: transformers
tags:
- generated_from_trainer
model-index:
- name: Llama-speechlmm-1.0-s
  results: []
---

## Model information

The SpeechLMM 1.0 collection of multimodal and multilingual large language models is a collection of instruction-tuned generative models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B), supporting text, audio and video as input and only text as output. The SpeechLMM 1.0 models are optimized for various X-to-text generation tasks, namely:

- Machine Translation
- Automatic Speech Recognition
- Speech Translation
- Speech Summarization
- Spoken Question Answering
- Spoken Language Understanding (beta)
- Visual Speech Recognition (beta)

**Model Developer:** Meetween consortium

**Supported Languages:** English, French, Italian, German, and Spanish are officially supported (for a subset of the supported tasks). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder have been trained on a broader collection of languages than these 5, so the model might exhibit good performance on other languages too.

**Model Release Date:** Feb 28, 2025

**License:** see [LICENSE](LICENSE)

### Model Architecture

SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder ([SeamlessM4T v2](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/)) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder ([Auto-AVSR](https://ieeexplore.ieee.org/document/10096889)) and a video adapter.

| Model | Params | Input modalities | Output modalities | Context Length |
|:---------------- |:----------- |:------------------------------------------ |:----------------- |:-------------- |
| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual text | 128k |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual text | 128k |

#### Audio and video encoders

For all 4 sizes of SpeechLMM 1.0, the audio encoder is **SeamlessM4T v2 Large** (`facebook/seamless-m4t-v2-large`) and the video encoder is **Auto-AVSR** (`vsr_trlrs3vox2_base`).

#### Audio and video adapters

For all 4 sizes of SpeechLMM 1.0, the audio and video adapters are:

| Modality | Architecture | Number of layers | Compression factor |
| :------- | :----------- | :--------------- | :----------------- |
| Audio | MLP | 4 | 1 |
| Video | Window-level Q-former (4 queries) | 4 | 4 |
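To make the adapter table concrete, here is a minimal PyTorch sketch of the two adapter shapes it describes: a 4-layer MLP that leaves the audio feature rate unchanged (compression factor 1), and a window-level Q-Former that compresses each window of video features into 4 learned queries. All dimensions, the window size of 16, and the layer layout are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn


class MLPAudioAdapter(nn.Module):
    """4-layer MLP projecting audio-encoder features into the LLM embedding
    space. Compression factor 1: the time dimension is left unchanged."""

    def __init__(self, in_dim: int = 1024, out_dim: int = 2048, num_layers: int = 4):
        super().__init__()
        layers: list[nn.Module] = []
        dim = in_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(dim, out_dim), nn.GELU()]
            dim = out_dim
        layers.append(nn.Linear(dim, out_dim))
        self.proj = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, out_dim)
        return self.proj(x)


class WindowQFormerVideoAdapter(nn.Module):
    """Simplified window-level Q-Former: each fixed-size window of video
    features is compressed into `num_queries` learned query vectors via
    cross-attention (a full Q-Former also interleaves self-attention/FFN)."""

    def __init__(self, in_dim: int = 512, out_dim: int = 2048,
                 num_queries: int = 4, window: int = 16, num_layers: int = 4):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(num_queries, in_dim))
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(in_dim, num_heads=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.out = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim); time must be divisible by the window size
        b, t, d = x.shape
        x = x.reshape(b * (t // self.window), self.window, d)
        q = self.queries.expand(x.size(0), -1, -1)
        for attn in self.layers:
            q = q + attn(q, x, x)[0]  # queries cross-attend to window frames
        # (batch, time / window * num_queries, out_dim): 16 frames -> 4 tokens
        return self.out(q.reshape(b, -1, d))
```

With a window of 16 frames and 4 queries, a sequence of T video features becomes T/4 adapter tokens, matching the 4x compression factor in the table.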
#### LLM backbone

| Model | Backbone |
|:---------------- |:---------------------- |
| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

## How to use

Currently, this model can only be used via our [`speechlmm`](https://github.com/meetween/speechlmm) codebase. Refer to the instructions there for more details.

Important: before you can use this model, you must download the SeamlessM4T v2 speech encoder and the Auto-AVSR video encoder by following the instructions provided in the README of the above repo. Please note that by doing so, you agree to their respective license terms.
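The checkpoint files themselves can be fetched from the Hub as usual before pointing the `speechlmm` codebase at them. A minimal sketch, where the repository id is assumed from this card's name and not verified:

```python
# Hedged sketch: download the checkpoint locally, then follow the speechlmm
# README to run it. The repo id below is an assumption based on this card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("meetween/Llama-speechlmm-1.0-s")
print(f"Checkpoint downloaded to: {local_dir}")
```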
## Training Data

### Monolingual

| TASK | Task name | Dataset | Language | License |
| -------- | ----------------------------- | ------------------ | -------- | -------------------------------------- |
| **ASR** | Automatic Speech Recognition | **LibriHeavy** | en | CC-BY-4.0 |
| | | **LibriTTS** | en | CC-BY-4.0 |
| | | **AMI** | en | CC-BY-4.0 |
| | | **ICSI** | en | CC-BY-4.0 |
| **VSR** | Visual Speech Recognition | **LRS2-BBC** | en | Custom |
| **SSUM** | Speech Summarization | **AMI** | en | CC-BY-4.0 |
| | | **ICSI** | en | CC-BY-4.0 |
| **SQA** | Spoken Question Answering | **Spoken SQuAD** | en | CC-BY-SA-4.0 |
| **SLU** | Spoken Language Understanding | **SLURP** | en | CC-BY-4.0 (text), CC-BY-NC-4.0 (audio) |

### Multilingual

| TASK | Task name | Dataset | Language | License |
| ---------------- | ----------------------------- | ------------------------------------ | ------------------------------------------- | ---------------- |
| **ASR** | Automatic Speech Recognition | **CoVoST2** | en, fr, it, de, es | CC0 |
| | | **CommonVoice** | en, fr, it, de, es | Apache-2.0 |
| **ST** | Speech-to-text Translation | **CoVoST2** | en → de, {fr, it, de, es} → en | CC0 |
| | | **EuroParl-ST** | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| **MT** | Machine Translation | **EuroParl-ST** | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
| **TextInstruct** | Text Instruction Following | **Everything_Instruct_Multilingual** | en, fr, it, de, es, ru, zh, ko, ur, la, ar, hi, ja, nl, pt | Apache-2.0 |
| **SLU** | Spoken Language Understanding | **Speech-Massive** | fr, de | CC-BY-NC-SA-4.0 |

## Evaluation Results

The following results refer specifically to the S model.

### ASR Metrics

| Dataset | Language | WER ⬇ |
|:----------|:-----------|------:|
| **MUSTC** | en | 19.2 |
| **MTEDX** | it | 29.43 |
| **MTEDX** | fr | 28.97 |
| **ACL6060** | en | 19.4 |
| **MTEDX** | es | 29.71 |
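WER is reported as a percentage (lower is better). A minimal sketch of computing such a score with the third-party `jiwer` package; the example strings are made up and this is not the project's evaluation code:

```python
# Illustrative WER computation with jiwer; not the official evaluation script.
import jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over a lazy dog"]

wer = jiwer.wer(references, hypotheses)  # fraction of word errors
print(f"WER: {100 * wer:.2f}")           # the tables above report percentages
```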
### SQA Metrics

| Dataset | Language | Accuracy ⬆ |
|:--------------|:-----------|-----------:|
| **Spoken SQuAD** | en | 65.93 |

**NOTE**: Accuracy is measured with an LLM as a judge (**Llama3-70b-8192**, via the Groq API) using the following prompts:

- **System prompt**

  You are a helpful assistant that evaluates answers to questions given a certain context. You will be given inputs of the form:

  Context: \<context\>
  Question: \<question\>
  Answer: \<answer\>

  Your task is to determine if the given answer is correct or not, assuming the correct answer is contained in the context. Your response should be formatted as a JSON string having the following structure:

  {"correct_answer": \<true or false\>, "rationale": \<rationale\>}

  where 'rationale' must be a string explaining why the answer is correct or incorrect. If you need to include double quote characters (") in the 'rationale' string, you must escape them with a backslash (\\). For example, if you want to include the string "Hello, World!", you should write it as \\"Hello, World!\\".
- **User prompt**

  Context: \<context\>
  Question: \<question\>
  Answer: \<answer\>
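For reference, a minimal sketch of this judging setup using the official `groq` Python client. The helper name is illustrative, and `SYSTEM_PROMPT` stands for the full system prompt quoted above; the judge model id matches the one named in the NOTE:

```python
# Hedged sketch of the LLM-as-a-judge protocol described above; not the
# project's actual evaluation script.
import json
from groq import Groq

SYSTEM_PROMPT = "You are a helpful assistant that evaluates answers to questions given a certain context. ..."  # full prompt above

client = Groq()  # expects GROQ_API_KEY in the environment


def judge_answer(context: str, question: str, answer: str) -> bool:
    """Return True if the judge deems the answer correct."""
    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"},
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    return bool(verdict["correct_answer"])
```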
### MT Metrics

| Dataset | Source Language | Target Language | BLEU ⬆ | chrF ⬆ |
|:----------|:------------------|:------------------|-------:|-------:|
| **FLORES** | en | de | 21.11 | 51.77 |
| **FLORES** | en | es | 18.61 | 48.02 |
| **FLORES** | en | it | 16.63 | 47.24 |
| **ACL6060** | en | fr | 34.86 | 60.48 |
| **FLORES** | en | fr | 24.00 | 55.36 |

### SSUM Metrics

| Dataset | Language | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 |
|:----------|:-----------|-----------:|-----------:|-----------:|
| **ICSI** | en | 22.9 | 2.7 | 20.4 |

### ST Metrics

| Dataset | Source Language | Target Language | BLEU ⬆ | chrF ⬆ |
|:----------|:------------------|:------------------|-------:|-------:|
| **ACL6060** | en | fr | 28.65 | 56.20 |
| **ACL6060** | en | de | 19.12 | 49.06 |
| **MUSTC** | en | de | 16.98 | 45.48 |
| **MUSTC** | en | it | 14.68 | 43.03 |
| **MUSTC** | en | fr | 19.09 | 48.09 |
| **MUSTC** | en | es | 20.42 | 49.07 |

## Framework versions

- Transformers 4.45.0
- PyTorch 2.3.1+cu124.post2
- Datasets 3.2.0
- Tokenizers 0.20.0
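As a pointer for reproducing the BLEU and chrF scores reported in the MT and ST tables above, a minimal sketch using the `sacrebleu` package; the example sentences are made up and the project's exact evaluation settings are not reproduced here:

```python
# Illustrative BLEU/chrF computation with sacrebleu.
import sacrebleu

hypotheses = ["Das ist ein kleiner Test."]
references = [["Dies ist ein kleiner Test."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```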