	---
	library_name: transformers
	tags:
	- generated_from_trainer
	model-index:
	- name: Llama-speechlmm-1.0-xl
	results: []
	---

	## Model information

	SpeechLMM 1.0 is a collection of multimodal, multilingual, instruction-tuned generative large language models in four sizes: S (2B), M (4B), L (9B) and XL (71B). The models accept text, audio and video as input and produce only text as output. They are optimized for various X-to-text generation tasks, namely:

	- Machine Translation
	- Automatic Speech Recognition
	- Speech Translation
	- Speech Summarization
	- Spoken Question Answering
	- Spoken Language Understanding (beta)
	- Visual Speech Recognition (beta)

	Model Developer: Meetween consortium

	Supported Languages: English, French, Italian, German, and Spanish are officially supported (for a subset of the supported tasks). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder have been trained on a broader collection of languages than these 5 supported languages, so the model might exhibit good performance on other languages too.

	Model Release Date: Feb 28, 2025

	License: see [LICENSE](LICENSE)

	### Model Architecture

	SpeechLMM 1.0 is an auto-regressive multimodal language model built from three components: a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder ([SeamlessM4T v2](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/)) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder ([Auto-AVSR](https://ieeexplore.ieee.org/document/10096889)) and a video adapter.

	<!-- TODO: add the image of the model architecture here -->

	| Model | Params | Input modalities | Output modalities | Context length |
	|:--- |:--- |:--- |:--- |:--- |
	| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual text | 128k |
	| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual text | 128k |
	| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual text | 128k |
	| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual text | 128k |

	#### Audio and video encoders

	All four sizes of SpeechLMM 1.0 use the same encoders: SeamlessM4T v2 Large (`facebook/seamless-m4t-v2-large`) for audio and Auto-AVSR (`vsr_trlrs3vox2_base`) for video.

	#### Audio and video adapters

	All four sizes of SpeechLMM 1.0 use the same audio and video adapters:

	| Modality | Architecture | Number of layers | Compression factor |
	|:--- |:--- |:--- |:--- |
	| Audio | MLP | 4 | 1 |
	| Video | Window-level Q-former <br> (4 queries) | 4 | 4 |
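
To make the compression factor column concrete, here is a minimal sketch of the sequence-length arithmetic it suggests. It assumes the compression factor is the ratio between the number of encoder frames and the number of embeddings the adapter hands to the LLM, and that a trailing partial group is rounded up; this card does not define either point. `adapted_length` is a hypothetical helper, not part of the `speechlmm` codebase.

```python
import math

# Compression factors copied from the adapter table above.
COMPRESSION_FACTOR = {"audio": 1, "video": 4}


def adapted_length(num_encoder_frames: int, modality: str) -> int:
    """Estimate how many embeddings the adapter passes to the LLM.

    Assumption: the compression factor is the ratio of encoder frames to
    adapter outputs, and a trailing partial group is rounded up.
    """
    if num_encoder_frames < 0:
        raise ValueError("num_encoder_frames must be non-negative")
    return math.ceil(num_encoder_frames / COMPRESSION_FACTOR[modality])


print(adapted_length(100, "audio"))  # 100: the MLP adapter keeps the length
print(adapted_length(100, "video"))  # 25: the Q-former shortens it 4x
```

Under these assumptions, only the video stack shortens the sequence that reaches the LLM context.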

	#### LLM backbone

	| Model | Backbone |
	|:--- |:--- |
	| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
	| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
	| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
	| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

	## How to use

	Currently, this model can only be used via our [`speechlmm`](https://github.com/meetween/speechlmm) codebase. Refer to the instructions there for more details.

	Important: before you can use this model, you must download the SeamlessM4T v2 speech encoder and the Auto-AVSR video encoder by following the instructions in the README of the above repo. By doing so, you agree to their respective license terms.

	## Training Data

	### Monolingual

	| Task | Task name | Dataset | Language | License |
	| --- | --- | --- | --- | --- |
	| ASR | Automatic Speech Recognition | LibriHeavy | en | CC-BY-4.0 |
	| | | LibriTTS | en | CC-BY-4.0 |
	| | | AMI | en | CC-BY-4.0 |
	| | | ICSI | en | CC-BY-4.0 |
	| VSR | Visual Speech Recognition | LRS2-BBC | en | Custom |
	| SSUM | Speech Summarization | AMI | en | CC-BY-4.0 |
	| | | ICSI | en | CC-BY-4.0 |
	| SQA | Spoken Question Answering | Spoken SQuAD | en | CC-BY-SA-4.0 |
	| SLU | Spoken Language Understanding | SLURP | en | CC-BY-4.0 (text) <br> CC-BY-NC-4.0 (audio) |

	### Multilingual

	| Task | Task name | Dataset | Language | License |
	| --- | --- | --- | --- | --- |
	| ASR | Automatic Speech Recognition | CoVoST2 | en, fr, it, de, es | CC0 |
	| | | CommonVoice | en, fr, it, de, es | Apache-2.0 |
	| ST | Speech-to-text Translation | CoVoST2 | en → de, {fr, it, de, es} → en | CC0 |
	| | | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
	| MT | Machine Translation | EuroParl-ST | {en, fr, it, de, es} → {en, fr, it, de, es} | CC-BY-NC-4.0 |
	| TextInstruct | Text Instruction Following | Everything_Instruct_Multilingual | en, fr, it, de, es, ru, zh, ko, ur, la, ar,<br>hi, ja, nl, pt | Apache-2.0 |
	| SLU | Spoken Language Understanding | Speech-Massive | fr, de | CC-BY-NC-SA-4.0 |
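
The set notation in the Language column can be expanded into explicit translation directions. The sketch below does this for the ST rows, assuming that `{A} → {B}` means every source/target pair with a different source and target language; the card does not say whether same-language pairs are excluded. The function names and the `src-tgt` string format are illustrative only.

```python
from itertools import permutations

# The five officially supported languages listed in this card.
LANGS = ["en", "fr", "it", "de", "es"]


def europarl_st_directions(langs=LANGS):
    """Expand {langs} → {langs}, skipping same-language pairs (assumption)."""
    return [f"{src}-{tgt}" for src, tgt in permutations(langs, 2)]


def covost2_st_directions():
    """Expand 'en → de, {fr, it, de, es} → en' from the CoVoST2 ST row."""
    return ["en-de"] + [f"{src}-en" for src in ["fr", "it", "de", "es"]]


print(len(europarl_st_directions()))  # 20
print(covost2_st_directions())  # ['en-de', 'fr-en', 'it-en', 'de-en', 'es-en']
```

Under this reading, EuroParl-ST contributes 20 directions and CoVoST2 contributes 5.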

	## Evaluation Results
	Results for the XL model are coming soon...

	## Framework versions

	- Transformers 4.45.0
	- PyTorch 2.3.1+cu124.post2
	- Datasets 3.2.0
	- Tokenizers 0.20.0
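
To compare a local environment against the versions listed above, something along these lines may help. It is a sketch, not part of the `speechlmm` codebase: the helper names are hypothetical, the distribution name `torch` for PyTorch is the usual pip name, and the exact local version string reported for PyTorch may differ from build to build.

```python
from importlib import metadata

# Versions copied from the list above, keyed by pip distribution name.
EXPECTED = {
    "transformers": "4.45.0",
    "torch": "2.3.1+cu124.post2",
    "datasets": "3.2.0",
    "tokenizers": "0.20.0",
}


def _installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected=EXPECTED, lookup=_installed_version):
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# An empty dict means the environment matches the versions listed above.
print(version_mismatches())
```

A mismatch does not necessarily mean the model will fail to load; these are only the versions recorded in this card.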
