---
inference: false
tags:
- SeamlessM4T
- seamless_m4t
license: cc-by-nc-4.0
library_name: transformers
---
# SeamlessM4T Large
SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

This repository hosts 🤗 Hugging Face's implementation of SeamlessM4T. You can find the original weights, as well as a guide on how to run them, in the original hub repositories (large and medium checkpoints).
SeamlessM4T Large covers:
- 📥 101 languages for speech input
- ⌨️ 96 languages for text input/output
- 🗣️ 35 languages for speech output
This is the "large" variant of the unified model, which enables multiple tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
You can perform all the above tasks from one single model, `SeamlessM4TModel`, but each task also has its own dedicated sub-model.
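For reference, here is how the tasks map to the dedicated classes in transformers. Three of these classes appear in the snippets below; the full listing is a sketch based on the library's naming pattern:

```python
# Dedicated sub-model classes, one per task
# (ASR is covered by the speech-to-text model).
from transformers import (
    SeamlessM4TForSpeechToSpeech,  # S2ST
    SeamlessM4TForSpeechToText,    # S2TT and ASR
    SeamlessM4TForTextToSpeech,    # T2ST
    SeamlessM4TForTextToText,      # T2TT
)
```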
## 🤗 Usage
First, load the processor and a checkpoint of the model:
```python
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("ylacombe/hf-seamless-m4t-large")
model = SeamlessM4TModel.from_pretrained("ylacombe/hf-seamless-m4t-large")
```
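If you have a GPU available, generation is much faster there. This is standard transformers usage rather than anything SeamlessM4T-specific; a minimal sketch:

```python
import torch

# Optional: move the model to GPU when one is available.
# The processed inputs below must then be moved to the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```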
You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.
### Speech
You can easily generate translated speech with `SeamlessM4TModel.generate`. Here is an example showing how to generate speech from English to Russian.
```python
inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

audio_array = model.generate(**inputs, tgt_lang="rus")
audio_array = audio_array[0].cpu().numpy().squeeze()
```
You can also translate directly from a speech waveform. Here is an example from Arabic to Russian:
```python
from datasets import Audio, load_dataset

dataset = load_dataset("arabic_speech_corpus", split="test[0:1]")
# The model expects 16 kHz audio, so resample the dataset column first.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
audio_sample = dataset["audio"][0]["array"]

inputs = processor(audios=audio_sample, return_tensors="pt")

audio_array = model.generate(**inputs, tgt_lang="rus")
audio_array = audio_array[0].cpu().numpy().squeeze()
```
Listen to the speech samples either in an ipynb notebook:
```python
from IPython.display import Audio

sampling_rate = model.config.sampling_rate
Audio(audio_array, rate=sampling_rate)
```
Or save them as a `.wav` file using a third-party library, e.g. `scipy`:
```python
import scipy.io.wavfile

sampling_rate = model.config.sampling_rate
scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sampling_rate, data=audio_array)
```
#### Tips
`SeamlessM4TModel` is the transformers top-level model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
For example, you can replace the previous snippet with the model dedicated to the S2ST task:
```python
from transformers import SeamlessM4TForSpeechToSpeech

model = SeamlessM4TForSpeechToSpeech.from_pretrained("ylacombe/hf-seamless-m4t-large")
```
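The generation call itself is unchanged. A minimal sketch, reusing the `processor` and the Arabic `audio_sample` from the snippets above:

```python
# Same call as with SeamlessM4TModel; only the components needed
# for speech-to-speech translation are loaded.
inputs = processor(audios=audio_sample, return_tensors="pt")

audio_array = model.generate(**inputs, tgt_lang="rus")
audio_array = audio_array[0].cpu().numpy().squeeze()
```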
### Text
Similarly, you can generate translated text from text or audio files. This time, let's use the dedicated models as an example.
```python
from transformers import SeamlessM4TForSpeechToText

model = SeamlessM4TForSpeechToText.from_pretrained("ylacombe/hf-seamless-m4t-large")

audio_sample = dataset["audio"][0]["array"]

inputs = processor(audios=audio_sample, return_tensors="pt")

output_tokens = model.generate(**inputs, tgt_lang="fra")
translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
```
And from text:
```python
from transformers import SeamlessM4TForTextToText

model = SeamlessM4TForTextToText.from_pretrained("ylacombe/hf-seamless-m4t-large")

inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

output_tokens = model.generate(**inputs, tgt_lang="fra")
translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
```
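The remaining task from the list above, T2ST, follows the same pattern. A minimal sketch, assuming `SeamlessM4TForTextToSpeech` behaves like the other dedicated classes:

```python
from transformers import SeamlessM4TForTextToSpeech

model = SeamlessM4TForTextToSpeech.from_pretrained("ylacombe/hf-seamless-m4t-large")

inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

audio_array = model.generate(**inputs, tgt_lang="rus")
audio_array = audio_array[0].cpu().numpy().squeeze()
```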
#### Tips
Three last tips:

1. `SeamlessM4TModel` can generate text and/or speech. Pass `generate_speech=False` to `SeamlessM4TModel.generate` to only generate text. You also have the possibility to pass `return_intermediate_token_ids=True` to get both text token ids and the generated speech.
2. You have the possibility to change the speaker used for speech synthesis with the `spkr_id` argument.
3. You can use different generation strategies for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which will successively perform beam-search decoding on the text model and multinomial sampling on the speech model (see the combined sketch below).
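A minimal sketch combining these arguments, using only the keyword names quoted above and assuming `model` is the `SeamlessM4TModel` loaded at the start of the Usage section:

```python
inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

# Text-only generation: the speech decoder is skipped entirely.
output_tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False)

# Speech generation with a different speaker, beam search on the text
# model and multinomial sampling on the speech model, also keeping the
# intermediate text token ids.
outputs = model.generate(
    **inputs,
    tgt_lang="rus",
    spkr_id=2,
    text_num_beams=4,
    speech_do_sample=True,
    return_intermediate_token_ids=True,
)
```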