|
--- |
|
|
|
|
|
license: cc-by-4.0 |
|
language: |
|
- fr |
|
- en |
|
library_name: hibiki |
|
tags: |
|
- speech |
|
- translation |
|
- streaming |
|
metrics: |
|
- bleu |
|
--- |
|
|
|
# Model Card for Hibiki |
|
|
|
[Hibiki](https://github.com/kyutai-labs/hibiki) is a model for streaming speech translation (also known as *simultaneous* translation). Unlike offline translation, where one waits for the end of the source utterance before starting to translate, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, optionally with voice transfer, along with a text translation.
|
Hibiki currently only supports French-to-English translation. |
|
|
|
## Model Details |
|
|
|
This is the model referred to as *Hibiki-M* (for *Mobile*) in our [paper](https://arxiv.org/abs/2502.03382): a 1.7B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio generated at a bitrate of 1.1kbps.
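
For intuition, the bitrate follows directly from the token stream, assuming the Mimi configuration used for Moshi (8 codebooks of 2048 entries, i.e. 11 bits per audio token):

$$ 8 \times 11\ \text{bits} \times 12.5\ \text{frames/s} = 1100\ \text{bits/s} \approx 1.1\ \text{kbps} $$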
|
|
|
### Model Description |
|
|
|
Hibiki is a decoder-only model for simultaneous speech translation. It leverages the multistream architecture of [Moshi](https://arxiv.org/abs/2410.00037) to model source and target speech jointly, which allows it to continuously process the input stream while generating the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5Hz, yielding a continuous output audio stream along with a timestamped text translation. Since Hibiki relies on simple temperature sampling, it is compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki's voice transfer can be controlled by changing the coefficient of the Classifier-Free Guidance: a larger coefficient increases voice similarity, but excessive values can degrade translation quality.
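
To make the role of the guidance coefficient concrete, here is a minimal sketch of classifier-free guidance combined with temperature sampling. This is not Hibiki's inference code: the function name, tensor shapes and default values are illustrative assumptions.

```python
import torch

def cfg_sample(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               cfg_coef: float = 3.0,
               temperature: float = 0.8) -> torch.Tensor:
    # Standard CFG combination: uncond + coef * (cond - uncond).
    # A larger `cfg_coef` pushes generation closer to the conditioning
    # (here, the source voice), at the risk of hurting translation quality.
    logits = uncond_logits + cfg_coef * (cond_logits - uncond_logits)
    probs = torch.softmax(logits / temperature, dim=-1)
    # Plain temperature sampling over the guided distribution.
    return torch.multinomial(probs, num_samples=1)

# Example with a batch of 4 sequences over a 2048-entry codebook.
cond = torch.randn(4, 2048)
uncond = torch.randn(4, 2048)
tokens = cfg_sample(cond, uncond)  # shape: (4, 1)
```

Because decoding reduces to an element-wise combination of logits followed by sampling, it applies identically to every sequence in a batch, which is what makes batched inference straightforward.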
|
|
|
|
|
- **Developed by:** Kyutai |
|
- **Model type:** Simultaneous speech-to-speech and speech-to-text translation. |
|
- **Language(s) (NLP):** French-to-English |
|
- **License:** CC-BY 4.0
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [repo](https://github.com/kyutai-labs/hibiki) |
|
- **Paper:** [paper](https://arxiv.org/abs/2502.03382) |
|
- **Examples:** [demo](https://hf.co/spaces/kyutai/hibiki-samples) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
The model can be used for streaming translation from French to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds.
|
|
|
|
|
### Downstream Use |
|
|
|
Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at a framerate of 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems. As for the main Hibiki architecture, supporting other language pairs would require finetuning.
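
As an illustration of reusing Mimi on its own, the sketch below loads the codec through the `moshi` Python package and tokenizes a waveform. The loader names and constants follow the Moshi repository's documented usage and should be treated as assumptions; check the installed `moshi` version for the exact API.

```python
# Sketch: load the Mimi codec via the `moshi` package and tokenize audio.
# `loaders.DEFAULT_REPO`, `loaders.MIMI_NAME` and `loaders.get_mimi` follow the
# Moshi repository's documented usage; exact names may differ across versions.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # 8 codebooks at 12.5Hz -> 1.1kbps

wav = torch.randn(1, 1, 24000 * 10)  # 10 s of dummy 24 kHz mono audio
with torch.no_grad():
    codes = mimi.encode(wav)    # discrete tokens, [batch, codebooks, frames]
    audio = mimi.decode(codes)  # waveform reconstructed from the tokens
print(codes.shape)  # expected (1, 8, 125): 12.5 frames/s over 10 s
```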
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
The model is not intended to be used to impersonate other people, nor for any malicious use of any kind.
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
See the main [README](https://github.com/kyutai-labs/hibiki) file. |
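
For reference, the snippet below only shows how the model files could be fetched from the Hugging Face Hub; the repository id is a placeholder, and the actual checkpoint names and inference commands (PyTorch, MLX, Rust) are given in the README.

```python
# Sketch: fetch the model files with huggingface_hub.
# "kyutai/hibiki-..." is a placeholder; see the Hibiki README for the actual
# repository ids and for the inference entry points.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/hibiki-...")
print(f"Model files downloaded to {local_dir}")
```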
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- Textual data: the underlying [Helium](https://huggingface.co/kyutai/helium-1-preview-2b) model is trained on a mix of data including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o), and Common Crawl.
|
|
|
- Audio data |
|
|
|
- **Unsupervised audio dataset:** used for pre-training, this is a collection of 7M hours of readily available audio content in English and 450k hours in French, following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037). |
|
- **Synthetic translation dataset:** Around 40k hours of parallel French-English data synthesized with *contextual alignment* (see [Section 3.2](https://arxiv.org/pdf/2502.03382)) with various levels of speaker similarity.
- **Translation finetuning:** A 900-hour mixture of a resynthesized version of [CVSS-T](https://github.com/google-research-datasets/cvss) and synthetic long-form utterances.
|
|
|
### Training procedure and hyper-parameters |
|
|
|
The different stages of the training procedure are detailed in the paper, along with the corresponding hyper-parameters.
|
|
|
### Compute Infrastructure |
|
|
|
The final model was trained on 48 Nvidia H100 GPUs.
|
|
|
## Citation |
|
|
|
``` |
|
@misc{labiausse2025hibiki,
      title={High-Fidelity Simultaneous Speech-To-Speech Translation},
      author={Tom Labiausse and Laurent Mazaré and Edouard Grave and Patrick Pérez and Alexandre Défossez and Neil Zeghidour},
      year={2025},
      eprint={2502.03382},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.03382},
}
|
``` |
|
|
|
|
|
## Model Card Authors |
|
|
|
Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour |