|
--- |
|
|
|
|
|
license: cc-by-4.0 |
|
language: |
|
- fr |
|
- en |
|
library_name: hibiki |
|
tags: |
|
- speech |
|
- translation |
|
- streaming |
|
metrics: |
|
- bleu |
|
--- |
|
|
|
# Model Card for Hibiki |
|
|
|
[Hibiki](https://github.com/kyutai-labs/hibiki) is a model for streaming speech translation (also known as *simultaneous* translation). Unlike offline translation, where one waits for the end of the source utterance before starting to translate, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, optionally with voice transfer, along with a text translation.
|
Hibiki currently only supports French-to-English translation. |
|
|
|
## Model Details |
|
|
|
This is the model referred to as *Hibiki-M* (for *Mobile*) in our [paper](https://arxiv.org/abs/2502.03382): a 1.7B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio generated at a bitrate of 1.1kbps.
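
For intuition, the bitrate follows directly from the token stream, assuming the Mimi configuration used for Moshi (8 codebooks of 2048 entries, i.e. 11 bits per audio token):

$$ 8 \times 11\ \text{bits} \times 12.5\ \text{frames/s} = 1100\ \text{bits/s} \approx 1.1\ \text{kbps} $$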
|
|
|
### Model Description |
|
|
|
Hibiki is a decoder-only model for simultaneous speech translation. It leverages the multistream architecture of [Moshi](https://arxiv.org/abs/2410.00037) to model source and target speech jointly, which allows it to continuously process the input stream while generating the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5Hz, yielding a continuous output audio stream along with a timestamped text translation. Since Hibiki relies on simple temperature sampling, it is compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki's voice transfer can be controlled by changing the coefficient of the Classifier-Free Guidance: a larger coefficient increases voice similarity, but excessive values can degrade translation quality.
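
To make the role of the guidance coefficient concrete, here is a minimal sketch of classifier-free guidance combined with temperature sampling. This is not Hibiki's inference code: the function name, tensor shapes and default values are illustrative assumptions.

```python
import torch

def cfg_sample(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               cfg_coef: float = 3.0,
               temperature: float = 0.8) -> torch.Tensor:
    # Standard CFG combination: uncond + coef * (cond - uncond).
    # A larger `cfg_coef` pushes generation closer to the conditioning
    # (here, the source voice), at the risk of hurting translation quality.
    logits = uncond_logits + cfg_coef * (cond_logits - uncond_logits)
    probs = torch.softmax(logits / temperature, dim=-1)
    # Plain temperature sampling over the guided distribution.
    return torch.multinomial(probs, num_samples=1)

# Example with a batch of 4 sequences over a 2048-entry codebook.
cond = torch.randn(4, 2048)
uncond = torch.randn(4, 2048)
tokens = cfg_sample(cond, uncond)  # shape: (4, 1)
```

Because decoding reduces to an element-wise combination of logits followed by sampling, it applies identically to every sequence in a batch, which is what makes batched inference straightforward.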
|
|
|
|
|
- **Developed by:** Kyutai |
|
- **Model type:** Simultaneous speech-to-speech and speech-to-text translation. |
|
- **Language(s) (NLP):** French-to-English |
|
- **License:** CC-BY 4.0
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [repo](https://github.com/kyutai-labs/hibiki) |
|
- **Paper:** [paper](https://arxiv.org/abs/2502.03382) |
|
- **Examples:** [demo](https://hf.co/spaces/kyutai/hibiki-samples) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
The model can be used for streaming translation from French to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds.
|
|
|
|
|
### Downstream Use |
|
|
|
Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at a framerate of 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems. As for the main Hibiki architecture, supporting other language pairs would require finetuning.
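
As an illustration of reusing Mimi on its own, the sketch below loads the codec through the `moshi` Python package and tokenizes a waveform. The loader names and constants follow the Moshi repository's documented usage and should be treated as assumptions; check the installed `moshi` version for the exact API.

```python
# Sketch: load the Mimi codec via the `moshi` package and tokenize audio.
# `loaders.DEFAULT_REPO`, `loaders.MIMI_NAME` and `loaders.get_mimi` follow the
# Moshi repository's documented usage; exact names may differ across versions.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # 8 codebooks at 12.5Hz -> 1.1kbps

wav = torch.randn(1, 1, 24000 * 10)  # 10 s of dummy 24 kHz mono audio
with torch.no_grad():
    codes = mimi.encode(wav)    # discrete tokens, [batch, codebooks, frames]
    audio = mimi.decode(codes)  # waveform reconstructed from the tokens
print(codes.shape)  # expected (1, 8, 125): 12.5 frames/s over 10 s
```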
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
The model is not intended to be used to impersonate other people, nor for any malicious use of any kind.
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
See the main [README](https://github.com/kyutai-labs/hibiki) file. |
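
For reference, the snippet below only shows how the model files could be fetched from the Hugging Face Hub; the repository id is a placeholder, and the actual checkpoint names and inference commands (PyTorch, MLX, Rust) are given in the README.

```python
# Sketch: fetch the model files with huggingface_hub.
# "kyutai/hibiki-..." is a placeholder; see the Hibiki README for the actual
# repository ids and for the inference entry points.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/hibiki-...")
print(f"Model files downloaded to {local_dir}")
```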
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- Textual data: the underlying [Helium](https://huggingface.co/kyutai/helium-1-preview-2b) model is trained on a mix of data including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o), and Common Crawl.
|
|
|
- Audio data |
|
|
|
- **Unsupervised audio dataset:** used for pre-training, this is a collection of 7M hours of readily available audio content in English and 450k hours in French, following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037). |
|
- **Synthetic translation dataset:** Around 40k hours of parallel French-English data synthesized with *contextual alignment* (see [Section 3.2](https://arxiv.org/pdf/2502.03382)) with various levels of speaker similarity.
- **Translation finetuning:** A 900-hour mixture of a resynthesized version of [CVSS-T](https://github.com/google-research-datasets/cvss) and synthetic long-form utterances.
|
|
|
### Training procedure and hyper-parameters |
|
|
|
The different stages of the training procedure are detailed in the paper, along with the corresponding hyper-parameters.
|
|
|
### Compute Infrastructure |
|
|
|
The final model was trained on 48 Nvidia H100 GPUs.
|
|
|
## Citation |
|
|
|
``` |
|
@misc{labiausse2025hibiki,
      title={High-Fidelity Simultaneous Speech-To-Speech Translation},
      author={Tom Labiausse and Laurent Mazaré and Edouard Grave and Patrick Pérez and Alexandre Défossez and Neil Zeghidour},
      year={2025},
      eprint={2502.03382},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.03382},
}
|
``` |
|
|
|
|
|
## Model Card Authors |
|
|
|
Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour |