Update README.md
ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.

The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while only utilizing ASR data. It thus enables **zero-shot E2E Speech Translation to all 200 languages supported by NLLB**.

For more details please refer to our [paper](https://arxiv.org/abs/2402.10422) and the [original repo](https://github.com/mt-upc/ZeroSwot) built on fairseq.

## Architecture

The compression module is a lightweight transformer that takes as input the hidden states of wav2vec2.0 and the corresponding CTC predictions, compresses them into subword-like embeddings similar to those expected by NLLB, and aligns them with the NLLB embedding space using Optimal Transport. At inference we simply pass the output of the speech encoder to the NLLB encoder.
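As a toy illustration of the compression step (this is not the ZeroSwot implementation; the averaging merge and blank handling here are simplifying assumptions), consecutive frames that share the same CTC prediction can be collapsed into one subword-like vector, with blank frames dropped:

```python
BLANK = 0  # assumed CTC blank index

def ctc_compress(frames, ctc_preds, blank=BLANK):
    """Collapse consecutive frames with the same CTC label.

    frames: list of feature vectors (lists of floats), one per time step.
    ctc_preds: argmax CTC label per time step.
    Returns one averaged vector per CTC segment; blank frames are skipped.
    """
    assert len(frames) == len(ctc_preds)
    segments = []  # (label, [frame vectors belonging to this segment])
    for vec, label in zip(frames, ctc_preds):
        if label == blank:
            continue
        if segments and segments[-1][0] == label:
            segments[-1][1].append(vec)  # extend the current segment
        else:
            segments.append((label, [vec]))  # start a new segment
    # Merge each segment into a single vector by dimension-wise averaging
    compressed = []
    for _, vecs in segments:
        n = len(vecs)
        compressed.append([sum(col) / n for col in zip(*vecs)])
    return compressed
```

In the actual model the merge is performed by a learned lightweight transformer rather than a plain average, and the resulting embeddings are aligned to NLLB's embedding space with an Optimal Transport objective.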

<div align=center><img src="resources/methodology.png" height="100%" width="100%"/></div>
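The Optimal Transport alignment can be sketched with entropy-regularized Sinkhorn iterations (a generic plain-Python sketch, not the actual training code; the uniform marginals and the value of `eps` are assumptions):

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT between two uniform distributions.

    cost: m x n cost matrix (list of lists), e.g. distances between
    compressed speech embeddings and NLLB subword embeddings.
    Returns the m x n transport plan.
    """
    m, n = len(cost), len(cost[0])
    # Gibbs kernel of the cost matrix
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match uniform marginals
        u = [(1.0 / m) / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [(1.0 / n) / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]
```

The plan gives a soft alignment between the two sequences; summing `plan[i][j] * cost[i][j]` yields the Wasserstein distance that can serve as the alignment loss.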

## Version

This version of ZeroSwot is trained with ASR data from CommonVoice, adapting [wav2vec2.0-large](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) to the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model.

## Usage

The model was tested with Python 3.9.16 and Transformers v4.41.2. Also install torchaudio and sentencepiece for audio processing and tokenization.

```bash
pip install transformers torchaudio sentencepiece