Update README.md
ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.

The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while only utilizing ASR data. It thus enables **zero-shot E2E Speech Translation to all 200 languages supported by NLLB**.

For more details please refer to our [paper](https://arxiv.org/abs/2402.10422) and the [original repo](https://github.com/mt-upc/ZeroSwot) built on fairseq.

## Architecture

The compression module is a lightweight transformer that takes as input the hidden states of wav2vec2.0 and the corresponding CTC predictions, compresses them into subword-like embeddings similar to those expected by NLLB, and aligns them with the NLLB embedding space using Optimal Transport. At inference we simply pass the output of the speech encoder to the NLLB encoder.
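As a toy illustration of the compression step (this is not the ZeroSwot implementation; the averaging merge and blank handling here are simplifying assumptions), consecutive frames that share the same CTC prediction can be collapsed into one subword-like vector, with blank frames dropped:

```python
BLANK = 0  # assumed CTC blank index

def ctc_compress(frames, ctc_preds, blank=BLANK):
    """Collapse consecutive frames with the same CTC label.

    frames: list of feature vectors (lists of floats), one per time step.
    ctc_preds: argmax CTC label per time step.
    Returns one averaged vector per CTC segment; blank frames are skipped.
    """
    assert len(frames) == len(ctc_preds)
    segments = []  # (label, [frame vectors belonging to this segment])
    for vec, label in zip(frames, ctc_preds):
        if label == blank:
            continue
        if segments and segments[-1][0] == label:
            segments[-1][1].append(vec)  # extend the current segment
        else:
            segments.append((label, [vec]))  # start a new segment
    # Merge each segment into a single vector by dimension-wise averaging
    compressed = []
    for _, vecs in segments:
        n = len(vecs)
        compressed.append([sum(col) / n for col in zip(*vecs)])
    return compressed
```

In the actual model the merge is performed by a learned lightweight transformer rather than a plain average, and the resulting embeddings are aligned to NLLB's embedding space with an Optimal Transport objective.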

<div align=center><img src="resources/methodology.png" height="100%" width="100%"/></div>
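The Optimal Transport alignment can be sketched with entropy-regularized Sinkhorn iterations (a generic plain-Python sketch, not the actual training code; the uniform marginals and the value of `eps` are assumptions):

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT between two uniform distributions.

    cost: m x n cost matrix (list of lists), e.g. distances between
    compressed speech embeddings and NLLB subword embeddings.
    Returns the m x n transport plan.
    """
    m, n = len(cost), len(cost[0])
    # Gibbs kernel of the cost matrix
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match uniform marginals
        u = [(1.0 / m) / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [(1.0 / n) / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]
```

The plan gives a soft alignment between the two sequences; summing `plan[i][j] * cost[i][j]` yields the Wasserstein distance that can serve as the alignment loss.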

## Version

This version of ZeroSwot is trained with ASR data from CommonVoice, adapting [wav2vec2.0-large](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) to the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model.

## Usage

The model was tested with Python 3.9.16 and Transformers v4.41.2. Also install torchaudio and sentencepiece for audio processing and tokenization.

```bash
pip install transformers torchaudio sentencepiece