metadata

language:
  - ace
  - acm
  - acq
  - aeb
  - af
  - ajp
  - ak
  - als
  - am
  - apc
  - ar
  - ars
  - ary
  - arz
  - as
  - ast
  - awa
  - ayr
  - azb
  - azj
  - ba
  - bm
  - ban
  - be
  - bem
  - bn
  - bho
  - bjn
  - bo
  - bs
  - bug
  - bg
  - ca
  - ceb
  - cs
  - cjk
  - ckb
  - crh
  - cy
  - da
  - de
  - dik
  - dyu
  - dz
  - el
  - en
  - eo
  - et
  - eu
  - ee
  - fo
  - fj
  - fi
  - fon
  - fr
  - fur
  - fuv
  - gaz
  - gd
  - ga
  - gl
  - gn
  - gu
  - ht
  - ha
  - he
  - hi
  - hne
  - hr
  - hu
  - hy
  - ig
  - ilo
  - id
  - is
  - it
  - jv
  - ja
  - kab
  - kac
  - kam
  - kn
  - ks
  - ka
  - kk
  - kbp
  - kea
  - khk
  - km
  - ki
  - rw
  - ky
  - kmb
  - kmr
  - knc
  - kg
  - ko
  - lo
  - lij
  - li
  - ln
  - lt
  - lmo
  - ltg
  - lb
  - lua
  - lg
  - luo
  - lus
  - lvs
  - mag
  - mai
  - ml
  - mar
  - min
  - mk
  - mt
  - mni
  - mos
  - mi
  - my
  - nl
  - nn
  - nb
  - npi
  - nso
  - nus
  - ny
  - oc
  - ory
  - pag
  - pa
  - pap
  - pbt
  - pes
  - plt
  - pl
  - pt
  - prs
  - quy
  - ro
  - rn
  - ru
  - sg
  - sa
  - sat
  - scn
  - shn
  - si
  - sk
  - sl
  - sm
  - sn
  - sd
  - so
  - st
  - es
  - sc
  - sr
  - ss
  - su
  - sv
  - swh
  - szl
  - ta
  - taq
  - tt
  - te
  - tg
  - tl
  - th
  - ti
  - tpi
  - tn
  - ts
  - tk
  - tum
  - tr
  - tw
  - tzm
  - ug
  - uk
  - umb
  - ur
  - uzn
  - vec
  - vi
  - war
  - wo
  - xh
  - ydd
  - yo
  - yue
  - zh
  - zsm
  - zu
language_details: >-
  ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
  aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
  asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
  bam_Latn, ban_Latn,bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
  bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
  cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
  dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
  ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
  fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
  hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
  hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn,
  jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva,
  kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr,
  kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn,
  lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn,
  ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva,
  mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn,
  mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn,
  nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn,
  gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn,
  prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn,
  san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn,
  smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn,
  srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn,
  tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi,
  taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn,
  tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab,
  uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr,
  yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
license: mit
metrics:
  - bleu
datasets:
  - mozilla-foundation/common_voice_8_0
pipeline_tag: automatic-speech-recognition
tags:
  - zeroswot
  - speech translation
  - zero-shot
  - end-to-end
  - nllb
  - wav2vec2

ZeroSwot ✨🤖✨

ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.

The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while using only ASR data. It thus enables Speech Translation to all the 200 languages supported by NLLB. The compression module is a light-weight transformer that takes as input the hidden state of wav2vec2.0 and the corresponding CTC predictions, and compresses them to subword-like embeddings similar to those expected from NLLB and aligns them using Optimal Transport. For inference we simply pass the output of the speech encoder to NLLB encoder.

For more details please refer to our paper and the original repo build on fairseq.

This version of ZeroSwot is trained with ASR data from CommonVoice, and adapting wav2vec2.0-large to the nllb-200-distilled-600M model.

Usage

from transformers import Wav2Vec2Processor, NllbTokenizer, AutoModel, AutoModelForSeq2SeqLM
import soundfile as sf

# Load processors and tokenizers
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Load ZeroSwot Encoder
commit_hash = "1d38f5dbf4f89adefe06961e4ec344b21f74ebae"
zeroswot_encoder = AutoModel.from_pretrained(
    "johntsi/ZeroSwot-Medium_asr-cv_en-to-200", trust_remote_code=True, revision=commit_hash,
)
model.eval()
model.to("cuda")

# Load NLLB Model
nllb_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
nllb_model.eval()
nllb_model.to("cuda")

# Load sample .wav
audio, sr = sf.read("sample.wav")
assert sr == 16000, "Input of wav2vec2.0 is expected to have sampling rate of 16,000"
input_values = processor(audio, sampling_rate=16000, return_tensors="pt").cuda()

# translation to German
compressed_embeds, attention_mask = zeroswot_encoder(**input_values)
predicted_ids = nllb_model.generate(
    inputs_embeds=compressed_embeds,
    attention_mask=attention_mask,
    forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"],
    num_beams=5,
)
translation = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
print(translation)

Results

BLEU scores on CoVoST-2 test compared to supervised SOTA models XLS-R-1B and SeamlessM4T-Medium. You can refer to Table 5 of the Results section in the paper for more details.

Models	ZS	Size (B)	Ar	Ca	Cy	De	Et	Fa	Id	Ja	Lv	Mn	Sl	Sv	Ta	Tr	Zh	Average
XLS-R-1B	✗	1.0	19.2	32.1	31.8	26.2	22.4	21.3	30.3	39.9	22.0	14.9	25.4	32.3	18.1	17.1	36.7	26.0
SeamlessM4T-M	✗	1.2	20.8	37.3	29.9	31.4	23.3	17.2	34.8	37.5	19.5	12.9	29.0	37.3	18.9	19.8	30.0	26.6
ZeroSwot-M_asr-cv	✓	0.35/0.95	24.4	38.7	28.8	31.2	26.2	26.0	36.0	46.0	24.8	19.0	31.6	37.8	24.4	18.6	39.0	30.2

Citation

If you find ZeroSwot useful for your research, please cite our paper :)

@misc{tsiamas2024pushing,
      title={{Pushing the Limits of Zero-shot End-to-End Speech Translation}}, 
      author={Ioannis Tsiamas and Gerard I. Gállego and José A. R. Fonollosa and Marta R. Costa-jussà},
      year={2024},
      eprint={2402.10422},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}