Cantonese Wav2Vec2-Conformer-Base with Relative Position Embeddings

wav2vec 2.0 Conformer with relative position embeddings, pretrained on 2.8K hours of Cantonese spontaneous speech data sampled at 16kHz.

Note: This model has not been fine-tuned on labeled data, so it cannot perform speech recognition out of the box; it must first be fine-tuned for a downstream task such as ASR.
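As a sketch of how the pretrained encoder might be used for feature extraction with the Hugging Face transformers library (the repo id is a placeholder, and the model expects raw 16 kHz mono waveforms, matching the pretraining data):

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ConformerModel

# Placeholder repo id -- substitute the actual Hub id of this checkpoint:
# model = Wav2Vec2ConformerModel.from_pretrained("<this-repo-id>")

# A default feature extractor (zero-mean/unit-variance normalization,
# sampling_rate=16000) prepares raw waveforms for the encoder.
feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000, do_normalize=True)

one_second = np.random.randn(16000).astype(np.float32)  # dummy 1 s clip at 16 kHz
inputs = feature_extractor(one_second, sampling_rate=16000, return_tensors="np")
print(inputs["input_values"].shape)  # (1, 16000)

# With the model loaded, contextual representations would come from:
# hidden_states = model(**inputs).last_hidden_state
```

Audio sampled at any other rate should be resampled to 16 kHz before being passed to the feature extractor.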

Alternative Version

An alternative version of the model, pre-trained on the same dataset but with layer_norm_first set to false, is available here as a fairseq checkpoint and may give better downstream results.
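In fairseq's wav2vec 2.0 model config, this corresponds to the encoder's layer_norm_first flag (false selects post-layer-norm blocks). A hedged, illustrative fragment, assuming the standard fairseq Wav2Vec2Config field names:

```yaml
# Illustrative fairseq wav2vec 2.0 config fragment (field names from
# fairseq's Wav2Vec2Config; values here describe the alternative checkpoint).
model:
  _name: wav2vec2
  layer_norm_first: false   # post-layer-norm variant
  pos_enc_type: rel_pos     # relative position embeddings
```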

Citation

Please cite the following paper if you use the model.

@inproceedings{huang23h_interspeech,
  author={Ranzo Huang and Brian Mak},
  title={{wav2vec 2.0 ASR for Cantonese-Speaking Older Adults in a Clinical Setting}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
  pages={4958--4962},
  doi={10.21437/Interspeech.2023-2470}
}