|
## Overview |
|
|
|
We present a CLSRIL-23 (Cross Lingual Speech Representations on Indic Languages), a self supervised learning based audio pre-trained model which learns cross |
|
lingual speech representations from raw audio across **23 Indic languages**. It is built on top of wav2vec |
|
2.0 which is solved by training a contrastive task over masked latent speech representations and |
|
jointly learns the quantization of latents shared across all languages. |
|
|
|
[Arxiv Link](https://arxiv.org/pdf/2107.07402.pdf) |
|
|
|
[Original Repo](https://github.com/Open-Speech-EkStep/vakyansh-models) contains models in fairseq format. |
|
|
|
## Languages in the pretraining dataset |
|
|
|
| Language | Data (In Hrs) | |
|
|-----------|---------------| |
|
| Assamese | 254.9 | |
|
| Bengali | 331.3 | |
|
| Bodo | 26.9 | |
|
| Dogri | 17.1 | |
|
| English | 819.7 | |
|
| Gujarati | 336.7 | |
|
| Hindi | 4563.7 | |
|
| Kannada | 451.8 | |
|
| Kashmiri | 67.8 | |
|
| Konkani | 36.8 | |
|
| Maithili | 113.8 | |
|
| Malayalam | 297.7 | |
|
| Manipuri | 171.9 | |
|
| Marathi | 458.2 | |
|
| Nepali | 31.6 | |
|
| Odia | 131.4 | |
|
| Punjabi | 486.05 | |
|
| Sanskrit | 58.8 | |
|
| Santali | 6.56 | |
|
| Sindhi | 16 | |
|
| Tamil | 542.6 | |
|
| Telugu | 302.8 | |
|
| Urdu | 259.68 | |
|
|
|
## Repo for training: |
|
|
|
[Experimentation](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation) platform built on top of fairseq. |
|
|