Wav2Vec2-Large-XLSR-53 fine-tuned for automatic transcription of Sakhalin Ainu

This is a wav2vec2-large-xlsr-53 model that underwent continued pretraining on speech data in Hokkaido Ainu and Sakhalin Ainu (see wav2vec2-large-xlsr-53-pretrain-ain) and was then fine-tuned for automatic speech recognition on 10 hours of labeled Sakhalin Ainu data. For details, please refer to the paper cited below.
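The following is a minimal usage sketch, not taken from the paper or the original card. It assumes the checkpoint ships a CTC head and a processor (tokenizer plus feature extractor) loadable with the Hugging Face `transformers` classes `Wav2Vec2ForCTC` and `Wav2Vec2Processor`, and that input audio is mono at 16 kHz, as XLSR-53 expects. `MODEL_ID` is a hypothetical placeholder and `transcribe` is an illustrative helper name.

```python
# Hypothetical placeholder: replace with the actual Hub ID or a local path.
MODEL_ID = "path-or-hub-id-of-this-model"

EXPECTED_SAMPLING_RATE = 16_000  # XLSR-53 was pretrained on 16 kHz audio


def transcribe(waveform, sampling_rate=EXPECTED_SAMPLING_RATE, model_id=MODEL_ID):
    """Greedy CTC transcription of a single mono waveform (sequence of floats)."""
    if sampling_rate != EXPECTED_SAMPLING_RATE:
        raise ValueError(
            f"Expected {EXPECTED_SAMPLING_RATE} Hz audio, got {sampling_rate} Hz; "
            "resample before calling transcribe()."
        )

    # Imported lazily so the sampling-rate check works without the heavy dependencies.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Assumption: the checkpoint includes both model weights and processor files.
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: pick the most likely token per frame, then let the
    # tokenizer collapse repeats and remove CTC blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

With a waveform loaded and resampled to 16 kHz by an audio library of your choice, `transcribe(waveform)` should return the predicted transcription as a string.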

On our evaluation set, the model yielded the following results:

  • CER: 9.7%
  • WER: 29.8%
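For readers unfamiliar with these metrics, here is a minimal sketch of how character and word error rates are commonly computed: the edit distance between hypothesis and reference, summed over the evaluation set and divided by the total reference length. It is not the authors' scoring script. The function names are illustrative, the strings are toy data rather than Ainu, and counting spaces as characters in CER is an assumption that may differ from the paper's setup.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    if unit == "word":
        tokenize = str.split
    elif unit == "char":
        tokenize = list  # assumption: spaces count as characters
    else:
        raise ValueError("unit must be 'word' or 'char'")

    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    return 100.0 * errors / total


# Toy data: one substituted word out of four.
refs = ["a b c d"]
hyps = ["a x c d"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
```

In the toy example one of four words is wrong, so the WER is 25%, while the CER is lower because only one of seven characters differs. The same pattern explains why the CER reported above is much lower than the WER.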

Citation

When using the model, please cite the following paper:

@article{NOWAKOWSKI2023103148,
title = {Adapting multilingual speech representation model for a new, underresourced language through multilingual fine-tuning and continued pretraining},
journal = {Information Processing \& Management},
volume = {60},
number = {2},
pages = {103148},
year = {2023},
issn = {0306-4573},
doi = {10.1016/j.ipm.2022.103148},
url = {https://www.sciencedirect.com/science/article/pii/S0306457322002497},
author = {Karol Nowakowski and Michal Ptaszynski and Kyoko Murasaki and Jagna Nieuważny},
keywords = {Automatic speech transcription, ASR, Wav2vec 2.0, Pretrained transformer models, Speech representation models, Cross-lingual transfer, Language documentation, Endangered languages, Underresourced languages, Sakhalin Ainu},
abstract = {In recent years, neural models learned through self-supervised pretraining on large scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has a potential for facilitating tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility for adapting an existing multilingual wav2vec 2.0 model for a new language, focusing on actual fieldwork data from a critically endangered tongue: Ainu. Specifically, we (i) examine the feasibility of leveraging data from similar languages also in fine-tuning; (ii) verify whether the model’s performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language and leads to considerable reduction in error rates. Furthermore, we find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have positive impact on speech recognition performance when there is very little labeled data in the target language.}
}