---
license: apache-2.0
language:
- cy
tags:
- speech
- pre-training
- wav2vec2
---
# Better Pre-trained wav2vec2 models for Welsh Speech Recognition
At the moment, the best wav2vec2 models for Welsh speech recognition are obtained by
fine-tuning the [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and
[xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) pre-trained models
by Facebook/Meta AI.

This model is an experiment in building better pre-trained models from more
Welsh language speech, which could in turn lower WER scores even further in subsequent
fine-tuned models. __It is of very limited use for fine-tuning on any useful downstream
task such as speech recognition__.
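
For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for this model; the function name and the example sentences are ours.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("mae hi'n braf heddiw", "mae hi braf heddiw"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.
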
## First Attempts with Self-Supervised Learning
Previous attempts drew heavily on the resources and documentation from the HuggingFace examples
for creating pre-trained wav2vec2 models from scratch:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining

We used only 4000 hours of Welsh and English speech audio collected from various channels on
YouTube. The training set contained approximately 25% Welsh speech and 75%
English speech. The English data, however, contains examples of Welsh-accented
English speech and was therefore retained for pre-training.
The results of our self-supervised attempts can be accessed from revisions `22.10` and `24.03` of
this model repository.
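
The revisions named above can be selected with the `revision` argument of `from_pretrained` in the `transformers` library. The following is a minimal sketch under two assumptions: that the repository id is `techiaith/wav2vec2-base-cy` (the owner is not stated in this README, so treat it as a placeholder), and that each revision holds weights loadable as a `Wav2Vec2Model`.

```python
# Assumption: the owner/namespace is not given in this README; adjust as needed.
REPO_ID = "techiaith/wav2vec2-base-cy"

# Revisions this README names for the self-supervised attempts.
SELF_SUPERVISED_REVISIONS = ("22.10", "24.03")


def load_checkpoint(revision: str, repo_id: str = REPO_ID):
    """Load the encoder from one of the self-supervised revisions."""
    if revision not in SELF_SUPERVISED_REVISIONS:
        raise ValueError(
            f"unknown revision {revision!r}; expected one of {SELF_SUPERVISED_REVISIONS}"
        )
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import Wav2Vec2Model

    return Wav2Vec2Model.from_pretrained(repo_id, revision=revision)


if __name__ == "__main__":
    model = load_checkpoint("24.03")  # downloads weights from the Hub
    print(model.config.hidden_size)
```
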
## Attempting to Fine-tune Meta AI Models with a Very Weak Data Set
The latest attempt investigates reverting to fine-tuning Meta AI's pre-trained model (xls-r-1b)
with the YouTube speech data, which was transcribed automatically with the best Whisper-based ASR
model for Welsh and English: https://huggingface.co/techiaith/whisper-large-v3-ft-cv-cy-en

The transcriptions are of course not entirely correct, which is why we term it a very weak data
set. But since it is a much larger collection of speech than [any other dataset for
Welsh](https://huggingface.co/collections/techiaith/speech-recognition-datasets-672df8ffb3f7da8ed8294ce2),
we nevertheless wanted to experiment with what impact (if any) the speech audio may still have on
the wav2vec2 encoders.
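
The automatic transcription step can be sketched roughly as follows. This is an illustration and not the pipeline we actually ran: it assumes the audio has already been cut into short clips (under about 30 seconds), the helper names and the `audio`/`sentence` keys are hypothetical, and only the Whisper model id comes from the link above.

```python
from typing import Callable, Dict, Iterable, List

WHISPER_MODEL_ID = "techiaith/whisper-large-v3-ft-cv-cy-en"


def build_whisper_transcriber(model_id: str = WHISPER_MODEL_ID) -> Callable[[str], str]:
    """Return a function that maps an audio file path to its transcript."""
    # Imported lazily; requires the transformers library and a model download.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)

    def transcribe(path: str) -> str:
        return asr(path)["text"].strip()

    return transcribe


def pseudo_label(
    audio_paths: Iterable[str], transcribe: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Pair each clip with its automatic transcript, skipping empty outputs."""
    rows = []
    for path in audio_paths:
        text = transcribe(path).strip()
        if text:  # drop clips where the ASR model produced nothing
            rows.append({"audio": path, "sentence": text})
    return rows


if __name__ == "__main__":
    transcriber = build_whisper_transcriber()
    print(pseudo_label(["clip_0001.wav"], transcriber))  # hypothetical file name
```

Passing the transcriber in as a callable keeps the labelling loop independent of the ASR model, so a different model or a filtering step can be substituted without changing it.
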
## Conclusion
As mentioned above, the model is of no practical use until many more hours of speech have been collected.
In the meantime, we have identified issues and limitations in our YouTube data, such as the quality
of the speech audio and of the automatic transcriptions. Further work is required to correct those issues and/or
to determine whether it is a feasible dataset.