techiaith/wav2vec2-base-cy

	---
	license: apache-2.0
	language:
	- cy
	tags:
	- speech
	- pre-training
	- wav2vec2
	---

	# Better Pre-trained wav2vec2 models for Welsh Speech Recognition

	At the moment, the best wav2vec2 models for Welsh speech recognition are obtained by
	fine-tuning the [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and
	[xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) pre-trained models
	by Facebook/Meta AI.

	This model is an experiment in building better pre-trained models from more
	Welsh language speech, which could in turn lower WER scores even further in subsequent
	fine-tuned models. __It is of very limited use for fine-tuning on any useful downstream
	task such as speech recognition__.

	## First Attempts with Self-Supervised Learning

	Previous attempts drew heavily on the resources and documentation from the HuggingFace examples
	for creating pre-trained wav2vec2 models from scratch:

	https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining

	We used only 4000 hours of Welsh and English speech audio collected from various channels on
	YouTube. The training set contained approximately 25% Welsh speech and 75%
	English language speech. The English language data, however, contains examples of Welsh-accented
	English speech and was therefore retained for pre-training.
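As a rough worked example of what that split means in hours, the short calculation below assumes that the 25%/75% proportions apply to the full 4000 hours; the figures are approximate, as the percentages themselves are.

```python
# Approximate hours per language, assuming the 25%/75% split
# applies to the whole 4000-hour collection (an assumption).
TOTAL_HOURS = 4000
WELSH_SHARE = 0.25
ENGLISH_SHARE = 0.75

welsh_hours = TOTAL_HOURS * WELSH_SHARE
english_hours = TOTAL_HOURS * ENGLISH_SHARE

print(f"Welsh: ~{welsh_hours:.0f} h, English: ~{english_hours:.0f} h")
# prints "Welsh: ~1000 h, English: ~3000 h"
```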

	The results of our self-supervised attempts can be accessed from revisions `22.10` and `24.03` of
	this model repository.
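A minimal sketch of how one of those revisions might be loaded with the `transformers` library, using the `revision` argument of `from_pretrained`. It assumes that each revision contains a configuration and weights that `AutoModel` can load; the helper name `load_self_supervised_checkpoint` is hypothetical and used only for illustration.

```python
REPO_ID = "techiaith/wav2vec2-base-cy"

# Revisions named in this model card as holding the self-supervised results.
SELF_SUPERVISED_REVISIONS = ("22.10", "24.03")


def load_self_supervised_checkpoint(revision: str):
    """Load the checkpoint stored under one of the listed revisions.

    Hypothetical helper: assumes the revision holds a config and weights
    in a format that transformers' AutoModel can read.
    """
    if revision not in SELF_SUPERVISED_REVISIONS:
        raise ValueError(
            f"Unknown revision {revision!r}; expected one of {SELF_SUPERVISED_REVISIONS}"
        )
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import AutoModel

    # `revision` selects a branch, tag or commit of the Hub repository.
    return AutoModel.from_pretrained(REPO_ID, revision=revision)


# Example (downloads the checkpoint from the Hub):
# model = load_self_supervised_checkpoint("24.03")
```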


	## Fine-tuning Meta AI Models with a Very Weak Dataset

	The latest attempt investigates reverting to fine-tuning Meta AI's pre-trained models (xls-r-1b)
	on the YouTube speech data, which was transcribed automatically with the best Whisper-based ASR
	model for Welsh and English: https://huggingface.co/techiaith/whisper-large-v3-ft-cv-cy-en

	The transcriptions are of course not totally correct, which is why we term it a very weak
	dataset. But since it contains a much larger collection of speech than [any other dataset for
	Welsh](https://huggingface.co/collections/techiaith/speech-recognition-datasets-672df8ffb3f7da8ed8294ce2),
	we nevertheless wanted to experiment with what impact (if any) the speech audio may still have on
	the wav2vec2 encoders.
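The automatic transcription step described above could be sketched roughly as follows, using the `transformers` ASR pipeline with the Whisper model linked earlier. This is an illustration under assumptions and not the actual script used: the helper names, the record fields (`audio`, `sentence`, `label_source`) and the 30-second chunking are all hypothetical choices.

```python
from pathlib import Path

ASR_MODEL_ID = "techiaith/whisper-large-v3-ft-cv-cy-en"


def build_transcriber():
    """Create an ASR pipeline for the Whisper model (downloads the weights)."""
    # Imported lazily so the other helpers work without transformers installed.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=ASR_MODEL_ID,
        chunk_length_s=30,  # assumed chunk size for long recordings
    )


def make_weak_label(audio_path, asr_result):
    """Turn one pipeline result into a weakly labelled training record."""
    return {
        "audio": str(Path(audio_path)),
        "sentence": asr_result["text"].strip(),
        # Record where the label came from, since it is not human-verified.
        "label_source": ASR_MODEL_ID,
    }


def transcribe_files(audio_paths, transcriber):
    """Transcribe each file and collect the weak labels."""
    return [make_weak_label(p, transcriber(str(p))) for p in audio_paths]


# Example (requires the model weights and ffmpeg for audio decoding):
# records = transcribe_files(["clip.wav"], build_transcriber())
```

Keeping the label source in each record makes it easy to separate automatically transcribed clips from any human-checked ones later on.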

	## Conclusion

	As already mentioned above, the model is not useful for any downstream task, and will not be until
	we have collected many more hours of speech. In the meantime, we have identified issues and
	limitations in our YouTube data, such as the quality of the speech audio and of the automatic
	transcriptions. Further work is required to correct those issues and/or to determine whether it is a
	feasible dataset.