DewiBrynJones committed on
Commit 3842ddf · verified · 1 Parent(s): bc79fa0

Update README.md

Files changed (1)
  1. README.md +43 -7
README.md CHANGED
@@ -1,20 +1,56 @@
1
  ---
2
  license: apache-2.0
3
  language:
4
- - cy
5
  tags:
6
- - speech
7
  ---
8
 
9
- # Pre-training wav2vec2 models for Welsh speech recognition
10
 
11
- At the moment, the best Welsh speech recognition models are achieved from fine-tuning https://huggingface.co/facebook/wav2vec2-large-xlsr-53 and https://huggingface.co/facebook/wav2vec2-xls-r-1b models by Facebook/Meta AI.
12
 
13
- This model is experimental in investigating pretraining better models with more Welsh language speech that could lower WER scores even further in subsequently fine-tuned models. The work draws heavily on resources and documentation from the HuggingFace examples:
14
 
15
  https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining
16
 
17
- This base model has been pre-trained with only approximately 4000 hours of Welsh and English speech collected from various channels on YouTube. The corpus contains only 25% Welsh language speech. English language speech contains Welsh-accented English speech and therefore has been retained for pre-training.
18
 
19
- Until we have collected many more hours of speech, this pre-trained model will be of limited use for fine-tuning any useful downstream tasks.
20
 
 
1
  ---
2
  license: apache-2.0
3
  language:
4
+ - cy
5
  tags:
6
+ - speech
7
+ - pre-training
8
+ - wav2vec2
9
  ---
10
 
11
+ # Better Pre-trained wav2vec2 models for Welsh Speech Recognition
12
 
13
+ At the moment, the best Welsh speech recognition wav2vec2 models are obtained by
14
+ fine-tuning [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and
15
+ [xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) pre-trained models
16
+ by Facebook/Meta AI.
17
 
18
+ This model is an experiment in building better pre-trained models with more
19
+ Welsh language speech that could in turn lower WER scores even further in subsequent
20
+ fine-tuned models. __It is of very limited use for fine-tuning on any useful downstream
21
+ task such as speech recognition__.
22
+
23
+ ## First Attempts with Self-Supervised Learning
24
+
25
+ Previous attempts drew heavily on the resources and documentation from the HuggingFace examples
26
+ for creating pre-trained wav2vec2 models from scratch:
27
 
28
  https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining
29
 
30
+ We used only 4000 hours of Welsh and English speech audio collected from various channels on
31
+ YouTube. The training set contained a balance of approximately 25% Welsh speech and 75%
32
+ English language speech. The English language data however contains examples of Welsh-accented
33
+ English speech and therefore was retained for pre-training.
34
+
35
+ The results of our self-supervised attempts can be accessed from revisions `22.10` and `24.03` of
36
+ this model repository.
37
+
38
+
39
+ ## Attempting Fine-tuning of Meta AI models with a very weak data set
40
+
41
+ The latest attempt investigates reverting to fine-tuning Meta AI's pre-trained models (xls-r-1b)
42
+ on the YouTube speech data, transcribed automatically with the best Whisper-based ASR
43
+ models for Welsh and English: https://huggingface.co/techiaith/whisper-large-v3-ft-cv-cy-en
44
+
45
+ The transcriptions are of course not totally correct, which is why we have termed it a very weak data
46
+ set. But since it is a much larger collection of speech than [any other dataset for
47
+ Welsh](https://huggingface.co/collections/techiaith/speech-recognition-datasets-672df8ffb3f7da8ed8294ce2)
48
+ we nevertheless wanted to experiment with what impact (if any) the speech audio may still have on
49
+ the wav2vec2 encoders.
50
 
51
+ ## Conclusion
52
 
53
+ As already mentioned above, the model is not useful for any practical purpose. We have identified many issues
54
+ and limitations, for example the quality of the YouTube data itself and in particular that of the
55
+ automatic transcriptions. Further work is required to confirm if the data and/or approaches attempted
56
+ thus far are viable and feasible.
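
The README measures progress in WER (word error rate) but does not define it. As a minimal sketch, assuming the usual definition (word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words), it could be computed as below. The function name `word_error_rate` and the Welsh example sentence are illustrative only and do not come from the repository; real evaluations typically also normalise case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word and one misspelt word out of five.
reference = "mae hi yn braf heddiw"
hypothesis = "mae hi braf heddi"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Under this definition, a transcript with many insertions can score above 1.0, so WER is not strictly a percentage of words that were wrong.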