xls-asr-vi-40h-1B

This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on 40 hours of FPT Open Speech Dataset (FOSD) and Common Voice 7.0.

	VIVOS	COMMON VOICE 7.0	COMMON VOICE 8.0
without LM	25.93	34.21
with 4-grams LM	24.11	25.84	31.158

	VIVOS	COMMON VOICE 7.0	COMMON VOICE 8.0
without LM	9.24	19.94
with 4-grams LM	10.37	12.96	16.179

Evaluation

Please use the eval.py file to run the evaluation

python eval.py --model_id geninhu/xls-asr-vi-40h-1B --dataset mozilla-foundation/common_voice_7_0 --config vi --split test --log_outputs

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Wer
4.6222	1.85	1500	5.9479	0.5474
1.1362	3.7	3000	7.9799	0.5094
0.7814	5.56	4500	5.0330	0.4724
0.6281	7.41	6000	2.3484	0.5020
0.5472	9.26	7500	2.2495	0.4793
0.4827	11.11	9000	1.1530	0.4768
0.4327	12.96	10500	1.6160	0.4646
0.3989	14.81	12000	3.2633	0.4703
0.3522	16.67	13500	2.2337	0.4708
0.3201	18.52	15000	3.6879	0.4565
0.2899	20.37	16500	5.4389	0.4599
0.2776	22.22	18000	3.5284	0.4537
0.2574	24.07	19500	2.1759	0.4649
0.2378	25.93	21000	3.3901	0.4448
0.217	27.78	22500	1.1632	0.4565
0.2115	29.63	24000	1.7441	0.4232
0.1959	31.48	25500	3.4992	0.4304
0.187	33.33	27000	3.6163	0.4369
0.1748	35.19	28500	3.6038	0.4467
0.17	37.04	30000	2.9708	0.4362
0.159	38.89	31500	3.2045	0.4279
0.153	40.74	33000	3.2427	0.4287
0.1463	42.59	34500	3.5439	0.4270
0.139	44.44	36000	3.9381	0.4150
0.1352	46.3	37500	4.1744	0.4092
0.1369	48.15	39000	4.2279	0.4154
0.1273	50.0	40500	4.1691	0.4133