swayam01 committed on
Commit
9f71671
1 Parent(s): 9a7ce6a
Files changed (1)
  1. README.md +51 -0
README.md CHANGED
@@ -22,3 +22,54 @@ model-index:
  type: wer
  value: 24.17
  ---
+
+ # hindi-clsril-100
+
+ This model is [Harveenchadha/wav2vec2-pretrained-clsril-23-10k](https://huggingface.co/Harveenchadha/wav2vec2-pretrained-clsril-23-10k) fine-tuned on Hindi using the [Common Voice](https://huggingface.co/datasets/common_voice) and [openSLR](http://www.openslr.org/103/) Hindi datasets.
+ When using this model, make sure that your speech input is sampled at 16 kHz.
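To make the 16 kHz requirement concrete, here is a minimal, dependency-free sketch of bringing a mono signal to the expected rate. The helper name `resample_linear` is hypothetical, and plain linear interpolation is a simplification: it applies no anti-aliasing filter, so a proper resampler such as `torchaudio.transforms.Resample` is the better choice for real audio.

```python
def resample_linear(samples, orig_rate, target_rate=16_000):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if orig_rate == target_rate:
        return list(samples)
    # Number of output samples covering the same duration at the new rate.
    n_out = int(len(samples) * target_rate / orig_rate)
    step = orig_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of 48 kHz audio becomes one second of 16 kHz audio.
one_second = [0.0] * 48_000
print(len(resample_linear(one_second, 48_000)))  # 16000
```

Checking the sampling rate before inference is cheap; feeding the model audio at another rate generally degrades the transcription.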
+
+ ## Evaluation
+ The model can be used directly (with or without a language model) as follows:
+
+ ```python
+ #!pip install datasets==1.4.1
+ #!pip install transformers  # originally pinned to 4.4.0, which predates Wav2Vec2ProcessorWithLM
+ #!pip install torchaudio
+ #!pip install jiwer
+ #!pip install pyctcdecode  # Wav2Vec2ProcessorWithLM also needs kenlm
+
+ import re
+
+ import torch
+ import torchaudio
+ from datasets import load_dataset, load_metric
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
+
+ # Assumption: the repo id is inferred from the author and model name on this page.
+ model_id = "swayam01/hindi-clsril-100"
+ processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
+ model = Wav2Vec2ForCTC.from_pretrained(model_id).to("cuda")
+
+ test_dataset = load_dataset("common_voice", "hi", split="test")
+ wer = load_metric("wer")
+ chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\�\।\']'
+ # Common Voice clips are 48 kHz; the model expects 16 kHz input.
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+ # Read each audio file into an array and normalise its transcript.
+ def speech_file_to_array_fn(batch):
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+ # Run the model on a batch and decode the logits with the language model.
+ def evaluate(batch):
+     inputs = processor_with_lm(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+     # Logits must be on the CPU before they can be converted to NumPy.
+     batch["pred_strings"] = processor_with_lm.batch_decode(logits.cpu().numpy()).text
+     return batch
+
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+ ```
+
+ **Test Result**: 24.17 % (WER)
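For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate: the word-level edit distance (substitutions, insertions, deletions) summed over all sentences, divided by the total number of reference words. The function name `word_error_rate` is hypothetical, and this is an illustration of the definition rather than the exact code behind the reported figure.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits / total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # prev[j] holds the edit distance between the reference words seen
        # so far and the first j hypothesis words.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                cost = 0 if r == h else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_errors += prev[-1]
        total_words += len(ref_words)
    return total_errors / total_words


# One substituted word out of four reference words.
print("WER: {:.2f}".format(100 * word_error_rate(["a b c d"], ["a x c d"])))  # WER: 25.00
```

On this definition, a result of 24.17 % means roughly one word-level edit for every four reference words.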
+
+ ## Training
+
+ The Common Voice hi `train` and `validation` splits were used for training, together with the openSLR hi `train`, `validation` and `test` splits.
+
+ The training script is available in this [Colab notebook](https://colab.research.google.com/drive/1YL_csb3LRjqWybeyvQhZ-Hem2dtpvq_x?usp=sharing).