nguyenvulebinh committed on
Commit
5af41c2
1 Parent(s): e01589f

Update README.md

Files changed (1)
  1. README.md +25 -73
README.md CHANGED
@@ -2,6 +2,7 @@
  language: vi
  datasets:
  - VLSP 2020 ASR dataset
+ - VIVOS
  tags:
  - audio
  - automatic-speech-recognition
@@ -15,100 +16,51 @@ widget:
  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav
  ---

- # Wav2Vec2-Base-960h
+ # Wav2Vec2-Base-250h for the Vietnamese language

  [Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

- The base model pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. When using the model
+ The base model pretrained and fine-tuned on 250 hours of VLSP ASR dataset on 16kHz sampled speech audio. When using the model
  make sure that your speech input is also sampled at 16Khz.

- [Paper](https://arxiv.org/abs/2006.11477)
-
- Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
-
- **Abstract**
-
- We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
-
- The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
-
-
  # Usage

  To transcribe audio files the model can be used as a standalone acoustic model as follows:

  ```python
- from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
- from datasets import load_dataset
- import soundfile as sf
- import torch
-
- # load model and tokenizer
- processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
- model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
-
- # define function to read in sound file
- def map_to_array(batch):
-     speech, _ = sf.read(batch["file"])
-     batch["speech"] = speech
-     return batch
-
- # load dummy dataset and read soundfiles
- ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
- ds = ds.map(map_to_array)
-
- # tokenize
- input_values = processor(ds["speech"][:2], return_tensors="pt", padding="longest").input_values # Batch size 1
-
- # retrieve logits
- logits = model(input_values).logits
-
- # take argmax and decode
- predicted_ids = torch.argmax(logits, dim=-1)
- transcription = processor.batch_decode(predicted_ids)
- ```
-
- ## Evaluation
-
- This code snippet shows how to evaluate **facebook/wav2vec2-base-960h** on LibriSpeech's "clean" and "other" test data.
-
- ```python
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
  from datasets import load_dataset
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
  import soundfile as sf
  import torch
- from jiwer import wer
-
-
- librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

- model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
- processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
+ # load model and tokenizer
+ processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
+ model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

+ # define function to read in sound file
  def map_to_array(batch):
      speech, _ = sf.read(batch["file"])
      batch["speech"] = speech
      return batch

- librispeech_eval = librispeech_eval.map(map_to_array)
+ # load dummy dataset and read soundfiles
+ ds = map_to_array({
+     "file": 'audio-test/t1_0001-00010.wav'
+ })

- def map_to_pred(batch):
-     input_values = processor(batch["speech"], return_tensors="pt", padding="longest").input_values
-     with torch.no_grad():
-         logits = model(input_values.to("cuda")).logits
+ # tokenize
+ input_values = processor(ds["speech"], return_tensors="pt", padding="longest").input_values # Batch size 1

-     predicted_ids = torch.argmax(logits, dim=-1)
-     transcription = processor.batch_decode(predicted_ids)
-     batch["transcription"] = transcription
-     return batch
-
- result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])
-
- print("WER:", wer(result["text"], result["transcription"]))
- ```
+ # retrieve logits
+ logits = model(input_values).logits

- *Result (WER)*:
+ # take argmax and decode
+ predicted_ids = torch.argmax(logits, dim=-1)
+ transcription = processor.batch_decode(predicted_ids)
+ ```
+
+ *Result WER (with 4-grams LM)*:

- | "clean" | "other" |
- |---|---|
- | 3.4 | 8.6 |
+ | "VIVOS" | "VLSP-T1" | "VLSP-T2" |
+ |---|---|---|
+ | 6.1 | 9.1 | 40.8 |