---
language: en
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
---

# Wav2Vec2-Base-960h

This repository is a reimplementation of [Facebook's official wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h).
The official release does not document how to convert the fairseq [pretrained model](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20) into a `pytorch_model.bin` file, so we rebuild `pytorch_model.bin` from the pretrained checkpoint ourselves.
The conversion method is as follows:

```bash
pip install transformers[sentencepiece]
pip install fairseq -U

git clone https://github.com/huggingface/transformers.git
cp transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py .

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt -O ./wav2vec_small_960h.pt
mkdir dict
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt -P ./dict

mkdir outputs
python convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path ./outputs --checkpoint_path ./wav2vec_small_960h.pt --dict_path ./dict
```
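
Once the script finishes, the converted weights are written to `./outputs`. Below is a minimal sanity check for the conversion, assuming the paths used above; the tokenizer is loaded from the published Hub model (the dump folder may contain only the model weights), and `sample.flac` is a placeholder for any 16 kHz mono file:

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC

# load the freshly converted checkpoint from the dump folder used above
model = Wav2Vec2ForCTC.from_pretrained("./outputs")
# the tokenizer/vocab comes from the published model on the Hub
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

# transcribe a single 16 kHz mono file (placeholder path)
speech, sample_rate = sf.read("sample.flac")
input_values = tokenizer(speech, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(predicted_ids)[0])
```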

# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch

# load model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# define function to read in sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# tokenize the first two samples (batch size 2)
input_values = tokenizer(ds["speech"][:2], return_tensors="pt", padding="longest").input_values

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)
```
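
Note that the model was pretrained and fine-tuned on 16 kHz speech, so the input audio should also be sampled at 16 kHz. A minimal sketch for downmixing and resampling beforehand (using `librosa`, which is not required by the snippet above; the file path is a placeholder):

```python
import librosa
import soundfile as sf

# the model expects mono speech sampled at 16 kHz
speech, sample_rate = sf.read("audio.wav")  # placeholder path
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # downmix stereo to mono
if sample_rate != 16_000:
    speech = librosa.resample(speech, orig_sr=sample_rate, target_sr=16_000)
```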

## Evaluation

This code snippet shows how to evaluate **facebook/wav2vec2-base-960h** on LibriSpeech's "clean" and "other" test data (the snippet below uses the "clean" split; see the note after it for "other").

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import soundfile as sf
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    input_values = tokenizer(batch["speech"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])

print("WER:", wer(result["text"], result["transcription"]))
```
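
To reproduce the "other" column of the results table below, the same script can be re-run with the corresponding test split loaded instead:

```python
librispeech_eval = load_dataset("librispeech_asr", "other", split="test")
```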

*Result (WER)*:

| "clean" | "other" |
|---|---|
| 3.4 | 8.6 |


# Reference

[Facebook's Wav2Vec2 blog post](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

[Facebook's Wav2Vec2 on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h)

[wav2vec 2.0 paper](https://arxiv.org/abs/2006.11477)