Commit e8e5b69 (1 parent: 1e9cbae), committed by patrickvonplaten
Update README.md

README.md CHANGED
@@ -9,7 +9,7 @@ tags:
 - hf-asr-leaderboard
 license: apache-2.0
 model-index:
-- name: wav2vec2-conformer-rel-pos-large-960h-ft
+- name: wav2vec2-conformer-rel-pos-large-960h-ft-4-gram
   results:
   - task:
       name: Automatic Speech Recognition
@@ -21,89 +21,50 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value:
+      value: --
 ---
 
-# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings
+# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings + 4-gram
 
-[Facebook's
-
-Wav2Vec2 Conformer with relative position embeddings, pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. When using the model make sure that your speech input is also sampled at 16Khz.
-
-[Paper (TODO)](https://arxiv.org/abs/2006.11477)
-
-Authors: ...
-
-**Abstract**
-
-...
-
-The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
-
-
-# Usage
-
-To transcribe audio files the model can be used as a standalone acoustic model as follows:
-
-```python
-from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC
-from datasets import load_dataset
-import torch
-
-# load model and processor
-processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-
-# load dummy dataset and read soundfiles
-ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
-
-# tokenize
-input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values
+This model is identical to [Facebook's wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), but is
+augmented with an English 4-gram. The `4-gram.arpa.gz` of [Librispeech's official ngrams](https://www.openslr.org/11) is used.
 
-
-logits = model(input_values).logits
+## Evaluation
 
-
-predicted_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(predicted_ids)
-```
-
-## Evaluation
-
-This code snippet shows how to evaluate **facebook/wav2vec2-conformer-rel-pos-large-960h-ft** on LibriSpeech's "clean" and "other" test data.
+This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram** on LibriSpeech's "clean" and "other" test data.
 
 ```python
 from datasets import load_dataset
-from transformers import
+from transformers import AutoModelForCTC, AutoProcessor
 import torch
 from jiwer import wer
 
+model_id = "patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram"
 
-librispeech_eval = load_dataset("librispeech_asr", "
+librispeech_eval = load_dataset("librispeech_asr", "other", split="test")
 
-model =
-processor =
+model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
+processor = AutoProcessor.from_pretrained(model_id)
 
 def map_to_pred(batch):
-    inputs = processor(batch["audio"]["array"],
-
-
-
+    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
+
+    inputs = {k: v.to("cuda") for k,v in inputs.items()}
+
     with torch.no_grad():
-        logits = model(
+        logits = model(**inputs).logits
 
-
-    transcription = processor.batch_decode(predicted_ids)
+    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
     batch["transcription"] = transcription
     return batch
 
 result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
 
-print(
+print(wer(result["text"], result["transcription"]))
 ```
 
 *Result (WER)*:
 
 | "clean" | "other" |
 |---|---|
-
+| -- | -- |
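
The evaluation snippet in this commit reports its score through `jiwer.wer`. As a rough illustration of what that number measures, word error rate can be sketched as the word-level edit distance divided by the number of reference words. The sketch below is not part of the commit and is not jiwer's implementation: the names `edit_distance` and `corpus_wer` are hypothetical, and pooling errors over all utterances before dividing is an assumption about how a corpus-level score is aggregated. It also applies no text normalization.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions, and insertions
    needed to turn the reference word sequence into the hypothesis."""
    # prev[j] holds the distance between the current reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word errors divided by total reference words (assumed pooling)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        ref_words += len(ref_tokens)
    if ref_words == 0:
        raise ValueError("references contain no words")
    return errors / ref_words
```

With these definitions, `corpus_wer(["a b c d"], ["a x c"])` counts one substitution and one deletion against four reference words, giving 0.5.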