---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec2-conformer-rel-pos-large-960h-ft-4-gram
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Librispeech (clean)
      type: librispeech_asr
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.94
---

# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings + 4-gram

This model is identical to [Facebook's wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), but is
augmented with an English 4-gram language model. The `4-gram.arpa.gz` file from [Librispeech's official ngrams](https://www.openslr.org/11) is used.
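
As a rough illustration of what "4-gram" means here, the sketch below lists the (context, word) pairs that a 4-gram model would score for one transcript: each word is conditioned on at most the three words before it. The function name `ngram_contexts` and the `<s>`/`</s>` boundary markers are assumptions made for this example; it is not how the ARPA file is read or how the decoder is implemented.

```python
def ngram_contexts(sentence, n=4):
    """Return (context, word) pairs, with at most n - 1 words of context each."""
    # Hypothetical boundary markers, used only for this illustration.
    words = ["<s>"] + sentence.split() + ["</s>"]
    pairs = []
    for i in range(1, len(words)):
        # Keep only the n - 1 words immediately before position i.
        context = tuple(words[max(0, i - (n - 1)):i])
        pairs.append((context, words[i]))
    return pairs


for context, word in ngram_contexts("THE CAT SAT DOWN"):
    print(" ".join(context), "->", word)
```

During decoding, a score of this kind is combined with the acoustic model's CTC output so that word sequences the language model finds more plausible are preferred.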
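
## Evaluation

This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram** on LibriSpeech's test data. As written, it loads the "other" configuration; pass `"clean"` to `load_dataset` instead to evaluate on the "clean" test set.
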
```python
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

model_id = "patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram"

librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

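def map_to_pred(batch):
    # Turn the raw 16 kHz waveform into model inputs.
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    # Move every input tensor to the GPU the model lives on.
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Decode the logits with the 4-gram language model; `.text` holds one string per sample.
    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
    batch["transcription"] = transcription
    return batch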

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print(wer(result["text"], result["transcription"]))
```

*Result (WER, in %)*:

| "clean" | "other" |
|---|---|
| 1.94 | 3.54 |
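
For readers unfamiliar with the metric, here is a minimal sketch of how a corpus-level word error rate can be computed: the total word-level edit distance (substitutions, insertions, deletions) divided by the total number of reference words. The function `word_error_rate` is a hypothetical helper written for this illustration and is not `jiwer`'s implementation. It returns a fraction, so multiply by 100 to compare with the percentages in the table.

```python
def word_error_rate(references, hypotheses):
    """Total word-level edit distance divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping one row of the table at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                cost = 0 if r == h else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # match or substitution
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


score = word_error_rate(["THE CAT SAT ON THE MAT"], ["THE CAT SIT ON MAT"])
print(f"{100 * score:.2f}")
```

In this toy example one word is substituted and one is dropped, giving 2 edits over 6 reference words.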