---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec2-conformer-rope-large-960h-ft-4-gram
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.88
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.57
---
# Wav2Vec2-Conformer-Large-960h with Rotary Position Embeddings + 4-gram
This model is identical to [Facebook's wav2vec2-conformer-rope-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rope-large-960h-ft), but is
augmented with an English 4-gram language model for beam-search decoding. The `4-gram.arpa.gz` file from [LibriSpeech's official n-grams](https://www.openslr.org/11) is used.
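For reference, such an LM-boosted processor can be assembled from the base checkpoint with [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode). Below is a minimal sketch, assuming the ARPA file has been downloaded and unzipped to a local `4-gram.arpa` (the path is a placeholder, not part of this repository):
```python
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

# Tokenizer and feature extractor come from the LM-free base checkpoint.
base_processor = AutoProcessor.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")

# Order the vocabulary by token id so the decoder labels line up with the CTC logits.
vocab = base_processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Build a KenLM-backed CTC beam-search decoder from the unzipped ARPA file.
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="4-gram.arpa")  # assumed local path

# Bundle everything into a processor whose batch_decode applies the 4-gram.
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=base_processor.feature_extractor,
    tokenizer=base_processor.tokenizer,
    decoder=decoder,
)
```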
## Evaluation
This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-conformer-rope-large-960h-ft-4-gram** on LibriSpeech's "clean" and "other" test data.
```python
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

model_id = "patrickvonplaten/wav2vec2-conformer-rope-large-960h-ft-4-gram"

# Use config "clean" instead of "other" to evaluate on the clean test set.
librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

def map_to_pred(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits

    # batch_decode runs the 4-gram-boosted beam search on the raw logits.
    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print(wer(result["text"], result["transcription"]))
```
*Result (WER)*:
| "clean" | "other" |
|---|---|
| 1.88 | 3.57 |
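## Usage
For quick single-utterance transcription, a minimal sketch (the LibriSpeech validation sample and CPU-only inference are illustrative choices, not taken from this repository):
```python
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch

model_id = "patrickvonplaten/wav2vec2-conformer-rope-large-960h-ft-4-gram"
model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Take one 16 kHz sample from the "clean" validation set.
sample = load_dataset("librispeech_asr", "clean", split="validation[:1]")[0]

inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The 4-gram-boosted beam search runs inside batch_decode.
print(processor.batch_decode(logits.numpy()).text[0])
```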