File size: 1,989 Bytes
1e9cbae e8e5b69 1e9cbae f6f49e5 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae e8e5b69 1e9cbae 8016209 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 |
---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec2-conformer-rel-pos-large-960h-ft-4-gram
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Librispeech (clean)
type: librispeech_asr
args: en
metrics:
- name: Test WER
type: wer
value: 1.94
---
# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings + 4-gram
This model is identical to [Facebook's wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), but is
augmented with an English 4-gram. The `4-gram.arpa.gz` of [Librispeech's official ngrams](https://www.openslr.org/11) is used.
## Evaluation
This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram** on LibriSpeech's "clean" and "other" test data.
```python
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer
model_id = "patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram"
librispeech_eval = load_dataset("librispeech_asr", "other", split="test")
model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)
def map_to_pred(batch):
inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
inputs = {k: v.to("cuda") for k,v in inputs.items()}
with torch.no_grad():
logits = model(**inputs).logits
transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
batch["transcription"] = transcription
return batch
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
print(wer(result["text"], result["transcription"]))
```
*Result (WER)*:
| "clean" | "other" |
|---|---|
| 1.94 | 3.54 | |