---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec2-conformer-rope-large-960h-ft-4-gram
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.88
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.57
---

# Wav2Vec2-Conformer-Large-960h with Rotary Position Embeddings + 4-gram

This model is identical to [Facebook's wav2vec2-conformer-rope-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rope-large-960h-ft), but is augmented with an English 4-gram language model. It uses the `4-gram.arpa.gz` file from [LibriSpeech's official n-grams](https://www.openslr.org/11).
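
To give a rough intuition for what an n-gram adds at decoding time, the toy sketch below ranks two candidate transcriptions by combining an acoustic score with a weighted language-model score. This is only an illustration of the general idea: the candidates, all scores, the `fused_score` helper, and the weights `alpha` and `beta` are hypothetical and are not taken from this model or its decoder.

```python
# Hypothetical candidates: (text, acoustic log-probability, LM log-probability).
# All numbers are made up for illustration only.
candidates = [
    ("I SCREAM FOR ICE CREAM", -4.1, -9.0),
    ("EYE SCREAM FOR ICE CREAM", -3.9, -15.0),
]


def fused_score(acoustic_logp, lm_logp, num_words, alpha=0.5, beta=1.0):
    """Acoustic score plus a weighted LM score and a per-word bonus.

    alpha and beta are hypothetical weights chosen for this example.
    """
    return acoustic_logp + alpha * lm_logp + beta * num_words


def best_by_acoustic(cands):
    # Pick the candidate with the highest acoustic score alone.
    return max(cands, key=lambda c: c[1])[0]


def best_by_fused(cands):
    # Pick the candidate with the highest combined score.
    return max(cands, key=lambda c: fused_score(c[1], c[2], len(c[0].split())))[0]


print(best_by_acoustic(candidates))  # EYE SCREAM FOR ICE CREAM
print(best_by_fused(candidates))     # I SCREAM FOR ICE CREAM
```

In this made-up case the acoustically preferred candidate loses once the language-model score is taken into account, because the language model considers its word sequence much less likely.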

## Evaluation

The code snippet below evaluates **patrickvonplaten/wav2vec2-conformer-rope-large-960h-ft-4-gram** on LibriSpeech's "other" test data. To evaluate on the "clean" test data, pass `"clean"` instead of `"other"` to `load_dataset`.

```python
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

model_id = "patrickvonplaten/wav2vec2-conformer-rope-large-960h-ft-4-gram"

# Use "clean" instead of "other" to evaluate on the clean test set
librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
# Loading the processor with its 4-gram decoder requires `pyctcdecode` and `kenlm`
processor = AutoProcessor.from_pretrained(model_id)

def map_to_pred(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    # Move all input tensors to the same device as the model
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Decode with the n-gram language model; `.text` holds the decoded strings
    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

# jiwer returns WER as a fraction; multiply by 100 to compare with the results table
print(wer(result["text"], result["transcription"]))
```

*Result (WER in %)*:

| "clean" | "other" |
|---|---|
| 1.88 | 3.57 |
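
For readers unfamiliar with the metric, the sketch below shows one common way to compute a word error rate for a single sentence pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, then scaled to percent as in the table. The `word_error_rate` helper and the example sentences are hypothetical, and `jiwer`'s implementation may differ in details such as how it aggregates over many sentences.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes a non-empty reference. Returns a fraction, not a percentage.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of six reference words
reference = "THE CAT SAT ON THE MAT"
hypothesis = "THE CAT SAT ON A MAT"
rate = word_error_rate(reference, hypothesis)
print(f"{100 * rate:.2f}")  # prints 16.67
```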