File size: 2,364 Bytes
0b5509c 43b9e58 0b5509c 43b9e58 0b5509c 43b9e58 0b5509c 43b9e58 0b5509c 43b9e58 0b5509c 5569ff8 43b9e58 8558f54 43b9e58 873b685 0b5509c 873b685 0b5509c 873b685 9901e0b 873b685 0b5509c 9901e0b 0b5509c 8558f54 0b5509c 2502577 873b685 0b5509c e479cfb 0b5509c e479cfb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
---
language: es
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
---
# Wav2Vec2-Large-XLSR-53-Spanish-With-LM
This is a model copy of [Wav2Vec2-Large-XLSR-53-Spanish](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish)
that has language model support.
This model card can be seen as a demo for the [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) integration
with Transformers led by [this PR](https://github.com/huggingface/transformers/pull/14339). The PR explains in-detail how the
integration works.
In a nutshell: This PR adds a new Wav2Vec2WithLMProcessor class as drop-in replacement for Wav2Vec2Processor.
The only change from the existing ASR pipeline will be:
## Changes
```diff
import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio.functional as F
model_id = "patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm"
sample = next(iter(load_dataset("common_voice", "es", split="test", streaming=True)))
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()
model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
input_values = processor(resampled_audio, return_tensors="pt").input_values
with torch.no_grad():
logits = model(input_values).logits
-prediction_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(prediction_ids)
+transcription = processor.batch_decode(logits.numpy()).text
# => 'bien y qué regalo vas a abrir primero'
```
**Improvement**
This model has been compared on 512 speech samples from the Spanish Common Voice Test set and
gives a nice *20 %* performance boost:
The results can be reproduced by running *from this model repository*:
| Model | WER | CER |
| ------------- | ------------- | ------------- |
| patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm | **8.44%** | **2.93%** |
| jonatasgrosman/wav2vec2-large-xlsr-53-spanish | **10.20%** | **3.24%** |
```
bash run_ngram_wav2vec2.py 1 512
```
```
bash run_ngram_wav2vec2.py 0 512
```
with `run_ngram_wav2vec2.py` being
https://huggingface.co/patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm/blob/main/run_ngram_wav2vec2.py |