language: es
datasets:
  - common_voice
metrics:
  - wer
  - cer
tags:
  - audio
  - automatic-speech-recognition
  - speech
  - xlsr-fine-tuning-week
license: apache-2.0

Wav2Vec2-Large-XLSR-53-Spanish-With-LM

This is a copy of the Wav2Vec2-Large-XLSR-53-Spanish model with added language model support.

This model card can be seen as a demo of the pyctcdecode integration with Transformers introduced by this PR. The PR explains in detail how the integration works.

In a nutshell: This PR adds a new Wav2Vec2ProcessorWithLM class as a drop-in replacement for Wav2Vec2Processor.

The only change from the existing ASR pipeline will be:

import torch
import torchaudio.functional as F
-from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from datasets import load_dataset

ds = load_dataset("common_voice", "es", split="test", streaming=True)

sample = next(iter(ds))

resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
-processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
+processor = Wav2Vec2ProcessorWithLM.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")

input_values = processor(resampled_audio, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

-prediction_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(prediction_ids)
+transcription = processor.batch_decode(logits.cpu().numpy()).text

print(transcription)
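
The last change in the diff is the important one: the LM-backed processor receives the full logits rather than argmax ids, because pyctcdecode runs a beam search over the per-frame scores instead of committing to the single best token per frame. For comparison, the removed greedy path roughly amounts to the sketch below. The vocabulary, blank index, and frame scores are hypothetical toy values chosen for illustration, not the model's real tokenizer or output.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "h", "o", "l", "a", " "]
BLANK_ID = 0


def greedy_ctc_decode(frame_scores, vocab=VOCAB, blank_id=BLANK_ID):
    """Argmax per frame, merge consecutive repeats, then drop blanks."""
    # Pick the highest-scoring token id for every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # CTC rule: consecutive duplicates collapse into one token.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # Blank tokens only separate repeats and are removed from the output.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


def one_hot(index, size=len(VOCAB)):
    """Build a toy score vector that puts all mass on one token."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames: h h <blank> o l l <blank> a
toy_scores = [one_hot(i) for i in [1, 1, 0, 2, 3, 3, 0, 4]]
print(greedy_ctc_decode(toy_scores))
```

Greedy decoding never looks at neighbouring words, so an acoustically plausible but misspelled word cannot be corrected. A language model can only help if it sees the alternatives, which is why the full logits are passed on.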

Improvement

This model has been compared on 512 speech samples from the Spanish Common Voice test set. Adding the language model lowers WER from 10.20% to 8.44% and CER from 3.24% to 2.93%, a relative WER reduction of roughly 17%:

| Model | WER | CER |
| --- | --- | --- |
| patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm | 8.44% | 2.93% |
| jonatasgrosman/wav2vec2-large-xlsr-53-spanish | 10.20% | 3.24% |

The results can be reproduced by running the following commands from this model repository:

python run_ngram_wav2vec2.py 1 512
python run_ngram_wav2vec2.py 0 512

where run_ngram_wav2vec2.py is https://huggingface.co/patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm/blob/main/run_ngram_wav2vec2.py
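
To make the reported numbers easier to interpret, here is a minimal sketch of how WER and CER are commonly defined (edit distance divided by reference length) and how a relative improvement can be derived from the two rows of the results table. The helper names are hypothetical, the example sentences are toy inputs, and the evaluation script may aggregate errors over the whole test set differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance normalised by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance normalised by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline, improved):
    """Fraction by which an error rate dropped relative to the baseline."""
    return (baseline - improved) / baseline


# Error rates taken from the results table, in percent.
print(f"WER reduction: {relative_reduction(10.20, 8.44):.1%}")
print(f"CER reduction: {relative_reduction(3.24, 2.93):.1%}")
```

Measured against the model without a language model, this works out to a relative reduction of about 17% in WER and about 10% in CER.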