---
language: es
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
---

# Wav2Vec2-Large-XLSR-53-Spanish-With-LM

This is a copy of [Wav2Vec2-Large-XLSR-53-Spanish](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish)
with added language model support.

This model card can be seen as a demo for the [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) integration
with Transformers, led by [this PR](https://github.com/huggingface/transformers/pull/14339). The PR explains in detail how the
integration works.

In a nutshell: this PR adds a new `Wav2Vec2ProcessorWithLM` class as a drop-in replacement for `Wav2Vec2Processor`.

The only change from the existing ASR pipeline will be:

```diff
-import torch
-from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from datasets import load_dataset

ds = load_dataset("common_voice", "es", split="test", streaming=True)

sample = next(iter(ds))

model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
-processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
+processor = Wav2Vec2ProcessorWithLM.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")

input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values

logits = model(input_values).logits

-prediction_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(prediction_ids)
+transcription = processor.batch_decode(logits).text

print(transcription)
```
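To make the removed lines in the diff concrete, here is a minimal sketch of what argmax-based (greedy) CTC decoding does: pick the most likely token per frame, collapse consecutive repeats, and drop the blank token. The toy vocabulary, the `BLANK_ID` value, and the helper names are assumptions made for illustration and are not taken from the model's tokenizer. The language-model path instead hands the full logits to a beam search that also scores candidate word sequences with an n-gram model, which is why `batch_decode` receives `logits` rather than token ids.

```python
from typing import List, Sequence

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "h", "o", "l", "a"]
BLANK_ID = 0


def greedy_ctc_decode(logits: Sequence[Sequence[float]]) -> str:
    """Argmax per frame, collapse consecutive repeats, then drop blanks."""
    frame_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    chars: List[str] = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the token changes and is not the blank.
        if token_id != previous and token_id != BLANK_ID:
            chars.append(VOCAB[token_id])
        previous = token_id
    return "".join(chars)


def one_hot(index: int, size: int = len(VOCAB)) -> List[float]:
    """Build a fake logit frame whose argmax is `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames: h h <pad> o l l <pad> a
toy_logits = [one_hot(i) for i in [1, 1, 0, 2, 3, 3, 0, 4]]
print(greedy_ctc_decode(toy_logits))  # prints "hola"
```

Greedy decoding commits to one token per frame and never revisits that choice. A beam search with a language model can prefer a slightly less likely acoustic path when it yields a more plausible word sequence.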

**Improvement**

This model was compared against the original on 512 speech samples from the Spanish Common Voice test set.
Adding the language model lowers WER from 10.20% to 8.44%, a relative reduction of about 17%:

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm | **8.44%** | **2.93%** |
| jonatasgrosman/wav2vec2-large-xlsr-53-spanish | 10.20% | 3.24% |

The results can be reproduced by running the following commands *from this model repository*.
The two invocations differ only in the first argument (`1` vs. `0`); `512` matches the number of evaluated samples.

```
python run_ngram_wav2vec2.py 1 512
```

```
python run_ngram_wav2vec2.py 0 512
```

with `run_ngram_wav2vec2.py` being
https://huggingface.co/patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm/blob/main/run_ngram_wav2vec2.py
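For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how WER and CER can be computed for a single sentence pair, and how a relative reduction between two table rows is derived. It is not the evaluation code from `run_ngram_wav2vec2.py`; the function names and the example sentences are assumptions made for illustration. Evaluation libraries typically aggregate edits over the whole test set rather than averaging per-sentence scores.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        current: List[int] = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical sentence pair with one wrong word ("come" transcribed as "como").
print(wer("el gato come pescado", "el gato como pescado"))  # 0.25
print(cer("el gato come pescado", "el gato como pescado"))  # 0.05

# Relative change between the two rows of the table above.
print(round(relative_reduction(10.20, 8.44), 3))  # 0.173 (WER)
print(round(relative_reduction(3.24, 2.93), 3))   # 0.096 (CER)
```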