File size: 4,548 Bytes
0289adf 437560f 0289adf 8f3f527 0289adf 8f3f527 0289adf 8f3f527 8d5e0cd 0289adf 80f132f 0289adf 8f3f527 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 |
---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: mit
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Librispeech sample 1
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: s2t-small-librispeech-asr
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 4.3
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 9
library_name: transformers
---
# SWRA (SWARA)
`SWRA (SWARA)` is a Speech to Text Transformer (S2T) model trained by @binarybardakshat for automatic speech recognition (ASR).
## Model Description
SWRA (SWARA) is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.
### How to Use
As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the transcripts by passing the speech features to the model.
*Note: The `Speech2TextProcessor` object uses [torchaudio](https://github.com/pytorch/audio) to extract the filter bank features. Make sure to install the `torchaudio` package before running this example.*
*Note: The feature extractor depends on [torchaudio](https://github.com/pytorch/audio) and the tokenizer depends on [sentencepiece](https://github.com/google/sentencepiece), so be sure to install those packages before running the examples.*
You could either install those as extra speech dependencies with `pip install transformers"[speech, sentencepiece]"` or install the packages separately with `pip install torchaudio sentencepiece`.
```python
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset
model = Speech2TextForConditionalGeneration.from_pretrained("binarybardakshat/swra-swara")
processor = Speech2TextProcessor.from_pretrained("binarybardakshat/swra-swara")
ds = load_dataset(
"patrickvonplaten/librispeech_asr_dummy",
"clean",
split="validation"
)
input_features = processor(
ds[0]["audio"]["array"],
sampling_rate=16_000,
return_tensors="pt"
).input_features # Batch size 1
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids)
#### Evaluation on LibriSpeech Test
The following script shows how to evaluate this model on the [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)
*"clean"* and *"other"* test dataset.
```python
from datasets import load_dataset
from evaluate import load
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test") # change to "other" for other test dataset
wer = load("wer")
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)
def map_to_pred(batch):
features = processor(batch["audio"]["array"], sampling_rate=16000, padding=True, return_tensors="pt")
input_features = features.input_features.to("cuda")
attention_mask = features.attention_mask.to("cuda")
gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
return batch
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
```
*Result (WER)*:
| "clean" | "other" |
|:-------:|:-------:|
| 4.3 | 9.0 |
## Training data
The S2T-SMALL-LIBRISPEECH-ASR is trained on [LibriSpeech ASR Corpus](https://www.openslr.org/12), a dataset consisting of
approximately 1000 hours of 16kHz read English speech. |