---
license: apache-2.0
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- 'no'
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
base_model: openai/whisper-small
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-small
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.432213777886737
---
# Whisper-small OpenVINO IR
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.
Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356) by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
**Disclaimer**: Content for this model card has partly been copied from the original [openai/whisper-small](https://huggingface.co/openai/whisper-small) model card.
## Model details
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. This repository, maintained by Intel, provides the whisper-small checkpoint converted to OpenVINO Intermediate Representation (IR) format. The table below lists the configuration of each Whisper variant for reference.
Model Type | n_vocab | n_audio_ctx | n_audio_state | n_audio_head | n_audio_layer | n_text_ctx | n_text_state | n_text_head | n_text_layer | n_mels | Parameters |
---|---|---|---|---|---|---|---|---|---|---|---|
whisper_tiny | 51865 | 1500 | 384 | 6 | 4 | 448 | 384 | 6 | 4 | 80 | 39 M |
whisper_tiny.en | 51864 | 1500 | 384 | 6 | 4 | 448 | 384 | 6 | 4 | 80 | 39 M |
whisper_base | 51865 | 1500 | 512 | 8 | 6 | 448 | 512 | 8 | 6 | 80 | 74 M |
whisper_base.en | 51864 | 1500 | 512 | 8 | 6 | 448 | 512 | 8 | 6 | 80 | 74 M |
whisper_small | 51865 | 1500 | 768 | 12 | 12 | 448 | 768 | 12 | 12 | 80 | 244 M |
whisper_small.en | 51864 | 1500 | 768 | 12 | 12 | 448 | 768 | 12 | 12 | 80 | 244 M |
whisper_medium | 51865 | 1500 | 1024 | 16 | 24 | 448 | 1024 | 16 | 24 | 80 | 769 M |
whisper_medium.en | 51864 | 1500 | 1024 | 16 | 24 | 448 | 1024 | 16 | 24 | 80 | 769 M |
whisper_large_v1 | 51865 | 1500 | 1280 | 20 | 32 | 448 | 1280 | 20 | 32 | 80 | 1550 M |
whisper_large_v2 | 51865 | 1500 | 1280 | 20 | 32 | 448 | 1280 | 20 | 32 | 80 | 1550 M |
whisper_large_v3 | 51866 | 1500 | 1280 | 20 | 32 | 448 | 1280 | 20 | 32 | 128 | 1550 M |
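
## Usage example

The OpenVINO IR in this repository can be loaded with `optimum-intel` and run through the standard `transformers` pipeline. The snippet below is a minimal sketch rather than an official example: the repository id is a placeholder (replace it with this repo's actual id), and the processor is loaded from the `openai/whisper-small` base model.

```python
# Minimal sketch: run the OpenVINO IR of whisper-small for short-form ASR.
# Assumptions: the repo id below is a placeholder for this repository's actual id,
# and optimum-intel, transformers, and an audio backend (ffmpeg/soundfile) are installed.
from optimum.intel import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

ov_model_id = "Intel/whisper-small-openvino"  # placeholder: use this repo's id
processor = AutoProcessor.from_pretrained("openai/whisper-small")  # base model processor

# Load the OpenVINO IR (no export step needed if the IR files are already in the repo)
model = OVModelForSpeechSeq2Seq.from_pretrained(ov_model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

# Transcribe a local audio file (resampled to 16 kHz by the feature extractor)
print(asr("sample.wav")["text"])
```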