Whisper Small Model Card
Whisper Small is a pre-trained model for automatic speech recognition (ASR) and speech translation. It is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision. The model has 244 million parameters and is multilingual.
Performance
Whisper Small transcribes accurately across many datasets and domains without the need for fine-tuning.
Usage
To transcribe audio samples, the model has to be used alongside a WhisperProcessor. The WhisperProcessor is used to pre-process the audio inputs (converting them to log-Mel spectrograms for the model) and post-process the model outputs (converting them from tokens to text).
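The pre-process, generate, post-process flow described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes the `transformers` and `torch` packages are installed, that the checkpoint is `openai/whisper-small`, and that the audio is a mono waveform already sampled at 16 kHz. The function name `transcribe` and the constants are hypothetical names chosen for this example.

```python
from typing import Sequence

# Assumed checkpoint name; Whisper's feature extractor expects 16 kHz audio.
MODEL_ID = "openai/whisper-small"
WHISPER_SAMPLING_RATE = 16_000


def transcribe(waveform: Sequence[float], sampling_rate: int,
               model_id: str = MODEL_ID) -> str:
    """Transcribe a mono waveform (hypothetical helper for illustration)."""
    if sampling_rate != WHISPER_SAMPLING_RATE:
        raise ValueError(
            f"expected {WHISPER_SAMPLING_RATE} Hz audio, got {sampling_rate} Hz; "
            "resample before calling transcribe()"
        )

    # Imported lazily so the module can be loaded without the heavy dependencies.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    # Pre-process: raw audio -> log-Mel spectrogram features.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    # Generate token ids from the spectrogram.
    predicted_ids = model.generate(inputs.input_features)

    # Post-process: token ids -> text.
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

A call might look like `transcribe(samples, sampling_rate=16_000)`, where `samples` is a list or array of floats loaded with an audio library of your choice.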
References
- https://huggingface.co/openai/whisper-small
- https://github.com/openai/whisper
- https://openai.com/research/whisper
- https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/
Model Details
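Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. Whisper Small was trained on 680k hours of labelled audio. The larger figure of 1 million hours of weakly labelled audio plus 4 million hours of pseudo-labelled audio collected using Whisper large-v2 describes the later large-v3 checkpoint, not this model.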
The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts a transcription in the same language as the audio. For speech translation, the model predicts a transcription in a different language from the audio.
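Switching between the two tasks is a generation-time choice rather than a different model. The sketch below shows one way to do it, assuming a reasonably recent `transformers` release in which `WhisperForConditionalGeneration.generate` accepts `task` and `language` keyword arguments. The helper names `generation_kwargs` and `run_task` are hypothetical, and `input_features` is assumed to come from a `WhisperProcessor`.

```python
from typing import Dict, List, Optional

# The two tasks the multilingual checkpoints were trained on.
VALID_TASKS = ("transcribe", "translate")


def generation_kwargs(task: str = "transcribe",
                      language: Optional[str] = None) -> Dict[str, str]:
    """Build the keyword arguments that select the task (hypothetical helper)."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    kwargs = {"task": task}
    # When no language is given, the model is left to detect it from the audio.
    if language is not None:
        kwargs["language"] = language
    return kwargs


def run_task(input_features, task: str = "transcribe",
             language: Optional[str] = None,
             model_id: str = "openai/whisper-small") -> List[str]:
    """Run recognition or translation on pre-computed log-Mel features."""
    # Imported lazily so the helper above works without the heavy dependencies.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    predicted_ids = model.generate(input_features,
                                   **generation_kwargs(task, language))
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)
```

For French audio, `run_task(features, task="transcribe", language="french")` would ask for French text, while `task="translate"` would ask for a translation instead.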
Uses
- Transcription
- Translation
Training hyperparameters
- learning_rate: 1e-5
- train_batch_size: 8
- eval_batch_size: 8
- lr_scheduler_warmup_steps: 500
- max_steps: 4000
- metric_for_best_model: wer
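As a rough guide to how the hyperparameters listed above could be expressed in code, the sketch below maps them onto `Seq2SeqTrainingArguments` from `transformers`. The mapping of names (for example `train_batch_size` to `per_device_train_batch_size`), the `greater_is_better` setting, and the output directory are assumptions, not details given in this card. Because the selection metric is WER, a small pure-Python word error rate function is included to show what is being minimized; a real training run would more likely use an established metric library.

```python
from typing import Any, Dict, List

# Values copied from the list above; keyword names are an assumed mapping
# onto Seq2SeqTrainingArguments.
HYPERPARAMETERS: Dict[str, Any] = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "warmup_steps": 500,
    "max_steps": 4000,
    "metric_for_best_model": "wer",
    "greater_is_better": False,  # assumption: lower WER is better
}


def build_training_args(output_dir: str = "./whisper-small-finetuned"):
    """Create training arguments (output_dir is a hypothetical path)."""
    # Imported lazily so the rest of the sketch runs without transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **HYPERPARAMETERS)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row dynamic programming for Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1] / len(ref)
```

With this definition, a hypothesis that gets one word wrong out of three reference words scores a WER of one third, and an exact match scores zero.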