Apply for community grant: Academic project (gpu)
Hi, this is a demo for our OWSM model, which is the first large-scale speech foundation model developed by the academic community.
OWSM (Open Whisper-style Speech Model) is developed by CMU WAVLab. It reproduces OpenAI's Whisper-style training from scratch using publicly available data and an open-source toolkit, ESPnet. We publicly release all the scripts, trained models, and training logs to promote transparency and open science in large-scale speech pre-training. Our paper has been accepted at IEEE ASRU 2023: https://arxiv.org/pdf/2309.13876.pdf
OWSM v3 is designed to support the following tasks:
- Speech recognition for 151 languages
- Any-to-any language speech translation
- Long-form transcription
- Timestamp prediction
- Language identification
Disclaimer: The model has not been thoroughly evaluated on all tasks. Due to limited training data, it may not perform well for low-resource languages or translation directions.
This model has 889M parameters and is trained on 180k hours of public ASR and ST data. Inference can be slow on CPU, so we would greatly appreciate GPU support.
Our demo is also available on Colab if anyone is interested: https://colab.research.google.com/drive/1zKI3ZY_OtZd6YmVeED6Cxy1QwT1mqv9O?usp=sharing
Here is a short video demonstrating the usage. OWSM even performs well at (zero-shot) contextual biasing ASR, where the candidate words are simply provided in the text prompt.
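For anyone who prefers to run the model locally rather than in this Space, here is a minimal sketch of how OWSM might be called through ESPnet's `Speech2Text` inference wrapper. The model id `espnet/owsm_v3`, the keyword arguments, the result-tuple layout, and the `build_task_symbols` helper are assumptions for illustration; please refer to the Colab notebook linked above for the exact usage.

```python
# Hedged sketch of local OWSM v3 inference via ESPnet (assumed API details).

def build_task_symbols(lang: str, task: str) -> tuple[str, str]:
    """Hypothetical helper: OWSM conditions decoding on special tokens,
    e.g. <eng> for the source language and <asr> for the task."""
    return f"<{lang}>", f"<{task}>"

if __name__ == "__main__":
    import soundfile as sf
    from espnet2.bin.s2t_inference import Speech2Text  # requires espnet

    lang_sym, task_sym = build_task_symbols("eng", "asr")
    s2t = Speech2Text.from_pretrained(
        "espnet/owsm_v3",   # assumed Hugging Face model id
        lang_sym=lang_sym,
        task_sym=task_sym,
        beam_size=5,
    )
    # 16 kHz mono audio; "example.wav" is a placeholder file name.
    speech, rate = sf.read("example.wav")
    # Take the text field of the top hypothesis (assumed result layout).
    text = s2t(speech)[0][0]
    print(text)
```

Swapping `task_sym` (e.g. to a translation token) is, to our understanding, how the same checkpoint is steered between ASR and ST.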