Apply for community grant: Academic project (gpu)
Hi, this is a demo for our OWSM model, which is the first large-scale speech foundation model developed by the academic community.
OWSM (Open Whisper-style Speech Model) is developed by CMU WAVLab. It reproduces OpenAI's Whisper-style training from scratch using publicly available data and an open-source toolkit, ESPnet. We publicly release all the scripts, trained models, and training logs to promote transparency and open science in large-scale speech pre-training. Our paper has been accepted at IEEE ASRU 2023: https://arxiv.org/pdf/2309.13876.pdf
OWSM v3 is designed to support the following tasks:
- Speech recognition for 151 languages
- Any-to-any language speech translation
- Long-form transcription
- Timestamp prediction
- Language identification
Disclaimer: The model has not been thoroughly evaluated on all tasks. Due to limited training data, it may not perform well for low-resource languages or translation directions.
This model has 889M parameters and is trained on 180k hours of public ASR and ST data. Inference can be slow on CPU, so we would greatly appreciate GPU support.
Our demo is also available on Colab if anyone is interested: https://colab.research.google.com/drive/1zKI3ZY_OtZd6YmVeED6Cxy1QwT1mqv9O?usp=sharing
Here is a short video demonstrating the usage. OWSM even performs well at (zero-shot) contextual biasing ASR, where the candidate words are simply provided in the text prompt.
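For anyone who prefers to run the model locally rather than in this Space, here is a minimal sketch of how OWSM might be called through ESPnet's `Speech2Text` inference wrapper. The model id `espnet/owsm_v3`, the keyword arguments, the result-tuple layout, and the `build_task_symbols` helper are assumptions for illustration; please refer to the Colab notebook linked above for the exact usage.

```python
# Hedged sketch of local OWSM v3 inference via ESPnet (assumed API details).

def build_task_symbols(lang: str, task: str) -> tuple[str, str]:
    """Hypothetical helper: OWSM conditions decoding on special tokens,
    e.g. <eng> for the source language and <asr> for the task."""
    return f"<{lang}>", f"<{task}>"

if __name__ == "__main__":
    import soundfile as sf
    from espnet2.bin.s2t_inference import Speech2Text  # requires espnet

    lang_sym, task_sym = build_task_symbols("eng", "asr")
    s2t = Speech2Text.from_pretrained(
        "espnet/owsm_v3",   # assumed Hugging Face model id
        lang_sym=lang_sym,
        task_sym=task_sym,
        beam_size=5,
    )
    # 16 kHz mono audio; "example.wav" is a placeholder file name.
    speech, rate = sf.read("example.wav")
    # Take the text field of the top hypothesis (assumed result layout).
    text = s2t(speech)[0][0]
    print(text)
```

Swapping `task_sym` (e.g. to a translation token) is, to our understanding, how the same checkpoint is steered between ASR and ST.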