Apply for community grant: Academic project (gpu)

#1 by pyf98

Hi, this is a demo for our OWSM model, which is the first large-scale speech foundation model developed by the academic community.

OWSM is an Open Whisper-style Speech Model from CMU WAVLab. It reproduces OpenAI's Whisper-style training from scratch using publicly available data and an open-source toolkit, ESPnet. We publicly release all the scripts, trained models, and training logs to promote transparency and open science in large-scale speech pre-training. Our paper has been accepted at IEEE ASRU 2023: https://arxiv.org/pdf/2309.13876.pdf

OWSM v3 is designed to support the following tasks:

  • Speech recognition for 151 languages
  • Any-to-any language speech translation
  • Long-form transcription
  • Timestamp prediction
  • Language identification

Disclaimer: the model has not been thoroughly evaluated on all tasks. Due to limited training data, it may not perform well for low-resource languages and translation directions.

This model has 889M parameters and was trained on 180k hours of public ASR and ST data. Inference can be slow on CPU, so we would greatly appreciate GPU support for this demo.
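
For anyone who wants to try the model outside this Space, a minimal inference sketch looks roughly like the following. This is not the exact demo code: the model tag `espnet/owsm_v3`, the audio file name, and the decoding options are illustrative assumptions, so please check our released scripts and the Colab notebook below for the exact usage.

```python
# Minimal ESPnet inference sketch for OWSM (illustrative; the model tag,
# audio path, and decoding options are assumptions, not the demo code).
import soundfile as sf
import torch
from espnet2.bin.s2t_inference import Speech2Text

device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU strongly preferred

s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3",  # assumed Hugging Face model tag
    device=device,
    beam_size=5,
    lang_sym="<eng>",            # target language token
    task_sym="<asr>",            # <asr> for recognition; a translation token for ST
)

speech, rate = sf.read("example.wav")  # 16 kHz mono audio
hyps = s2t(speech)                     # list of n-best hypotheses
print(hyps[0])                         # best hypothesis
```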

Our demo is also available on Colab if anyone is interested: https://colab.research.google.com/drive/1zKI3ZY_OtZd6YmVeED6Cxy1QwT1mqv9O?usp=sharing

Here is a short video demonstrating the usage. OWSM even performs well for (zero-shot) contextual biasing ASR, where the candidate words are simply provided in the text prompt.
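
As a rough sketch of that contextual-biasing usage (building on the snippet above; the `text_prev` argument name is an assumption made for illustration, and the Colab notebook shows the exact interface):

```python
# Zero-shot contextual biasing sketch: candidate words go into a plain-text
# prompt. The text_prev keyword below is assumed for illustration only;
# see the Colab demo for the exact parameter name.
bias_words = "OWSM ESPnet WAVLab"
hyps = s2t(speech, text_prev=bias_words)  # s2t and speech from the sketch above
print(hyps[0])
```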

Happy to support this! @pyf98 - I've attached a T4 small for now! Hope that works?

Thank you for your quick reply and generous support! @reach-vb
I'll try it soon!
