---
license: mit
---

# WhisperD

WhisperD is a fine-tuned version of whisper-large-v2 that can transcribe multi-speaker, conversational speech. It was used to generate synthetic transcriptions for training [Parakeet](https://jordandarefsky.com/blog/2024/parakeet/).

Diarization is performed implicitly by the model, with "[S1]", "[S2]", etc. denoting speaker identity. WhisperD is (often) able to transcribe non-speech events, e.g. "(coughs)", "(laughs)". Outputs include disfluencies.

### Example Output:

```
[S1] What's sort of cool is that, uh, you can produce coughs if you have to.
[S2] What do you mean?
[S1] Well, (coughs) there, I just coughed.
```

More details can be found in the [WhisperD blog post](https://jordandarefsky.com/blog/2024/parakeet/#spotify-dataset-and-whisperd).

### Caution:

This model has only been tested on segments up to 30 seconds in length. It may be unable to handle conditioning on previous text, as this was not included during fine-tuning. As a result, if a pipeline or codebase relies on previous-text conditioning to transcribe audio longer than 30 seconds, generation quality may be poor.

### Usage:

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained('jordand/whisper-d-v1a', torch_dtype=torch.float16).cuda()
processor = WhisperProcessor.from_pretrained('openai/whisper-large-v2')
tokenizer = WhisperTokenizer.from_pretrained('openai/whisper-large-v2')

# Fine-tuning did not use token suppression or forced decoder ids, so disable both.
model.generation_config.suppress_tokens = None
model.generation_config.forced_decoder_ids = None

# Load audio, downmix to mono, and resample to Whisper's expected 16 kHz.
audio, sr = torchaudio.load('PATH_TO_AUDIO_FILE')
audio = audio.mean(dim=0, keepdim=True)
audio = torchaudio.transforms.Resample(sr, 16000)(audio)
audio = audio[0, :16000 * 30]  # whisper-d-v1 can only handle up to 30 seconds of audio

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
model_out = model.generate(inputs['input_features'].cuda().half())
text = tokenizer.decode(model_out[0], skip_special_tokens=True)
print(text)
```
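
For audio longer than 30 seconds, one workaround is to transcribe non-overlapping 30-second windows independently and concatenate the results. The sketch below is not part of the official usage above and has not been validated by the model author; it reuses the `model`, `processor`, and `tokenizer` objects from the Usage section, and speaker labels ("[S1]", "[S2]", ...) are not guaranteed to stay consistent across windows, since each window is transcribed independently.

```python
# Sketch: naive long-form transcription via independent 30-second windows.
# Assumes `model`, `processor`, and `tokenizer` are already set up as in the
# Usage section above. Speaker labels may not line up across windows.
import torchaudio

def transcribe_long(path, window_seconds=30, sample_rate=16000):
    audio, sr = torchaudio.load(path)
    audio = audio.mean(dim=0)  # downmix to mono
    if sr != sample_rate:
        audio = torchaudio.transforms.Resample(sr, sample_rate)(audio)

    window = window_seconds * sample_rate
    pieces = []
    for start in range(0, audio.shape[0], window):
        chunk = audio[start:start + window]  # shorter final chunks are padded by the processor
        inputs = processor(chunk, sampling_rate=sample_rate, return_tensors="pt")
        out = model.generate(inputs['input_features'].cuda().half())
        pieces.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return " ".join(pieces)

print(transcribe_long('PATH_TO_AUDIO_FILE'))
```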