Whisper Model for Incorrect English Phrases

Overview

This model is a fine-tuned version of OpenAI's Whisper, trained on incorrect English phrases. It is designed to transcribe non-standard or erroneous spoken English, including mispronunciations, grammatical mistakes, slang, and errors typical of non-native speakers. It improves transcription accuracy when speakers use incorrect or informal English, which makes it useful for language learning, transcribing casual conversations, and analyzing speech from non-native English speakers.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 50
training_steps: 100000
mixed_precision_training: Native AMP
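
To make the schedule concrete, here is a minimal sketch of how the learning rate would evolve under these settings. It assumes the linear scheduler ramps up from zero during warmup and then decays linearly to zero at the last training step, as the linear schedule with warmup in Transformers does; the function name `linear_warmup_lr` is hypothetical and not part of the training code.

```python
def linear_warmup_lr(step, base_lr=1e-5, warmup_steps=50, total_steps=100_000):
    """Learning rate at a given optimizer step (assumed linear warmup, then linear decay)."""
    if step < warmup_steps:
        # Warmup: scale the base rate linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at a few illustrative steps
for step in (0, 25, 50, 5000, 100_000):
    print(step, linear_warmup_lr(step))
```

With 50 warmup steps out of 100000, the warmup phase is very short, so almost all of training runs on the slowly decaying part of the schedule.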

Training results

Training Loss	Epoch	Step	Validation Loss	WER
0.9094	0.1270	500	0.6347	24.3686
0.5517	0.2541	1000	0.4835	18.0769
0.5364	0.3811	1500	0.4330	15.1149
0.5503	0.5081	2000	0.4113	13.6524
0.6521	0.6352	2500	0.3987	13.5897
0.6044	0.7622	3000	0.3912	13.0538
0.5487	0.8892	3500	0.3835	12.6119
0.5297	1.0163	4000	0.3791	12.4408
0.46	1.1433	4500	0.3751	12.3525
0.4947	1.2703	5000	0.3721	12.1415
0.524	1.3974	5500	0.3682	13.0139
0.4743	1.5244	6000	0.3649	13.3388
0.5338	1.6514	6500	0.3621	12.9397
0.5162	1.7785	7000	0.3597	13.3246
0.5004	1.9055	7500	0.3590	12.3268
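
For readers unfamiliar with the last column, the following sketch shows how a word error rate can be computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. It assumes the values in the table are percentages and that no text normalization is applied; the card does not state how the metric was actually computed, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# The speaker's error ("go") is the reference; a transcript that silently
# corrects it to "went" counts as one substitution out of five words.
print(word_error_rate("she go to school yesterday", "she went to school yesterday"))
```

In this illustration the reference keeps the speaker's mistake, so a transcript that corrects it is penalized.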

Usage Guide

This project was executed on an Ubuntu 22.04.3 system running Linux kernel 6.8.0-40-generic.

Whisper large-v3 is supported in Hugging Face Transformers. To run the model, first install the Transformers library. For this example, we'll also install Hugging Face Datasets to load a toy audio dataset from the Hugging Face Hub, Hugging Face Accelerate to reduce the model loading time, and PEFT to load the LoRA adapter:

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate peft

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import os

import torch
from huggingface_hub import hf_hub_download
from peft import PeftConfig, PeftModel
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"


def download_adapter_model():
    """Download the LoRA adapter files and return the local adapter directory."""
    model_name = "whisper-v3-LoRA-en_students"
    print(f"Downloading the adapter model '{model_name}' from the Hugging Face Hub.", flush=True)

    # Directory where the adapter files are stored locally
    local_directory = os.path.expanduser("~/.cache/huggingface/hub")

    # Create the directory if it does not exist yet
    if not os.path.exists(local_directory):
        os.makedirs(local_directory)
        print(f"Directory '{local_directory}' created.", flush=True)
    else:
        print(f"Directory '{local_directory}' already exists.", flush=True)

    repo_id = f"Transducens/{model_name}"
    repo_adapter_dir = f"{model_name}/checkpoint-5000/adapter_model"
    repo_filename_config = f"{repo_adapter_dir}/adapter_config.json"
    repo_filename_tensors = f"{repo_adapter_dir}/adapter_model.safetensors"

    # With local_dir set, each file is saved under local_directory at its repository path
    hf_hub_download(repo_id=repo_id, filename=repo_filename_config, local_dir=local_directory)
    hf_hub_download(repo_id=repo_id, filename=repo_filename_tensors, local_dir=local_directory)

    print(f"Downloaded the adapter model '{model_name}' from the Hugging Face Hub.", flush=True)

    return os.path.join(local_directory, repo_adapter_dir)


# Download the adapter, then load the base model named in its configuration
adapter_path = download_adapter_model()
peft_model_id = adapter_path
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(peft_config.base_model_name_or_path)

# Attach the LoRA adapter and fix the generation language and task
model = PeftModel.from_pretrained(model, peft_model_id)
model.generation_config.language = "<|en|>"
model.generation_config.task = "transcribe"

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")

pipe = pipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor, task="automatic-speech-recognition", device=device)
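
Once the pipeline object exists, transcription is a matter of calling it on audio files. The helper below is a minimal sketch of that step, not part of the original project: `transcribe_files` is a hypothetical name, and it relies only on the fact that an automatic-speech-recognition pipeline returns a dictionary with a "text" key for each input.

```python
def transcribe_files(asr, audio_paths, **call_kwargs):
    """Run an ASR pipeline (or any compatible callable) over several audio files.

    Returns a dict that maps each path to its stripped transcript.
    Extra keyword arguments are forwarded unchanged to the pipeline call.
    """
    transcripts = {}
    for path in audio_paths:
        output = asr(path, **call_kwargs)
        transcripts[path] = output["text"].strip()
    return transcripts


# Example call with the pipeline built in this guide ("sample.wav" is a
# hypothetical local file; chunk_length_s splits long audio into chunks):
# transcripts = transcribe_files(pipe, ["sample.wav"], chunk_length_s=30)
```

Because the helper accepts any callable, it can be exercised with a stand-in function before the full model is downloaded.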


### Framework versions

- PEFT 0.11.1
- Transformers 4.42.4
- PyTorch 2.1.0+cu118
- Datasets 2.20.0
- Tokenizers 0.19.1
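
To compare a local environment against the versions listed above, a short check like the following may help. It is only a sketch: the distribution names in `CARD_VERSIONS` are assumed to match the pip package names, and `installed_versions` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card, keyed by the assumed pip distribution name
CARD_VERSIONS = {
    "peft": "0.11.1",
    "transformers": "4.42.4",
    "torch": "2.1.0+cu118",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}


def installed_versions(packages):
    """Return the installed version of each package, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(CARD_VERSIONS).items():
    print(f"{name}: installed={installed}, model card={CARD_VERSIONS[name]}")
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are simply the ones used when the card was written.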

