---
license: apache-2.0
language:
- en
base_model:
- openai/whisper-large-v3
metrics:
- accuracy
---

# Whisper Model for Incorrect English Phrases

## Overview

This model is a fine-tuned version of OpenAI’s Whisper large-v3, trained to transcribe incorrect English phrases.
It targets non-standard or erroneous English input, including mispronunciations,
grammatical mistakes, slang, and non-native speaker errors. The goal is better transcription accuracy
when speakers use incorrect or informal English, which makes the model useful for language learning,
transcription of casual conversations, and analysis of speech from non-native English speakers.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 50
- training_steps: 100000
- mixed_precision_training: Native AMP
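
The learning rate schedule implied by these settings can be sketched in a few lines. This is a minimal illustration that assumes the `linear` scheduler means linear warmup from zero followed by linear decay to zero at the last step; the function name `linear_schedule_lr` is hypothetical and not part of the training code.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=50, total_steps=100_000):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr over the first warmup_steps steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at a few points of the schedule.
for step in (0, 25, 50, 5000, 100_000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the rate peaks at step 50 and has decayed only slightly by step 5000, because the decay is spread over 100000 steps.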

### Training results

| Training Loss | Epoch  | Step | Validation Loss | WER     |
|:-------------:|:------:|:----:|:---------------:|:-------:|
| 0.9094        | 0.1270 | 500  | 0.6347          | 24.3686 |
| 0.5517        | 0.2541 | 1000 | 0.4835          | 18.0769 |
| 0.5364        | 0.3811 | 1500 | 0.4330          | 15.1149 |
| 0.5503        | 0.5081 | 2000 | 0.4113          | 13.6524 |
| 0.6521        | 0.6352 | 2500 | 0.3987          | 13.5897 |
| 0.6044        | 0.7622 | 3000 | 0.3912          | 13.0538 |
| 0.5487        | 0.8892 | 3500 | 0.3835          | 12.6119 |
| 0.5297        | 1.0163 | 4000 | 0.3791          | 12.4408 |
| 0.46          | 1.1433 | 4500 | 0.3751          | 12.3525 |
| 0.4947        | 1.2703 | 5000 | 0.3721          | 12.1415 |
| 0.524         | 1.3974 | 5500 | 0.3682          | 13.0139 |
| 0.4743        | 1.5244 | 6000 | 0.3649          | 13.3388 |
| 0.5338        | 1.6514 | 6500 | 0.3621          | 12.9397 |
| 0.5162        | 1.7785 | 7000 | 0.3597          | 13.3246 |
| 0.5004        | 1.9055 | 7500 | 0.3590          | 12.3268 |
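
For readers unfamiliar with the last column, word error rate is the word-level edit distance between the reference and the transcription, divided by the number of reference words. The sketch below assumes the table reports it as a percentage and applies no text normalization; the function `word_error_rate` is illustrative and may differ from the metric implementation used during training.

```python
def word_error_rate(reference, hypothesis):
    """Return the word error rate, in percent, of `hypothesis` against `reference`."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("she goes to school", "she go to school"))
```

A transcription identical to the reference scores 0, and values above 100 are possible when the hypothesis contains many inserted words.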

## Usage Guide

This project was executed on an Ubuntu 22.04.3 system running Linux kernel 6.8.0-40-generic.

Whisper large-v3 is supported in Hugging Face Transformers. To run the model, first install the Transformers library.
This example also installs Hugging Face Datasets to load a toy audio dataset from
the Hugging Face Hub, Hugging Face Accelerate to reduce the model loading time, and PEFT to load the LoRA adapter:

```bash
pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]" accelerate peft
```

The model can be used with the `pipeline` class to transcribe audio of arbitrary length:

```python
import os

import torch
from huggingface_hub import hf_hub_download
from peft import PeftConfig, PeftModel
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"


def download_adapter_model():
    """Download the LoRA adapter files and return the local adapter directory."""
    model_name = "whisper-v3-LoRA-en_students"
    print(f"Downloading the adapter model '{model_name}' from the Hugging Face Hub.", flush=True)

    # Local directory that will hold the downloaded files
    local_directory = os.path.expanduser("~/.cache/huggingface/hub")

    # Create the directory if it does not exist yet
    if not os.path.exists(local_directory):
        os.makedirs(local_directory)
        print(f"Directory '{local_directory}' created.", flush=True)
    else:
        print(f"Directory '{local_directory}' already exists.", flush=True)

    repo_id = f"Transducens/{model_name}"
    repo_adapter_dir = f"{model_name}/checkpoint-5000/adapter_model"
    repo_filename_config = f"{repo_adapter_dir}/adapter_config.json"
    repo_filename_tensors = f"{repo_adapter_dir}/adapter_model.safetensors"

    # Both files keep their repository-relative path under local_directory
    hf_hub_download(repo_id=repo_id, filename=repo_filename_config, local_dir=local_directory)
    hf_hub_download(repo_id=repo_id, filename=repo_filename_tensors, local_dir=local_directory)

    print(f"Downloaded the adapter model '{model_name}' from the Hugging Face Hub.", flush=True)

    return os.path.join(local_directory, repo_adapter_dir)


# Fetch the adapter, then load the base model named in its config
adapter_path = download_adapter_model()
peft_model_id = adapter_path
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=False
)

# Attach the LoRA adapter weights to the base model
model = PeftModel.from_pretrained(model, peft_model_id)
model.generation_config.language = "<|en|>"
model.generation_config.task = "transcribe"

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    device=device,
)
```


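Once `pipe` exists, transcription is a single call. The helper below is a minimal sketch: the name `transcribe` and the file `sample.wav` are illustrative and not part of the model repository. It assumes the pipeline accepts `chunk_length_s` at call time to split long recordings into windows, and that it returns a dictionary with a `"text"` key.

```python
def transcribe(asr, audio_path, chunk_length_s=30):
    """Transcribe one audio file with an ASR pipeline and return the text.

    `asr` is any callable with the pipeline's interface. Chunking lets the
    pipeline process recordings longer than Whisper's 30-second window.
    """
    result = asr(audio_path, chunk_length_s=chunk_length_s)
    return result["text"].strip()


# With the pipeline built above ("sample.wav" is a placeholder path):
# print(transcribe(pipe, "sample.wav"))
```

Passing the pipeline in as an argument keeps the helper independent of how the model was loaded.
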
### Framework versions

- PEFT 0.11.1
- Transformers 4.42.4
- PyTorch 2.1.0+cu118
- Datasets 2.20.0
- Tokenizers 0.19.1
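
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: the helper `find_version_mismatches` is hypothetical, and the distribution names (for example `torch` for PyTorch) are assumed to match the names the packages are installed under. Other versions may work equally well.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above, keyed by assumed distribution name.
EXPECTED_VERSIONS = {
    "peft": "0.11.1",
    "transformers": "4.42.4",
    "torch": "2.1.0+cu118",
    "datasets": "2.20.0",
    "tokenizers": "0.19.1",
}


def find_version_mismatches(expected):
    """Return {package: (wanted, installed)} for packages that differ or are missing."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
    print(f"{package}: card lists {wanted}, found {installed or 'not installed'}")
```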