WER = 100% !!
Hello everyone,
I am having an issue when fine-tuning OpenAI's Whisper Medium on Mozilla's Common Voice 11 dataset for the Arabic language.
The training and validation losses are both decreasing, but the WER jumps to 100% after some steps (especially once the loss drops below 1). The model itself seems to perform well, so I suspect the WER is simply being miscalculated (see the small WER check after the notes below).
Notes:
- This error only happens with the medium model; other models (small, tiny, large-v2, etc.) work fine.
- I am following the well-known blog post about Whisper fine-tuning (https://huggingface.co/blog/fine-tune-whisper).
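For reference, here is a minimal check of how the `wer` metric from the `evaluate` library (the one used in the blog post) behaves; an empty prediction already scores 100% WER:

```python
import evaluate

metric = evaluate.load("wer")  # same metric as in the fine-tuning blog post

# identical strings -> 0% WER
print(metric.compute(predictions=["hello world"], references=["hello world"]))  # 0.0

# an empty prediction counts every reference word as a deletion -> WER = 1.0 (100%)
print(metric.compute(predictions=[""], references=["hello world"]))  # 1.0
```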
I am facing the same issue.
Hey @Seyfelislem and @lnpwcd68730 ! Thank you both for reporting this issue. You might be interested in checking out the Whisper leaderboard for finding the most performant fine-tuned Whisper checkpoints in your language: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=ar&split=test
Good to see that the eval loss is still decreasing (it's pretty easy to overfit with Whisper fine-tuning). For the WER issue, what we can do is save the references and predictions to a `.txt` file and inspect them to see what sorts of errors the model is making. To do this, you can amend the `compute_metrics` function as follows:
```python
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # save references and predictions to a txt file for debugging
    with open("refs_and_preds.txt", "w") as f:
        for ref, pred in zip(label_str, pred_str):
            f.write(f"Ref: {ref}\n")
            f.write(f"Pred: {pred}\n\n")

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
```
Hey @sanchit-gandhi, @amyeroberts,
Thank you for your answers here and on GitHub. (I'd like to proceed with my question here on Hugging Face.)
So, I tried your suggestion of modifying `compute_metrics`, and it turns out that the transcriptions generated by the model are sometimes in Arabic (both Arabic script and Buckwalter transliteration), sometimes translated into another language (French, Russian, Chinese, etc.), and sometimes even empty!
Here are the results of the `transformers-cli env` command:
- `transformers` version: 4.28.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.8 (gpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: (True)
Hey @Seyfelislem - can you do two things please:
- Firstly, could you verify that you set the language correctly in your tokenizer and processor (i.e. that you set `language="Arabic"` and `task="transcribe"` in both the tokenizer and processor)?
- Secondly, could you add this line right after you set `forced_decoder_ids=None`:
  `model.generate = partial(model.generate, language="arabic", task="transcribe")`

This will now force the model to always predict in Arabic.
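In context, the relevant part of the training script would look roughly like this (a sketch based on the model-loading section of the blog post; only the last line is new):

```python
from functools import partial

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# as in the blog post: no forced decoder ids, no suppressed tokens
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

# new: pin the language and task for every generate() call during evaluation
model.generate = partial(model.generate, language="arabic", task="transcribe")
```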
Hey again @sanchit-gandhi,
I can confirm that I set the language correctly in the tokenizer and the processor. Now, after adding the line `model.generate = partial(model.generate, language="arabic", task="transcribe")`, there are no more problems with the WER.
Thank you very much for your efforts.
Hello @sanchit-gandhi,
I have the same problem, with the WER approaching 100%, for the Czech language. After I added the following two lines:

```python
from functools import partial
....
model.generate = partial(model.generate, language="Czech", task="transcribe")
```

the following error appeared during evaluation (after `eval_steps`):

`AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'`
Result of the `transformers-cli env` command:
- `transformers` version: 4.28.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: (True)
Detailed traceback:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2006, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_k...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2287, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py", line 159, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2993, in evaluate
    output = eval_loop(...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3174, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_o...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py", line 271, in prediction_step
    generated_tokens = self.model.generate(**inputs, **gen_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1576, in generate
    f"Unsupported language: {self.language}. Language should be one of..."
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, name))
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'
```
Capital "C" in my code language="Czech"
is probably wrong. I changed it to "czech" and it is working now.
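For anyone hitting the same error: the language names accepted by Whisper's `generate` appear to be the lowercase keys of the tokenizer's language mapping, so a quick check like this (a sketch using `TO_LANGUAGE_CODE` from the Whisper tokenizer module) can save a training run:

```python
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

print("Czech" in TO_LANGUAGE_CODE)  # False -> leads to the error above
print("czech" in TO_LANGUAGE_CODE)  # True
print(TO_LANGUAGE_CODE["czech"])    # "cs"
```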
Awesome - glad both issues have been fixed! Closing as complete - feel free to open a new issue if you find anything that looks wrong.
> Hey @Seyfelislem - can you do two things please:
> - Firstly, could you verify that you set the language correctly in your tokenizer and processor (i.e. that you set `language="Arabic"` and `task="transcribe"` in both the tokenizer and processor)?
> - Secondly, could you add this line right after you set `forced_decoder_ids=None`:
>   `model.generate = partial(model.generate, language="arabic", task="transcribe")`
>
> This will now force the model to always predict in Arabic.

Hello there, I was wondering which module the `partial` object comes from? I'd like to use this line, but `partial` is not defined.
Thanks
Hey @mohblnk,
You should add this line to import `partial`:
`from functools import partial`
For more details about this function, you should check this link:
https://www.geeksforgeeks.org/partial-functions-python
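In short, `functools.partial` wraps a callable with some arguments pre-filled. A minimal illustration (hypothetical names, unrelated to Whisper):

```python
from functools import partial

def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

say_hi = partial(greet, greeting="Hi")  # greet with greeting pre-filled
print(say_hi("Ada"))  # Hi, Ada!

# The fine-tuning script uses the same trick to pin language/task for every call:
# model.generate = partial(model.generate, language="arabic", task="transcribe")
```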
I had the same issue when fine-tuning Whisper Medium on Chinese and English: after some steps, the WER becomes 100%.
# wer is normal
Ref: Report a 3 mile final
Pred: Supporting 3 miles final.
Ref: ่ฟ่ท้28,ๅ็ฎญ710.
Pred: ่ฟ่ท้28,ๅ็ฎญ710.
# wer == 100%
Ref: ๆถๆณ192,่็ณปๆบๅช121.8 ๅ่ง.
Pred:
Ref: Yangtze River 8314, offset 2 miles left of the track, expedite descend and maintain 7200 meters.
Pred:
Ref: ็ฝ้นญ808, ่็ณป็ฆๅท่ฟ่ฟ125.175ๅ่ง.
Pred:
There is no pred_str. I also printed pred.predictions and pred.label_ids at every eval:
# wer is normal
# pred_ids
[[50258 50259 50359 ... 50257 50257 50257]
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
...
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]]
# label_ids
[[50258 50259 50359 ... 50257 50257 50257]
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
...
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]]
# wer == 100%
# pred_ids
[[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]
...
[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]]
# label_ids
[[50258 50259 50359 ... 50257 50257 50257]
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
...
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]]
And when I run inference with the fine-tuned model (the checkpoint whose eval WER == 100%), it works very well.
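One way to make sense of the ID dumps above is to decode the repeated IDs with the tokenizer. This is just a sketch assuming the standard multilingual Whisper tokenizer; the IDs are the ones printed above:

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-medium")

# IDs taken from the printouts above
print(tokenizer.convert_ids_to_tokens([50258, 50259, 50260, 50359, 50257]))
# e.g. ['<|startoftranscript|>', '<|en|>', '<|zh|>', '<|transcribe|>', '<|endoftext|>']
```

If that mapping holds, the "wer == 100%" batches are a start-of-transcript token followed immediately by end-of-text/padding, i.e. the model generated empty transcriptions at those evaluation steps, which matches the empty Pred: lines above.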