WER = 100% !!
Hello everyone,
I am having an issue when fine-tuning OpenAI's Whisper Medium on Mozilla's Common Voice 11 dataset for the Arabic language.
The training and validation losses are both decreasing, but the WER jumps to 100% after some steps (especially once the loss drops below 1). The model itself seems to perform well, so I suspect the WER is simply being miscalculated (see the small WER check after the notes below).
Notes:
- This error only happens with the medium model; other models (small, tiny, large-v2, etc.) work fine.
- I am following the well-known blog post about Whisper fine-tuning (https://huggingface.co/blog/fine-tune-whisper).
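For reference, here is a minimal check of how the `wer` metric from the `evaluate` library (the one used in the blog post) behaves; an empty prediction already scores 100% WER:

```python
import evaluate

metric = evaluate.load("wer")  # same metric as in the fine-tuning blog post

# identical strings -> 0% WER
print(metric.compute(predictions=["hello world"], references=["hello world"]))  # 0.0

# an empty prediction counts every reference word as a deletion -> WER = 1.0 (100%)
print(metric.compute(predictions=[""], references=["hello world"]))  # 1.0
```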
I am facing the same issue.
Hey @Seyfelislem and @lnpwcd68730 ! Thank you both for reporting this issue. You might be interested in checking out the Whisper leaderboard for finding the most performant fine-tuned Whisper checkpoints in your language: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=ar&split=test
Good to see that the eval loss is still decreasing (it's pretty easy to overfit with Whisper fine-tuning). For the WER issue, what we can do is save the references and predictions to a `.txt` file and inspect them to see what sorts of errors the model is making. To do this, you can amend the `compute_metrics` function as follows:
```python
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # save references and predictions to a txt file for debugging
    with open("refs_and_preds.txt", "w") as f:
        for ref, pred in zip(label_str, pred_str):
            f.write(f"Ref: {ref}\n")
            f.write(f"Pred: {pred}\n\n")

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
```
Hey @sanchit-gandhi, @amyeroberts,
Thank you for your answers here and on GitHub. (I'd like to proceed with my question here on Hugging Face.)
So, I tried your suggestion of modifying `compute_metrics`, and it turns out that the transcriptions generated by the model are sometimes in Arabic (both Arabic script and Buckwalter transliteration), sometimes translated into another language (French, Russian, Chinese, etc.), and sometimes even empty!
Here are the results of the `transformers-cli env` command:
- `transformers` version: 4.28.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.8 (gpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: (True)
Hey @Seyfelislem - can you do two things please:
- Firstly, could you verify that you set the language correctly in your tokenizer and processor (i.e. that you set `language="Arabic"` and `task="transcribe"` in both the tokenizer and processor)?
- Secondly, could you add this line right after you set `forced_decoder_ids=None`:
  `model.generate = partial(model.generate, language="arabic", task="transcribe")`

This will now force the model to always predict in Arabic.
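In context, the relevant part of the training script would look roughly like this (a sketch based on the model-loading section of the blog post; only the last line is new):

```python
from functools import partial

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# as in the blog post: no forced decoder ids, no suppressed tokens
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

# new: pin the language and task for every generate() call during evaluation
model.generate = partial(model.generate, language="arabic", task="transcribe")
```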
Hey again @sanchit-gandhi,
I can confirm that I set the language correctly in the tokenizer and the processor. Now, after adding the line `model.generate = partial(model.generate, language="arabic", task="transcribe")`, there are no more problems with the WER.
Thank you very much for your efforts.
Hello @sanchit-gandhi,
I have the same problem, with the WER approaching 100%, for the Czech language. After I added the following two lines:

```python
from functools import partial
....
model.generate = partial(model.generate, language="Czech", task="transcribe")
```

the following error appeared during evaluation (after `eval_steps`):

`AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'`
Result of the `transformers-cli env` command:
- `transformers` version: 4.28.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: (True)
Detailed traceback:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2006, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_k...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2287, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py", line 159, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2993, in evaluate
    output = eval_loop(...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3174, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_o...)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py", line 271, in prediction_step
    generated_tokens = self.model.generate(**inputs, **gen_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1576, in generate
    f"Unsupported language: {self.language}. Language should be one of..."
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, name))
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'
```
Capital "C" in my code language="Czech"
is probably wrong. I changed it to "czech" and it is working now.
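For anyone hitting the same error: the language names accepted by Whisper's `generate` appear to be the lowercase keys of the tokenizer's language mapping, so a quick check like this (a sketch using `TO_LANGUAGE_CODE` from the Whisper tokenizer module) can save a training run:

```python
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

print("Czech" in TO_LANGUAGE_CODE)  # False -> leads to the error above
print("czech" in TO_LANGUAGE_CODE)  # True
print(TO_LANGUAGE_CODE["czech"])    # "cs"
```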
Awesome - glad both issues have been fixed! Closing as complete - feel free to open a new issue if you find anything that looks wrong.
> Hey @Seyfelislem - can you do two things please:
> - Firstly, could you verify that you set the language correctly in your tokenizer and processor (i.e. that you set `language="Arabic"` and `task="transcribe"` in both the tokenizer and processor)?
> - Secondly, could you add this line right after you set `forced_decoder_ids=None`:
>   `model.generate = partial(model.generate, language="arabic", task="transcribe")`
>
> This will now force the model to always predict in Arabic.

Hello there, I was wondering which module the `partial` object comes from? I'd like to use this line, but `partial` is not defined.
Thanks
Hey @mohblnk,
You should add this line to import `partial`:
`from functools import partial`
For more details about this function, you should check this link:
https://www.geeksforgeeks.org/partial-functions-python
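In short, `functools.partial` wraps a callable with some arguments pre-filled. A minimal illustration (hypothetical names, unrelated to Whisper):

```python
from functools import partial

def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

say_hi = partial(greet, greeting="Hi")  # greet with greeting pre-filled
print(say_hi("Ada"))  # Hi, Ada!

# The fine-tuning script uses the same trick to pin language/task for every call:
# model.generate = partial(model.generate, language="arabic", task="transcribe")
```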
I had the same issue when fine-tuning Whisper Medium on Chinese and English: after some steps, the WER becomes 100%.
# wer is normal
Ref: Report a 3 mile final
Pred: Supporting 3 miles final.
Ref: ่ฟ่ท้28,ๅ็ฎญ710.
Pred: ่ฟ่ท้28,ๅ็ฎญ710.
# wer == 100%
Ref: ๆถๆณ192,่็ณปๆบๅช121.8 ๅ่ง.
Pred:
Ref: Yangtze River 8314, offset 2 miles left of the track, expedite descend and maintain 7200 meters.
Pred:
Ref: ็ฝ้นญ808, ่็ณป็ฆๅท่ฟ่ฟ125.175ๅ่ง.
Pred:
There is no pred_str. I also printed pred.predictions and pred.label_ids at every eval:
# wer is normal
# pred_ids
[[50258 50259 50359 ... 50257 50257 50257]
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
...
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]]
# label_ids
[[50258 50259 50359 ... 50257 50257 50257]
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
...
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]]
# wer == 100%
# pred_ids
[[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]
...
[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]
[50258 50257 50257 ... 50257 50257 50257]]
# label_ids
[[50258 50259 50359 ... 50257 50257 50257]
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
...
[50258 50259 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]
[50258 50260 50359 ... 50257 50257 50257]]
And when I run inference with the fine-tuned model (the checkpoint whose eval WER == 100%), it works very well.
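One way to make sense of the ID dumps above is to decode the repeated IDs with the tokenizer. This is just a sketch assuming the standard multilingual Whisper tokenizer; the IDs are the ones printed above:

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-medium")

# IDs taken from the printouts above
print(tokenizer.convert_ids_to_tokens([50258, 50259, 50260, 50359, 50257]))
# e.g. ['<|startoftranscript|>', '<|en|>', '<|zh|>', '<|transcribe|>', '<|endoftext|>']
```

If that mapping holds, the "wer == 100%" batches are a start-of-transcript token followed immediately by end-of-text/padding, i.e. the model generated empty transcriptions at those evaluation steps, which matches the empty Pred: lines above.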