Thanks, how do I fine-tune it?
...
Hi there,
Thank you for your interest in Phi-4-multimodal.
There are some example finetuning scripts in the repo, for example:
https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_finetune_speech.py
https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_finetune_vision.py
I hope you find them helpful.
Thanks, does this train only the LLM or also the speech adapter? If only the LLM, how do I fine-tune the speech adapter for a new spoken language?
@SamuelAzran
This example focuses on finetuning the LLM (speech LoRA) only. If you would like to finetune the speech encoder and adapter for new spoken languages, you may unfreeze the parameters of model.embed_tokens_extend.audio_embed by setting requires_grad to True.
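For reference, a minimal sketch of that unfreezing step (it assumes model is the already-loaded Phi-4-multimodal-instruct model; the rest of the training setup stays as in the sample script):

# Minimal sketch, not an official recipe: unfreeze the speech encoder/adapter
# before training. Assumes `model` was loaded with
# AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct",
# trust_remote_code=True).
for param in model.embed_tokens_extend.audio_embed.parameters():
    param.requires_grad = True

# Optional sanity check: count how many parameter tensors are now trainable.
num_trainable = sum(1 for p in model.parameters() if p.requires_grad)
print(f"trainable parameter tensors: {num_trainable}")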
Thank you for your informative and quick response! I will try it.
I found that during the evaluation loop inside the Trainer, GPU memory consumption incrementally increases; maybe the CUDA cache or memory is not handled properly. I created my own evaluation loop to override HF's, in case anyone needs it:
import gc
from typing import List, Optional

import sacrebleu
import torch
from accelerate.utils import gather_object
from torchmetrics.text import CharErrorRate, WordErrorRate
from tqdm import tqdm
from transformers import StoppingCriteriaList, Trainer
from transformers.trainer_utils import EvalLoopOutput

# MultipleTokenBatchStoppingCriteria is the stopping-criteria helper defined in
# the sample_finetune_speech.py script linked above.


class CustomTrainer(Trainer):
    def __init__(self, stopping_criteria_list=None, processor=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.processor = processor
        stop_tokens = ["<|end|>", self.processor.tokenizer.eos_token]
        stop_tokens_ids = self.processor.tokenizer(
            stop_tokens, add_special_tokens=False, padding="longest", return_tensors="pt"
        )["input_ids"]
        # Assumes the model lives on cuda:0
        self.stop_tokens_ids = stop_tokens_ids.to("cuda:0")
        self.stopping_criteria_list = stopping_criteria_list

    def evaluation_loop(
        self,
        dataloader,
        description: str,
        prediction_loss_only: Optional[bool] = None,
        ignore_keys: Optional[List[str]] = None,
        metric_key_prefix: str = "eval",
    ) -> EvalLoopOutput:
        """
        Optimized evaluation loop that only runs generation once per input
        and frees GPU memory after every batch.
        """
        model = self.model
        processor = self.processor
        accelerator = self.accelerator

        # Ensure the model is in evaluation mode
        model.eval()

        all_generated_texts = []
        all_labels = []
        total_eval_loss = 0.0
        num_eval_steps = 0

        # Progress bar for the main process only
        progress_bar = tqdm(
            enumerate(dataloader),
            disable=not accelerator.is_local_main_process,
            total=len(dataloader),
            desc=f"Evaluation ({metric_key_prefix})",
        )

        for step, inputs in progress_bar:
            with torch.no_grad():
                # Move inputs to the appropriate device
                inputs = self._prepare_inputs(inputs)

                # Set up stopping criteria for generation
                if not self.stopping_criteria_list:
                    stop_criteria = MultipleTokenBatchStoppingCriteria(
                        self.stop_tokens_ids,
                        batch_size=inputs["input_ids"].size(0),
                    )
                    self.stopping_criteria_list = StoppingCriteriaList([stop_criteria])

                # Run generation with return_dict_in_generate=True to get scores
                generation_outputs = model.generate(
                    **inputs,
                    eos_token_id=processor.tokenizer.eos_token_id,
                    max_new_tokens=500,
                    stopping_criteria=self.stopping_criteria_list,
                    return_dict_in_generate=True,
                    output_scores=True,
                )
                generated_ids = generation_outputs.sequences

                # Get the actual labels for loss calculation
                labels = inputs["labels"].detach().clone()

                # Process the generated output for evaluation
                if self.stopping_criteria_list and hasattr(self.stopping_criteria_list[0], "stop_tokens_idx"):
                    # Trim each sequence at the position where a stop token fired
                    stop_tokens_idx = self.stopping_criteria_list[0].stop_tokens_idx.reshape(
                        inputs["input_ids"].size(0), -1
                    )[:, 0]
                    stop_tokens_idx = torch.where(
                        stop_tokens_idx > 0,
                        stop_tokens_idx - self.stop_tokens_ids.shape[-1],
                        generated_ids.shape[-1],
                    )
                    generated_text = [
                        processor.decode(
                            _pred_ids[inputs["input_ids"].shape[1]:_stop_tokens_idx],
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=False,
                        )
                        for _pred_ids, _stop_tokens_idx in zip(generated_ids, stop_tokens_idx)
                    ]
                else:
                    # Fallback if the stopping criteria do not expose stop_tokens_idx
                    generated_text = processor.batch_decode(
                        generated_ids[:, inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=False,
                    )
                all_generated_texts.extend(generated_text)

                # Process labels
                labels[labels == -100] = processor.tokenizer.pad_token_id
                label_text = processor.batch_decode(labels, skip_special_tokens=True)
                # If you have a specific suffix to remove (str.rstrip strips a
                # character set, not a suffix, so use endswith instead)
                if hasattr(self, "ANSWER_SUFFIX"):
                    label_text = [
                        text[: -len(self.ANSWER_SUFFIX)] if text.endswith(self.ANSWER_SUFFIX) else text
                        for text in label_text
                    ]
                all_labels.extend(label_text)

                # Run a separate forward pass just for the loss; this is cheaper
                # than a second full generate() call
                outputs = model(**inputs)
                loss = outputs.loss
                # Scale the loss
                if accelerator.use_distributed:
                    loss = loss.mean()
                total_eval_loss += loss.detach().float()

                # Explicit memory cleanup after each batch
                del generated_ids, generated_text, labels, label_text, outputs, generation_outputs
                torch.cuda.empty_cache()
                gc.collect()

            num_eval_steps += 1

        # Gather results from all processes if distributed
        all_generated_texts = gather_object(all_generated_texts)
        all_labels = gather_object(all_labels)

        # Compute metrics
        cer = CharErrorRate()(all_generated_texts, all_labels)
        wer = WordErrorRate()(all_generated_texts, all_labels)
        bleu = sacrebleu.corpus_bleu(all_generated_texts, [all_labels])

        # Convert tensor metrics to native Python types
        metrics = {
            f"{metric_key_prefix}_loss": float(total_eval_loss / num_eval_steps),
            f"{metric_key_prefix}_cer": float(cer.item()) if isinstance(cer, torch.Tensor) else float(cer),
            f"{metric_key_prefix}_wer": float(wer.item()) if isinstance(wer, torch.Tensor) else float(wer),
            f"{metric_key_prefix}_bleu": float(bleu.score),
        }

        # Clean up memory
        del all_generated_texts, all_labels
        gc.collect()
        torch.cuda.empty_cache()

        # Required output format for Trainer.evaluate()
        return EvalLoopOutput(
            predictions=None,
            label_ids=None,
            metrics=metrics,
            num_samples=len(dataloader.dataset),
        )
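In case anyone wants to plug this in, here is a rough usage sketch. The model, processor, datasets, and collator are assumed to be built as in the sample finetuning script; the hyperparameters are only placeholders:

from transformers import TrainingArguments

# Rough usage sketch only; model, processor, train_dataset, eval_dataset and
# collator are assumed to exist already (e.g. built as in sample_finetune_speech.py).
training_args = TrainingArguments(
    output_dir="./phi4mm-asr-finetune",  # placeholder path
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    eval_strategy="steps",               # "evaluation_strategy" on older transformers versions
    eval_steps=500,
    num_train_epochs=1,
    learning_rate=4e-5,
    bf16=True,
    remove_unused_columns=False,         # keep the audio/vision fields intact
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator,
    processor=processor,                 # extra kwarg consumed by __init__ above
)
trainer.train()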
@nguyenbh Can you kindly inform me how to fine-tune the "base weights"/text base? Also, if I fine-tune the base weights, I'm guessing I need to fine-tune both the vision and audio LoRAs as well?
It seems that ms-swift also supports training Phi-4-multimodal: https://github.com/modelscope/ms-swift/pull/3350.
@ysdede
I came across this repo: https://huggingface.co/ysdede/Phi-4-mm-inst-asr-turkish-3
The results of finetuning for Turkish ASR look very promising.
Before Fine-Tuning:
+ WER: 153.84
+ CER: 82.57
After Fine-Tuning:
+ WER: 64.76
+ CER: 29.85
Also the results from @seastar105 on extending Phi-4-multimodal to Korean ASR and En-Ko speech translation are very promising: https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor
@nguyenbh, I want to do continual pretraining on a Vietnamese corpus (text only) and then finetune the LLM base (text only). Is there any guideline for this? Phi-4 has essentially no Vietnamese at all! After that I will finetune mixed image/voice. Thank you in advance.
@nguyenbh
@shtpgshus
Hi, I want to add Persian to Phi-4-multimodal-instruct for text, speech, and vision. I have a few questions:
For text: Is adding extra_special_tokens to the tokenizer enough for Persian, or do I need to retrain the tokenizer?
For speech: Is setting model.embed_tokens_extend.audio_embed.requires_grad = True sufficient, or are other changes to the script needed?
For vision: when fine-tuning on Persian data (e.g., images with Persian text), should I also unfreeze parts of the model?
Thanks for your help!
Progress Update on Turkish ASR Fine-Tuning
I am now getting 9–20% WER scores by unfreezing the audio encoder. Initially, I experimented with various approaches such as selectively unfreezing only audio-related layers, separating speech LoRA, and storing speech LoRA independently after fine-tuning. However, I still observed some unintended unfreezing of vision-related layers.
After further experimentation, I simplified the approach by unfreezing all relevant layers (see this list) and increasing the learning rate to enhance ASR performance.
For detailed benchmark results, please refer to the results page and explore the finetuning Colab notebook.
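As a rough illustration of that broader unfreezing (the exact layer list is in the linked notebook; filtering parameter names on "audio" is only an assumption here to show the pattern):

# Illustrative sketch only, not the exact recipe from the notebook: unfreeze
# parameters whose names look audio-related, then train with a higher
# learning rate than the LoRA-only setup.
for name, param in model.named_parameters():
    if "audio" in name.lower():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M")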
@ysdede This is awesome! Thank you for sharing the Turkish finetuning recipe and notebook with the community.
I also see other very cool finetuned models for Korean language tasks from @junnei (https://huggingface.co/junnei/Phi-4-multimodal-instruct-ko-asr) and @daekeun-ml (https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech).
@thusinh1969 We do not release the base language model, so continual pre-training will be a challenge for Vietnamese. The Phi-4-mini pretraining data does contain some Vietnamese, so I would suggest running SFT training on Vietnamese with lots of high-quality data.
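For anyone attempting that text-only SFT, a minimal data-formatting sketch; the field names "instruction" and "output" are hypothetical placeholders for whatever your dataset uses, and only the <|user|>/<|assistant|>/<|end|> markers come from the Phi-4-multimodal prompt format:

from transformers import AutoProcessor

# Minimal sketch for preparing text-only Vietnamese SFT examples.
# "instruction" and "output" are placeholder column names.
processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)
tokenizer = processor.tokenizer

def build_sft_example(example):
    # Phi-4-multimodal chat format: <|user|> ... <|end|><|assistant|> ... <|end|>
    text = (
        f"<|user|>{example['instruction']}<|end|>"
        f"<|assistant|>{example['output']}<|end|>"
    )
    return tokenizer(text, truncation=True, max_length=2048)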