Thanks, how to fine-tune?

Opened by NickyNicky

Hi there,
Thank you for your interest in Phi-4-multimodal.
There are some example fine-tuning scripts in the repo, for example:
https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_finetune_speech.py
https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_finetune_vision.py
I hope you find them helpful.

Thanks, is this training only the LLM or also the speech adapter? If not, how do I fine-tune the speech adapter for a new spoken language?

@SamuelAzran This example focuses on finetuning the LLM (Speech LoRA) only. If you would like to finetune the speech encoder and adapter for new spoken languages, you may unfreeze the parameters of model.embed_tokens_extend.audio_embed by setting requires_grad to True.
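For reference, a minimal sketch of that change (assuming `model` is the loaded Phi-4-multimodal-instruct model from the sample fine-tuning script):

# Minimal sketch (assumes `model` is the loaded Phi-4-multimodal-instruct model
# from the sample fine-tuning script): unfreeze the speech encoder/adapter so it
# trains alongside the Speech LoRA.
for param in model.embed_tokens_extend.audio_embed.parameters():
    param.requires_grad = True

# Optional sanity check: report how many parameters are now trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")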

Thank you for your informative and quick response! I will try it.

I found that during the evaluation loop inside the Trainer, GPU memory consumption increases incrementally. Maybe the CUDA cache or memory is not handled properly. I created my own evaluation loop to override the Hugging Face one, in case anyone needs it:

import gc
from typing import List, Optional

import sacrebleu
import torch
from accelerate.utils import gather_object
from torchmetrics.text import CharErrorRate, WordErrorRate
from tqdm import tqdm
from transformers import StoppingCriteriaList, Trainer
from transformers.trainer_utils import EvalLoopOutput

# MultipleTokenBatchStoppingCriteria is the batch-aware stopping criteria class
# defined in the sample fine-tuning script (sample_finetune_speech.py).


class CustomTrainer(Trainer):
    def __init__(self, stopping_criteria_list=None, processor=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.processor = processor
        stop_tokens = ["<|end|>", self.processor.tokenizer.eos_token]
        stop_tokens_ids = self.processor.tokenizer(
            stop_tokens, add_special_tokens=False, padding="longest", return_tensors="pt"
        )["input_ids"]
        # Keep the stop-token ids on the model's device rather than hard-coding cuda:0
        self.stop_tokens_ids = stop_tokens_ids.to(self.model.device)
        self.stopping_criteria_list = stopping_criteria_list

    def evaluation_loop(
        self,
        dataloader,
        description: str,
        prediction_loss_only: Optional[bool] = None,
        ignore_keys: Optional[List[str]] = None,
        metric_key_prefix: str = "eval",
    ) -> EvalLoopOutput:
        """
        Optimized evaluation loop that only runs the model once per input.
        """
        model = self.model
        processor = self.processor
        accelerator = self.accelerator
        
        # Ensure the model is in evaluation mode
        model.eval()
        
        all_generated_texts = []
        all_labels = []
        total_eval_loss = 0
        num_eval_steps = 0
        
        # Progress bar for main process only
        progress_bar = tqdm(
            enumerate(dataloader),
            disable=not accelerator.is_local_main_process,
            total=len(dataloader),
            desc=f"Evaluation ({metric_key_prefix})"
        )
        
        for step, inputs in progress_bar:
            with torch.no_grad():
                # Move inputs to appropriate device
                inputs = self._prepare_inputs(inputs)
                
                # Build stopping criteria for generation; recreate them per batch so the
                # tracked stop indices stay in sync with the current batch size
                if self.stopping_criteria_list:
                    stopping_criteria = self.stopping_criteria_list
                else:
                    stopping_criteria = StoppingCriteriaList([
                        MultipleTokenBatchStoppingCriteria(
                            self.stop_tokens_ids,
                            batch_size=inputs["input_ids"].size(0),
                        )
                    ])

                # Run generation with return_dict_in_generate=True to get scores
                generation_outputs = model.generate(
                    **inputs,
                    eos_token_id=processor.tokenizer.eos_token_id,
                    max_new_tokens=500,
                    stopping_criteria=stopping_criteria,
                    return_dict_in_generate=True,
                    output_scores=True,
                )
                
                # Sequences returned by generate() include the prompt tokens, so the
                # prompt length is stripped before decoding further below
                generated_ids = generation_outputs.sequences
                
                # Get the actual labels for loss calculation
                labels = inputs["labels"].detach().clone()
                
                # Process the generated output for evaluation
                if hasattr(stopping_criteria[0], "stop_tokens_idx"):
                    stop_tokens_idx = stopping_criteria[0].stop_tokens_idx.reshape(inputs["input_ids"].size(0), -1)[:, 0]
                    stop_tokens_idx = torch.where(
                        stop_tokens_idx > 0,
                        stop_tokens_idx - self.stop_tokens_ids.shape[-1],
                        generated_ids.shape[-1],
                    )
                    generated_text = [
                        processor.decode(
                            _pred_ids[inputs["input_ids"].shape[1]:_stop_tokens_idx],
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=False
                        )
                        for _pred_ids, _stop_tokens_idx in zip(generated_ids, stop_tokens_idx)
                    ]
                else:
                    # Fallback if no stopping criteria with stop_tokens_idx
                    generated_text = processor.batch_decode(
                        generated_ids[:, inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=False
                    )
                
                all_generated_texts.extend(generated_text)
                
                # Process labels
                labels[labels == -100] = processor.tokenizer.pad_token_id
                label_text = processor.batch_decode(
                    labels, 
                    skip_special_tokens=True
                )
                
                # If a specific answer suffix needs removing (removesuffix strips the exact
                # suffix; rstrip would strip a set of characters instead)
                if hasattr(self, "ANSWER_SUFFIX"):
                    label_text = [text.removesuffix(self.ANSWER_SUFFIX) for text in label_text]
                
                all_labels.extend(label_text)
                
                # Calculate loss using the original inputs
                # Run a separate forward pass just for loss calculation 
                # This is more efficient than two full generate() calls
                outputs = model(**inputs)
                loss = outputs.loss
                
                # Scale the loss
                if accelerator.use_distributed:
                    loss = loss.mean()
                total_eval_loss += loss.detach().float()
                
                # Explicit memory cleanup after each batch
                del generated_ids, generated_text, labels, label_text, outputs, generation_outputs
                torch.cuda.empty_cache()
                gc.collect()
            
            num_eval_steps += 1
        
        # Gather results from all processes if distributed
        all_generated_texts = gather_object(all_generated_texts)
        all_labels = gather_object(all_labels)
        
        # Compute metrics
        cer = CharErrorRate()(all_generated_texts, all_labels)
        wer = WordErrorRate()(all_generated_texts, all_labels)
        bleu = sacrebleu.corpus_bleu(all_generated_texts, [all_labels])
        
        # Convert tensor metrics to native Python types
        metrics = {
            f"{metric_key_prefix}_loss": float(total_eval_loss.item() / num_eval_steps),
            f"{metric_key_prefix}_cer": float(cer.item()) if isinstance(cer, torch.Tensor) else float(cer),
            f"{metric_key_prefix}_wer": float(wer.item()) if isinstance(wer, torch.Tensor) else float(wer),
            f"{metric_key_prefix}_bleu": float(bleu.score)
        }
        
        # Clean up memory
        del all_generated_texts, all_labels
        gc.collect()
        torch.cuda.empty_cache()
        
        # Required format for the output
        return EvalLoopOutput(
            predictions=None,
            label_ids=None,
            metrics=metrics,
            num_samples=len(dataloader.dataset)
        )
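A usage sketch for the class above, assuming `model`, `training_args`, the datasets, and `collate_fn` are set up as in the sample fine-tuning script (these variable names are illustrative, not from the post):

# Illustrative wiring; everything except `processor` is forwarded to the standard Trainer.
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    processor=processor,
)
metrics = trainer.evaluate()
print(metrics)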

@nguyenbh Can you kindly tell me how to fine-tune the "base weights"/text base? Also, if I fine-tune the base weights, I'm guessing I need to fine-tune both the vision and audio LoRAs as well?

It seems that ms-swift also supports training Phi-4-multimodal: https://github.com/modelscope/ms-swift/pull/3350.


@ysdede I have come across this repo: https://huggingface.co/ysdede/Phi-4-mm-inst-asr-turkish-3
The results of fine-tuning for Turkish ASR look very promising.

Before Fine-Tuning:
+ WER: 153.84
+ CER: 82.57

After Fine-Tuning:
+ WER: 64.76
+ CER: 29.85

Also the results from @seastar105 on extending Phi-4-multimodal to Korean ASR and En-Ko speech translation are very promising:
https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor

@nguyenbh, I would like to do continual pretraining on a Vietnamese corpus (text only) and then fine-tune the LLM base (text only). Is there any guideline for this, since Phi-4 has absolutely no Vietnamese? After that I plan to fine-tune on mixed image/voice data. Thank you in advance.

@nguyenbh @shtpgshus
Hi, I want to add Persian to Phi-4-multimodal-instruct for text, speech, and vision. I have a few questions:

For text: Is adding extra_special_tokens to the tokenizer enough for Persian, or do I need to retrain the tokenizer?
For speech: Is setting model.embed_tokens_extend.audio_embed.requires_grad = True sufficient, or are other changes to the script needed?
For vision: For fine-tuning on Persian data (e.g., images with Persian text), should I also unfreeze parts of the model?
Thanks for your help!

Progress Update on Turkish ASR Fine-Tuning

I am now getting 9–20% WER scores by unfreezing the audio encoder. Initially, I experimented with various approaches such as selectively unfreezing only audio-related layers, separating speech LoRA, and storing speech LoRA independently after fine-tuning. However, I still observed some unintended unfreezing of vision-related layers.

After further experimentation, I simplified the approach by unfreezing all relevant layers (see this list) and increasing the learning rate to enhance ASR performance.
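A rough sketch of the selective-unfreezing idea (the prefix below is an assumption standing in for the full layer list, not the exact set used in this recipe):

# Rough sketch: unfreeze only parameters whose names start with audio-related
# prefixes and keep everything else (e.g. vision layers) frozen.
audio_prefixes = ("model.embed_tokens_extend.audio_embed",)  # illustrative prefix list

for name, param in model.named_parameters():
    param.requires_grad = name.startswith(audio_prefixes)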

For detailed benchmark results, please refer to the results page and explore the finetuning Colab notebook.

@nguyenbh

@ysdede This is awesome! Thank you for sharing the Turkish finetuning recipe and notebook with the community.

I also see other very cool finetuned models for Korean language tasks from
@junnei https://huggingface.co/junnei/Phi-4-multimodal-instruct-ko-asr and @daekeun-ml https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech

@thusinh1969 We do not release the base language model, therefore continual pre-training will be a challenge for Vietnamese. Phi-4-mini's pretraining data does contain some Vietnamese, so I would suggest running SFT training on Vietnamese with lots of high-quality data.
