README.md · microsoft/Phi-4-multimodal-instruct-onnx at 29e40c6d15495631a91b068dab237c74cab6574b

metadata

license: mit
language:
  - multilingual
tags:
  - nlp
  - code
  - audio
  - automatic-speech-recognition
  - speech-summarization
  - speech-translation
  - visual-question-answering
  - phi-4-multimodal
  - phi
  - phi-4-mini

Phi-4 Multimodal Instruct ONNX models

Introduction

This is an ONNX version of the Phi-4 multimodal model to accelerate inference with ONNX Runtime.

This model is quantized to int4 precision and runs on GPU devices.

To run this model with ONNX Runtime:

Download the model:

git clone https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx

Download the script to run the model:

curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/phi4-mm.py -o phi4-mm.py

Run the script

python phi4-mm.py -m Phi-4-multimodal-instruct-onnx/gpu/gpu-int4-rtn-block-32 -e cuda

You will be prompted to provide any images, audios, and a prompt.

The performance of the text component is similar to the [Phi-4 mini ONNX models] (https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx/blob/main/README.md)

Model Description

Developed by: Microsoft
Model type: ONNX
License: MIT
Model Description: This is a conversion of Phi4 mini model for ONNX Runtime inference.

Disclaimer: Model is only an optimization of the base model, any risk associated with the model is the responsibility of the user of the model. Please verify and test for you scenarios. There may be a slight difference in output from the base model with the optimizations applied.

Base Model

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generating text outputs, and comes with 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning, and direct preference optimization to support precise instruction adherence and safety measures.

See details here