OpenVINO IR format with Optimum

#4
by Echo9Zulu - opened

Hello!

This work is awesome! Can these models be converted to the OpenVINO IR format? The model cards mention custom quantization methods that are not discussed in the Intel documentation. I am running Arc GPUs with Vulkan drivers for GGUF, but I need to leverage the Intel AI dev tools for faster inference. Intel's documentation only covers converting to IR from full precision, not from models that have already been quantized, which may limit my ability to use your models since they are in GGUF.

Let me know what you think, and this is awesome work.

Thank you for the compliments.

RE: OpenVINO IR - can you reply with a link?

RE: Full precision.
To create the full version you need Mergekit plus the mergefile and source models.
This will produce a full-precision source version that you can then convert into any "quant" or format.
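Roughly, the merge step looks like this (a sketch, not tested here; "mergefile.yml" and the output directory name are placeholders for the config published in the repo linked below):

```shell
# Install mergekit
pip install mergekit

# Re-create the full-precision merge from the published config.
# mergefile.yml = the merge formula from the linked repo (name illustrative);
# the second argument is where the merged safetensors model is written.
mergekit-yaml mergefile.yml ./psyonic-cetacean-20B-full
```

From that output directory you can quantize or convert to whatever format you need.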

The mergekit file / formula is located here:
https://huggingface.co/jebcarter/psyonic-cetacean-20B

No problem.

Check these out for an overview:

https://docs.openvino.ai/2024/openvino-workflow/model-preparation/convert-model-to-ir.html
https://docs.openvino.ai/2024/openvino-workflow/model-preparation/conversion-parameters.html
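Since this thread is about Optimum: the simplest path from a full-precision Transformers checkpoint to IR is the `optimum-cli` exporter rather than the raw converter (a sketch, assuming a local full-precision model directory; the int8 option is just one choice):

```shell
# Install Optimum with OpenVINO support
pip install "optimum[openvino]"

# Export a full-precision model to OpenVINO IR.
# --weight-format applies weight-only quantization at export time
# (int8 shown; int4 is also supported). The last argument is the output dir.
optimum-cli export openvino \
  --model ./psyonic-cetacean-20B-full \
  --weight-format int8 \
  ov_model
```

This is why starting from the full-precision merge matters: the exporter wants the original weights, not a GGUF.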

TL;DR: Intel is propping up their GPU ecosystem on a model-serving framework meant for production use cases. The goal appears to be drop-in substitutes for popular libraries to enrich the value proposition of switching to Intel hardware: hop over to team blue and leave some tech debt at the door. If you use pandas, "Modin" data frames are supposed to be faster on Intel hardware without requiring major rewrites.
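"Drop-in" here really does mean one changed import (a sketch; assumes `pip install "modin[ray]"` or another supported engine, and a local `data.csv`):

```python
# Swap this single import and keep the rest of your pandas code unchanged.
# Modin parallelizes the same DataFrame API across all cores.
import modin.pandas as pd  # instead of: import pandas as pd

df = pd.read_csv("data.csv")
print(df.describe())
```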

On CPU alone I saw HUGE speedups in my testing yesterday with cognitivecomputations/dolphin-2.9.4-llama3.1-8b-fp16 for a few prompts. I'm working on synthetic data generation, so speedy inference is huge.
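For reference, running a model like that through OpenVINO via optimum-intel is only a few lines (a sketch, assuming `pip install "optimum[openvino]"`; `export=True` converts the checkpoint to IR on the fly, and the prompt is just illustrative):

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_id = "cognitivecomputations/dolphin-2.9.4-llama3.1-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the Transformers checkpoint to OpenVINO IR at load time;
# call model.save_pretrained(...) afterwards to reuse the IR without re-exporting.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Write three synthetic FAQ entries:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```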
