Working & tested quants for Qwen2.5 VL 7B.
Made using QuantBench!
Get QuantBench on GitHub:
https://github.com/Independent-AI-Labs/local-super-agents/tree/main/quantbench
The models have been tested on the latest llama.cpp, built with CLIP hardware acceleration manually enabled!
Consult the following post for more details: https://github.com/ggml-org/llama.cpp/issues/11483#issuecomment-2676422772
For now, only single CLI calls are supported:
llama-qwen2vl-cli -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image." --image ~/Pictures/test_small.png
We're working on a wrapper API solution until multimodal support is added back to llama.cpp.
The API will be published here: https://github.com/Independent-AI-Labs/local-super-agents
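In the meantime, the CLI can be scripted directly. Below is a minimal bash sketch, not the upcoming API: the `describe_image` helper is purely illustrative, and the GGUF paths are placeholders taken from the example command above.

```bash
#!/usr/bin/env bash
# Illustrative wrapper around llama-qwen2vl-cli (placeholder paths - adjust to your setup).
MODEL=~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf
MMPROJ=~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf

describe_image() {
    local image="$1"
    local prompt="${2:-Describe the image.}"
    llama-qwen2vl-cli -m "$MODEL" --mmproj "$MMPROJ" \
        --n_gpu_layers 9999 -p "$prompt" --image "$image"
}

# Example call:
describe_image ~/Pictures/test_small.png "Extract all textual information from the image."
```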
Let us know if you need a specific quant!
Benchmarking Update:
The latest llama.cpp main looks stable with Vulkan CLIP and any model we've thrown at it so far. Some preliminary insights:
1200x1200 is the maximum image size you can encode with 16 GB of VRAM; clip.cpp does not seem to support multi-GPU Vulkan yet.
Larger images will hit an OOM, so make sure to pre-process them accordingly!
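One way to pre-process is to cap the longer side before the call. A rough sketch, assuming ImageMagick is installed and using placeholder filenames (the 1200x1200 cap is the limit observed above):

```bash
# Shrink only images larger than 1200x1200 (the '>' geometry flag), preserving aspect ratio.
convert ~/Pictures/test_large.png -resize '1200x1200>' ~/Pictures/test_small.png
```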
A 4060 Ti-class GPU delivers 20-30 t/s with the Q8_0 quant and roughly double that with Q4 at 16-32K context.
Batching (multiple prompts) in a single CLI call seems to work fine:
llama-qwen2vl-cli --ctx-size 16000 -n 16000 -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image in detail. Extract all textual information from it. Output as detailed JSON." -p "Analyze the image." --image ~/Pictures/test_small.png --image ~/Pictures/test_small.png
Output quality looks very promising! We'll release all of the benchmark code when ready, so the process can be streamlined for other models.
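Until then, here is a rough sketch for building one batched call over a whole folder of images, assuming bash, a placeholder ~/Pictures/batch/ directory, and that repeated -p/--image flags pair up in order as in the command above:

```bash
# Collect one -p/--image pair per PNG in the folder, then run a single batched call.
ARGS=()
for img in ~/Pictures/batch/*.png; do
    ARGS+=(-p "Describe the image in detail." --image "$img")
done

llama-qwen2vl-cli --ctx-size 16000 -n 16000 \
    -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf \
    --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf \
    --n_gpu_layers 9999 "${ARGS[@]}"
```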
Base model: Qwen/Qwen2.5-VL-7B-Instruct