πŸ‘” Working & tested quants for Qwen2.5 VL 7B.

Made using QuantBench!

Get QuantBench on GitHub: https://github.com/Independent-AI-Labs/local-super-agents/tree/main/quantbench

The models have been tested on the latest llama.cpp, built with CLIP hardware acceleration manually enabled!

Consult the following post for more details: https://github.com/ggml-org/llama.cpp/issues/11483#issuecomment-2676422772

For now, only single CLI calls are supported:

```
llama-qwen2vl-cli -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image." --image ~/Pictures/test_small.png
```

We're working on a wrapper API solution until multimodal support is added back to llama.cpp.

The API will be published here: https://github.com/Independent-AI-Labs/local-super-agents
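In the meantime, single CLI calls can be scripted directly. Below is a minimal Python sketch of that idea; it is not the published API, and the `describe` helper name and file locations are assumptions based on the command above:

```python
# Minimal sketch: wrap one llama-qwen2vl-cli call from Python.
# Not the published wrapper API -- paths and names are assumptions.
import subprocess
from pathlib import Path

GGUF_DIR = Path.home() / "gguf"  # assumed location of the downloaded quants
MODEL = GGUF_DIR / "Qwen2.5-VL-7B-Instruct-Q4_0.gguf"
MMPROJ = GGUF_DIR / "mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf"

def describe(image_path: str, prompt: str = "Describe the image.") -> str:
    """Spawn a single llama-qwen2vl-cli call and return its stdout."""
    result = subprocess.run(
        [
            "llama-qwen2vl-cli",
            "-m", str(MODEL),
            "--mmproj", str(MMPROJ),
            "--n_gpu_layers", "9999",
            "-p", prompt,
            "--image", str(Path(image_path).expanduser()),
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits with an error
    )
    return result.stdout

if __name__ == "__main__":
    print(describe("~/Pictures/test_small.png"))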

Let us know if you need a specific quant!

πŸ’ͺ Benchmarking Update:

The latest main branch looks stable with Vulkan CLIP and every model we've thrown at it so far. Some preliminary insights:

  • 1200x1200 is the largest image you can encode with 16 GB of VRAM; clip.cpp does not appear to support multi-GPU Vulkan yet.

    Larger images will fail with an out-of-memory error, so make sure to pre-process accordingly (see the resizing sketch after this list)!

  • A 4060 Ti-class GPU delivers 20-30 t/s with Q8_0 and roughly double that with Q4 at 16-32K context.

  • Batching (multiple prompts) in a single CLI call seems to be working fine:

```
llama-qwen2vl-cli --ctx-size 16000 -n 16000 -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image in detail. Extract all textual information from it. Output as detailed JSON." -p "Analyze the image." --image ~/Pictures/test_small.png --image ~/Pictures/test_small.png
```
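For the pre-processing mentioned above, here is a minimal sketch using Pillow that shrinks anything over the 1200x1200 limit we observed. The helper name and file names are placeholders, not part of this repo:

```python
# Minimal pre-processing sketch (assumes Pillow: pip install pillow).
# Downscales any image whose longest edge exceeds 1200 px so CLIP
# encoding fits in ~16 GB of VRAM, per the limit noted above.
from pathlib import Path
from PIL import Image

MAX_SIDE = 1200  # largest edge we could encode on a 16 GB GPU

def fit_for_clip(src: str, dst: str) -> None:
    """Downscale src (aspect ratio preserved) and save the result to dst."""
    with Image.open(Path(src).expanduser()) as img:
        if max(img.size) > MAX_SIDE:
            img.thumbnail((MAX_SIDE, MAX_SIDE))  # resizes in place
        img.save(Path(dst).expanduser())

# Hypothetical file names -- substitute your own.
fit_for_clip("~/Pictures/large_scan.png", "~/Pictures/test_small.png")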

Output quality looks very promising! We'll release all of the benchmark code when ready, so the process can be streamlined for other models.
