
🩺 PointDetectCount-Qwen2.5-VL-7B-LoRA

Model: SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA
Base model: Qwen/Qwen2.5-VL-7B-Instruct
Library: peft (LoRA)
Paper: arXiv:2505.16647
Code: github.com/simula/PointDetectCount
Dataset: SimulaMet/MedMultiPoints


πŸ“Œ Model Summary

PointDetectCount-Qwen2.5-VL-7B-LoRA is a multi-task medical vision-language model fine-tuned using LoRA on top of Qwen2.5-VL-7B-Instruct, a vision-language instruction-following model. This model performs pointing (localization), bounding box detection, and object counting on medical images using natural language prompts and structured JSON outputs.

It is trained on the MedMultiPoints dataset, a multimodal collection of endoscopic and microscopic images with clinical annotations.


🧠 Intended Uses

  • Medical image localization: Predict spatial locations (points/bounding boxes) of anatomical/clinical findings.
  • Object counting: Estimate the number of objects such as polyps, clusters, or cells in medical images.
  • Instruction-tuned VQA: Accepts natural-language queries for multimodal image understanding.

This model is designed for research purposes, particularly in medical vision-language modeling, and should not be used directly for clinical diagnosis.
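
The exact prompt templates are defined in the training code; purely as an illustration (the wording below is hypothetical), queries for the three tasks could look like this:

# Hypothetical prompts for the three tasks (not the exact templates used in training)
prompts = {
    "pointing":  "Point to each polyp in the image. Answer in JSON with a 'points' list.",
    "detection": "Return a bounding box for each polyp in the image as a JSON 'bbox' list.",
    "counting":  "How many polyps are visible in the image? Answer in JSON with a 'count' field.",
}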


πŸš€ How to Use

import torch
from PIL import Image

from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the processor and base vision-language model, then attach the LoRA adapter
processor = AutoProcessor.from_pretrained(base_model_id)
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA")

image = Image.open("example.jpg").convert("RGB")
prompt = "Return bounding boxes for each polyp in the image and the total count."

# Chat-format the prompt so the processor inserts the image placeholder tokens
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

decoded = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(decoded)
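
Because the model is trained to answer with a JSON object (see Training Details below), the decoded text is usually parsed before use. The helper below, which continues from the snippet above, is a minimal defensive parser; the regex fallback is an illustration rather than part of the released code.

import json
import re

def parse_prediction(generated_text):
    # Grab the first {...} span in the reply and parse it; return None on malformed output
    match = re.search(r"\{.*\}", generated_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))  # expected keys: "bbox", "points", "count"
    except json.JSONDecodeError:
        return None

prediction = parse_prediction(decoded)
print(prediction)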

πŸ“Š Training Details

  • Fine-tuning method: LoRA (rank=16); see the configuration sketch after this list
  • Frozen components: Vision encoder (ViT)
  • Trained components: LLM layers (excluding final LM head)
  • Loss function: Language modeling loss (cross-entropy over tokens)
  • Format: Instruction → JSON response ({"bbox": [...], "count": n, "points": [...]})
  • Hardware: Single NVIDIA A100 (80GB)
  • Epochs: 5
  • Batch size: 4 (gradient accumulation used)
  • Learning rate: 2e-4
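
The snippet below is a rough reconstruction of this setup using peft, not the exact training script: the target modules, lora_alpha, and dropout values are assumptions, while the rank and the frozen vision encoder follow the list above.

import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision encoder (ViT); parameter names containing "visual" belong to it
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False

lora_config = LoraConfig(
    r=16,                                                     # rank, as stated above
    lora_alpha=32,                                            # assumption
    lora_dropout=0.05,                                        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",                                    # the LM head itself receives no adapter
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()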

πŸ“ Repository Structure

  • create_datasetJSON.py: Converts raw annotations into the instruction-response format (an example record is sketched after this list)
  • evaluate_qwen.py: Parses and evaluates model outputs vs. ground truth
  • MedMultiPoints-images/: Folder containing the training/validation images
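
For orientation, a single instruction-response record produced by create_datasetJSON.py might look roughly like the sketch below. The field names, coordinate conventions, and file path are illustrative assumptions; see the script for the exact schema.

# Hypothetical training record (field names and values are illustrative, not the exact schema)
example_record = {
    "image": "MedMultiPoints-images/example_0001.jpg",
    "instruction": "Return bounding boxes for each polyp in the image and the total count.",
    "response": {
        "bbox": [[112, 87, 245, 198]],  # assumed [x_min, y_min, x_max, y_max] per object
        "points": [[178, 142]],         # assumed one [x, y] point per object
        "count": 1,
    },
}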

πŸ§ͺ Evaluation

Each model output is parsed to extract:

  • Bounding box coordinates
  • Point coordinates
  • Object count

The parsed outputs are compared against the ground truth for each modality (GI tract, sperm, clusters, etc.). Performance is measured with precision/recall for detection, mean absolute error (MAE) for counting, and proximity-based scores for pointing.
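
As an illustration of the counting and pointing metrics, the sketch below computes counting MAE and a simple point hit rate (a ground-truth point counts as hit if any predicted point falls within a pixel threshold). The threshold value and function names are assumptions for illustration; evaluate_qwen.py defines the exact procedure.

import numpy as np

def counting_mae(pred_counts, true_counts):
    # Mean absolute error between predicted and ground-truth object counts
    return float(np.mean(np.abs(np.array(pred_counts) - np.array(true_counts))))

def point_hit_rate(pred_points, true_points, threshold=25.0):
    # Fraction of ground-truth points with at least one predicted point within `threshold` pixels
    hits = sum(
        1
        for tx, ty in true_points
        if any(np.hypot(px - tx, py - ty) <= threshold for px, py in pred_points)
    )
    return hits / max(len(true_points), 1)

print(counting_mae([3, 5, 2], [3, 4, 2]))          # 0.333...
print(point_hit_rate([(100, 120)], [(104, 118)]))  # 1.0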


πŸ›‘ Limitations

  • Trained only on limited domains (GI endoscopy, microscopy).
  • Not certified for real-world clinical use.
  • Output format depends on correct JSON generation; parsing may fail on malformed outputs.

πŸ“š Citation

@article{Gautam2025May,
  author = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, Pål},
  title = {Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models},
  journal = {arXiv},
  year = {2025},
  month = {may},
  eprint = {2505.16647},
  doi = {10.48550/arXiv.2505.16647}
}

🀝 Acknowledgements

Developed by researchers at SimulaMet, Simula Research Laboratory, and OsloMet.
Part of ongoing efforts to enhance instruction-tuned medical VLMs for robust multimodal reasoning.
