
🩺 PointDetectCount-Qwen2.5-VL-7B-LoRA

Model: SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA
Base model: Qwen/Qwen2.5-VL-7B-Instruct
Library: peft (LoRA)
Paper: arXiv:2505.16647
Code: github.com/simula/PointDetectCount
Dataset: SimulaMet/MedMultiPoints


πŸ“Œ Model Summary

PointDetectCount-Qwen2.5-VL-7B-LoRA is a multi-task medical vision-language model fine-tuned using LoRA on top of Qwen2.5-VL-7B-Instruct, a vision-language instruction-following model. This model performs pointing (localization), bounding box detection, and object counting on medical images using natural language prompts and structured JSON outputs.

It is trained on the MedMultiPoints dataset, a multimodal collection of endoscopic and microscopic images with clinical annotations.


🧠 Intended Uses

  • Medical image localization: Predict spatial locations (points/bounding boxes) of anatomical/clinical findings.
  • Object counting: Estimate the number of objects such as polyps, clusters, or cells in medical images.
  • Instruction-tuned VQA: Accepts natural-language queries for multimodal image understanding.

This model is designed for research purposes, particularly in medical vision-language modeling, and should not be used directly for clinical diagnosis.
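
The exact prompt templates are defined in the training code; purely as an illustration (the wording below is hypothetical), queries for the three tasks could look like this:

# Hypothetical prompts for the three tasks (not the exact templates used in training)
prompts = {
    "pointing":  "Point to each polyp in the image. Answer in JSON with a 'points' list.",
    "detection": "Return a bounding box for each polyp in the image as a JSON 'bbox' list.",
    "counting":  "How many polyps are visible in the image? Answer in JSON with a 'count' field.",
}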


πŸš€ How to Use

import torch
from PIL import Image

from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the processor and base vision-language model, then attach the LoRA adapter
processor = AutoProcessor.from_pretrained(base_model_id)
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA")

image = Image.open("example.jpg").convert("RGB")
prompt = "Return bounding boxes for each polyp in the image and the total count."

# Chat-format the prompt so the processor inserts the image placeholder tokens
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

decoded = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(decoded)
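
Because the model is trained to answer with a JSON object (see Training Details below), the decoded text is usually parsed before use. The helper below, which continues from the snippet above, is a minimal defensive parser; the regex fallback is an illustration rather than part of the released code.

import json
import re

def parse_prediction(generated_text):
    # Grab the first {...} span in the reply and parse it; return None on malformed output
    match = re.search(r"\{.*\}", generated_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))  # expected keys: "bbox", "points", "count"
    except json.JSONDecodeError:
        return None

prediction = parse_prediction(decoded)
print(prediction)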

πŸ“Š Training Details

  • Fine-tuning method: LoRA (rank=16); see the configuration sketch after this list
  • Frozen components: Vision encoder (ViT)
  • Trained components: LLM layers (excluding final LM head)
  • Loss function: Language modeling loss (cross-entropy over tokens)
  • Format: Instruction → JSON response ({"bbox": [...], "count": n, "points": [...]})
  • Hardware: Single NVIDIA A100 (80GB)
  • Epochs: 5
  • Batch size: 4 (gradient accumulation used)
  • Learning rate: 2e-4
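
The snippet below is a rough reconstruction of this setup using peft, not the exact training script: the target modules, lora_alpha, and dropout values are assumptions, while the rank and the frozen vision encoder follow the list above.

import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision encoder (ViT); parameter names containing "visual" belong to it
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False

lora_config = LoraConfig(
    r=16,                                                     # rank, as stated above
    lora_alpha=32,                                            # assumption
    lora_dropout=0.05,                                        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",                                    # the LM head itself receives no adapter
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()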

πŸ“ Repository Structure

  • create_datasetJSON.py: Converts raw annotations into the instruction-response format (an example record is sketched after this list)
  • evaluate_qwen.py: Parses and evaluates model outputs vs. ground truth
  • MedMultiPoints-images/: Folder containing the training/validation images
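
For orientation, a single instruction-response record produced by create_datasetJSON.py might look roughly like the sketch below. The field names, coordinate conventions, and file path are illustrative assumptions; see the script for the exact schema.

# Hypothetical training record (field names and values are illustrative, not the exact schema)
example_record = {
    "image": "MedMultiPoints-images/example_0001.jpg",
    "instruction": "Return bounding boxes for each polyp in the image and the total count.",
    "response": {
        "bbox": [[112, 87, 245, 198]],  # assumed [x_min, y_min, x_max, y_max] per object
        "points": [[178, 142]],         # assumed one [x, y] point per object
        "count": 1,
    },
}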

πŸ§ͺ Evaluation

Each model output is parsed to extract:

  • Bounding box coordinates
  • Point coordinates
  • Object count

The parsed outputs are compared against the ground truth for each modality (GI tract, sperm, clusters, etc.). Performance is measured with precision/recall for detection, mean absolute error (MAE) for counting, and proximity-based scores for pointing.
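
As an illustration of the counting and pointing metrics, the sketch below computes counting MAE and a simple point hit rate (a ground-truth point counts as hit if any predicted point falls within a pixel threshold). The threshold value and function names are assumptions for illustration; evaluate_qwen.py defines the exact procedure.

import numpy as np

def counting_mae(pred_counts, true_counts):
    # Mean absolute error between predicted and ground-truth object counts
    return float(np.mean(np.abs(np.array(pred_counts) - np.array(true_counts))))

def point_hit_rate(pred_points, true_points, threshold=25.0):
    # Fraction of ground-truth points with at least one predicted point within `threshold` pixels
    hits = sum(
        1
        for tx, ty in true_points
        if any(np.hypot(px - tx, py - ty) <= threshold for px, py in pred_points)
    )
    return hits / max(len(true_points), 1)

print(counting_mae([3, 5, 2], [3, 4, 2]))          # 0.333...
print(point_hit_rate([(100, 120)], [(104, 118)]))  # 1.0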


πŸ›‘ Limitations

  • Trained only on limited domains (GI endoscopy, microscopy).
  • Not certified for real-world clinical use.
  • Output format depends on correct JSON generation; parsing may fail on malformed outputs.

πŸ“š Citation

@article{Gautam2025May,
  author = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, Pål},
  title = {Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models},
  journal = {arXiv},
  year = {2025},
  month = {may},
  eprint = {2505.16647},
  doi = {10.48550/arXiv.2505.16647}
}

🀝 Acknowledgements

Developed by researchers at SimulaMet, Simula Research Laboratory, and OsloMet.
Part of ongoing efforts to enhance instruction-tuned medical VLMs for robust multimodal reasoning.
