metadata

library_name: transformers
tags:
  - comics
license: cc-by-sa-4.0
datasets:
  - VLR-CVC/ComicsPAP
language:
  - en
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct

LoRA Fine-Tune of Qwen2.5-VL-7B-Instruct on the ComicsPAP dataset

Qwen2.5-VL-7B-Instruct fine-tuned simultaneously on all five tasks of the ComicsPAP dataset. Training used the AdamW optimizer with a constant learning rate of 2e-4 for 5k steps at an effective batch size of 128. The LoRA configuration used α = 16, a dropout rate of 0.05, and a rank r = 8.
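As a rough illustration, the hyperparameters above can be collected in one place and turned into keyword arguments. This is only a sketch, not the authors' training script: `FineTuneSetup` is a hypothetical helper, the key names are chosen to match `peft.LoraConfig` and `transformers.TrainingArguments`, and the per-device batch size and device count are assumptions, since the card only states the effective batch size. The LoRA target modules are not stated in the card and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Hyperparameters as stated in this card (hypothetical container)."""

    learning_rate: float = 2e-4       # constant schedule, AdamW optimizer
    max_steps: int = 5_000
    effective_batch_size: int = 128
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r

    @property
    def samples_seen(self) -> int:
        # Training examples consumed over the whole run.
        return self.max_steps * self.effective_batch_size

    def lora_kwargs(self) -> dict:
        # Key names follow peft.LoraConfig; target modules are not given
        # in the card, so they are deliberately omitted here.
        return {
            "r": self.lora_r,
            "lora_alpha": self.lora_alpha,
            "lora_dropout": self.lora_dropout,
        }

    def trainer_kwargs(self, per_device_batch: int, num_devices: int) -> dict:
        # per_device_batch and num_devices are assumptions for illustration;
        # gradient accumulation makes up the rest of the effective batch.
        per_step = per_device_batch * num_devices
        if self.effective_batch_size % per_step != 0:
            raise ValueError("effective batch size must be divisible by "
                             "per_device_batch * num_devices")
        return {
            "learning_rate": self.learning_rate,
            "lr_scheduler_type": "constant",
            "max_steps": self.max_steps,
            "per_device_train_batch_size": per_device_batch,
            "gradient_accumulation_steps": self.effective_batch_size // per_step,
        }


setup = FineTuneSetup()
print(setup.lora_scaling)    # alpha / r
print(setup.samples_seen)    # steps * effective batch size
print(setup.trainer_kwargs(per_device_batch=4, num_devices=8))
```

With α = 16 and r = 8 the low-rank update is scaled by 2, and 5k steps at an effective batch size of 128 correspond to 640,000 training examples seen.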

Results

Model	Repo	Sequence Filling (%)	Character Coherence (%)	Visual Closure (%)	Text Closure (%)	Caption Relevance (%)	Total (%)
Random		20.22	50.00	14.41	25.00	25.00	24.30
Qwen2.5-VL-3B (Zero-Shot)	Qwen/Qwen2.5-VL-3B-Instruct	27.48	48.95	21.33	27.41	32.82	29.61
Qwen2.5-VL-7B (Zero-Shot)	Qwen/Qwen2.5-VL-7B-Instruct	30.53	54.55	22.00	37.45	40.84	34.91
Qwen2.5-VL-72B (Zero-Shot)	Qwen/Qwen2.5-VL-72B-Instruct	46.88	53.84	23.66	55.60	38.17	41.27
Qwen2.5-VL-3B (LoRA Fine-Tuned)	VLR-CVC/Qwen2.5-VL-3B-Instruct-lora-ComicsPAP	62.21	93.01	42.33	63.71	35.49	55.55
Qwen2.5-VL-7B (LoRA Fine-Tuned)	VLR-CVC/Qwen2.5-VL-7B-Instruct-lora-ComicsPAP	69.08	93.01	42.00	74.90	49.62	62.31
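To make the effect of fine-tuning easier to read off, the short sketch below computes the absolute gain in percentage points of the fine-tuned 7B model over its zero-shot baseline. The numbers are copied from the two 7B rows of the table above; the variable names are illustrative only.

```python
# Column order follows the results table.
TASKS = [
    "Sequence Filling",
    "Character Coherence",
    "Visual Closure",
    "Text Closure",
    "Caption Relevance",
    "Total",
]

# Accuracy (%) copied from the table rows for the 7B model.
ZERO_SHOT_7B = [30.53, 54.55, 22.00, 37.45, 40.84, 34.91]
FINE_TUNED_7B = [69.08, 93.01, 42.00, 74.90, 49.62, 62.31]

# Absolute gain in percentage points, rounded to the table's precision.
gains = {
    task: round(tuned - base, 2)
    for task, base, tuned in zip(TASKS, ZERO_SHOT_7B, FINE_TUNED_7B)
}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} points")
```

Caption Relevance shows the smallest gain (8.78 points), while Sequence Filling and Character Coherence each improve by more than 38 points.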

Citation

BibTeX:

@misc{vivoli2025comicspap,
      title={ComicsPAP: understanding comic strips by picking the correct panel}, 
      author={Emanuele Vivoli and Artemis Llabrés and Mohamed Ali Souibgui and Marco Bertini and Ernest Valveny Llobet and Dimosthenis Karatzas},
      year={2025},
      eprint={2503.08561},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.08561}, 
}

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}