---
base_model:
- Qwen/Qwen2-VL-2B-Instruct
datasets:
- rp-yu/VPT_Datasets
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
---
# Introducing Visual Perception Token into Multimodal Large Language Model
This repository contains models presented in the paper [Introducing Visual Perception Token into Multimodal Large Language Model](https://arxiv.org/abs/2502.17425). The models use Visual Perception Tokens to enhance the visual perception capabilities of multimodal large language models (MLLMs).

Code: https://github.com/yu-rp/VisualPerceptionToken
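
Below is a minimal usage sketch, assuming the checkpoint loads through the standard Qwen2-VL classes in `transformers` (the base model is Qwen/Qwen2-VL-2B-Instruct). The model ID and image path are placeholders, and the full Visual Perception Token inference pipeline may require the code from the repository linked above.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Placeholder: substitute the actual VPT checkpoint ID from this repository.
model_id = "rp-yu/VPT-checkpoint"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat prompt containing one image and one text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Plain generation; the VPT control flow (re-perception triggered by
# perception tokens) is handled by the repository's inference code.
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```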