---
pipeline_tag: image-text-to-text
---
# Skywork-R1V
## 🌐 [Homepage](#) | 📖 [Technical Report](https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf) | 💻 [GitHub](https://github.com/SkyworkAI/Skywork-R1V)
---
## 1. Model Introduction
| Model Name | Vision Encoder | Language Model | HF Link |
| ---------------------- | -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ------------ |
| Skywork-R1V-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | [🤗 Link](#) |
| Skywork-R1V-38B-qwq | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | - |
## 2. Features
- **Visual Chain-of-Thought**: Enables multi-step logical reasoning on visual inputs, breaking down complex image-based problems into manageable steps.
- **Mathematical & Scientific Analysis**: Capable of solving visual math problems and interpreting scientific/medical imagery with high precision.
- **Cross-Modal Understanding**: Seamlessly integrates text and images for richer, context-aware comprehension.
## 3. Evaluation
Comparison with larger-scale open-source and closed-source models. QwQ-32B-Preview is a text-only LLM; the remaining models are VLMs.

| Category  | Benchmark       | QwQ-32B-Preview | InternVL-2.5-38B | VILA 1.5-40B | InternVL2-40B | Skywork-R1V-38B |
| --------- | --------------- | --------------- | ---------------- | ------------ | ------------- | --------------- |
| Reasoning | MATH-500        | 90.6            | -                | -            | -             | **94.0**        |
| Reasoning | AIME 2024       | 50.0            | -                | -            | -             | **72.0**        |
| Reasoning | GPQA            | 54.5            | -                | -            | -             | **61.6**        |
| Vision    | MathVista(mini) | -               | 71.9             | 49.5         | 63.7          | 67.5            |
| Vision    | MMMU(Val)       | -               | 63.9             | 55.1         | 55.2          | 69.0            |
Evaluation results of state-of-the-art LLMs and VLMs. All scores are pass@1; the Vision column indicates whether the model accepts image input. MATH-500, AIME 2024, and GPQA are reasoning benchmarks; MathVista(mini) and MMMU(Val) are vision benchmarks.

| Model                   | Vision | MATH-500 | AIME 2024 | GPQA | MathVista(mini) | MMMU(Val) |
| ----------------------- | ------ | -------- | --------- | ---- | --------------- | --------- |
| Qwen2.5-72B-Instruct    | ❌     | 80.0     | 23.3      | 49.0 | -               | -         |
| Deepseek V3             | ❌     | 90.2     | 39.2      | 59.1 | -               | -         |
| Deepseek R1             | ❌     | 97.3     | 79.8      | 71.5 | -               | -         |
| Claude 3.5 Sonnet       | ✅     | 78.3     | 16.0      | 65.0 | 65.3            | 66.4      |
| GPT-4o                  | ✅     | 74.6     | 9.3       | 49.9 | 63.8            | 69.1      |
| Kimi k1.5               | ✅     | 96.2     | 77.5      | -    | 74.9            | 70.0      |
| Qwen2.5-VL-72B-Instruct | ✅     | -        | -         | -    | 74.8            | 70.2      |
| LLaVA-Onevision-72B     | ✅     | -        | -         | -    | 67.5            | 56.8      |
| InternVL2-Llama3-76B    | ✅     | -        | -         | -    | 65.5            | 62.7      |
| InternVL2.5-78B         | ✅     | -        | -         | -    | 72.3            | 70.1      |
| Skywork-R1V-38B         | ✅     | 94.0     | 72.0      | 61.6 | 67.5            | 69.0      |
---
## 4. Usage
### 1. Clone the Repository
```shell
git clone https://github.com/SkyworkAI/Skywork-R1V.git
cd Skywork-R1V/inference
```
### 2. Set Up the Environment
```shell
conda create -n r1-v python=3.10
conda activate r1-v
bash setup.sh
```
### 3. Run the Inference Script
```shell
CUDA_VISIBLE_DEVICES="0,1" python inference_with_transformers.py \
--model_path path \
--image_paths image1_path \
--question "your question"
```
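If you prefer to call the model from Python instead of the provided script, the snippet below is a minimal sketch using the standard 🤗 Transformers remote-code loading pattern. The model id `Skywork/Skywork-R1V-38B`, the dtype, and the device placement are assumptions for illustration; `inference_with_transformers.py` in the repository remains the authoritative reference for image preprocessing and chat formatting.
```python
# Minimal sketch: load Skywork-R1V with Hugging Face Transformers.
# Assumes the checkpoint ships remote modeling code, as InternVL-style
# vision-language models typically do. The exact chat/inference API may
# differ; see inference_with_transformers.py in the repository.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "Skywork/Skywork-R1V-38B"  # or a local checkpoint directory

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,   # 38B model: bf16 and multiple GPUs are advisable
    trust_remote_code=True,
    device_map="auto",            # spread layers across the visible GPUs
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```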
---
## 5. Citation
If you use Skywork-R1V in your research, please cite:
```bibtex
@article{skywork2025r1v,
  title   = {Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought},
  author  = {Yi Peng and Chris and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
  year    = {2025},
  journal = {https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf},
  url     = {https://huggingface.co/Skywork/Skywork-R1V-38B}
}
```
*This project is released under an open-source license.*