---
pipeline_tag: image-text-to-text
---
# Skywork-R1V
## 🌐 [Homepage](#) | 📖 [Technical Report](https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf) | 💻 [GitHub](https://github.com/SkyworkAI/Skywork-R1V)
---
## 1. Model Introduction
| Model Name | Vision Encoder | Language Model | HF Link |
| ---------------------- | -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ------------ |
| Skywork-R1V-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | [🤗 Link](#) |
| Skywork-R1V-38B-qwq | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | - |
## 2. Features
- **Visual Chain-of-Thought**: Enables multi-step logical reasoning on visual inputs, breaking down complex image-based problems into manageable steps.
- **Mathematical & Scientific Analysis**: Capable of solving visual math problems and interpreting scientific/medical imagery with high precision.
- **Cross-Modal Understanding**: Seamlessly integrates text and images for richer, context-aware comprehension.
## 3. Evaluation
Comparison with larger-scale open-source and closed-source models. QwQ-32B-Preview is a text-only LLM; the remaining models are VLMs.

| Category  | Benchmark       | QwQ-32B-Preview | InternVL-2.5-38B | VILA 1.5-40B | InternVL2-40B | Skywork-R1V-38B |
| --------- | --------------- | --------------- | ---------------- | ------------ | ------------- | --------------- |
| Reasoning | MATH-500        | 90.6            | -                | -            | -             | **94.0**        |
| Reasoning | AIME 2024       | 50.0            | -                | -            | -             | **72.0**        |
| Reasoning | GPQA            | 54.5            | -                | -            | -             | **61.6**        |
| Vision    | MathVista(mini) | -               | 71.9             | 49.5         | 63.7          | 67.5            |
| Vision    | MMMU(Val)       | -               | 63.9             | 55.1         | 55.2          | 69.0            |
Evaluation results of state-of-the-art LLMs and VLMs. All scores are pass@1; the Vision column indicates whether the model accepts image input. MATH-500, AIME 2024, and GPQA are reasoning benchmarks; MathVista(mini) and MMMU(Val) are vision benchmarks.

| Model                   | Vision | MATH-500 | AIME 2024 | GPQA | MathVista(mini) | MMMU(Val) |
| ----------------------- | ------ | -------- | --------- | ---- | --------------- | --------- |
| Qwen2.5-72B-Instruct    | ❌     | 80.0     | 23.3      | 49.0 | -               | -         |
| Deepseek V3             | ❌     | 90.2     | 39.2      | 59.1 | -               | -         |
| Deepseek R1             | ❌     | 97.3     | 79.8      | 71.5 | -               | -         |
| Claude 3.5 Sonnet       | ✅     | 78.3     | 16.0      | 65.0 | 65.3            | 66.4      |
| GPT-4o                  | ✅     | 74.6     | 9.3       | 49.9 | 63.8            | 69.1      |
| Kimi k1.5               | ✅     | 96.2     | 77.5      | -    | 74.9            | 70.0      |
| Qwen2.5-VL-72B-Instruct | ✅     | -        | -         | -    | 74.8            | 70.2      |
| LLaVA-Onevision-72B     | ✅     | -        | -         | -    | 67.5            | 56.8      |
| InternVL2-Llama3-76B    | ✅     | -        | -         | -    | 65.5            | 62.7      |
| InternVL2.5-78B         | ✅     | -        | -         | -    | 72.3            | 70.1      |
| Skywork-R1V-38B         | ✅     | 94.0     | 72.0      | 61.6 | 67.5            | 69.0      |
---
## 4. Usage
### 1. Clone the Repository
```shell
git clone https://github.com/SkyworkAI/Skywork-R1V.git
cd Skywork-R1V/inference
```
### 2. Set Up the Environment
```shell
conda create -n r1-v python=3.10
conda activate r1-v
bash setup.sh
```
### 3. Run the Inference Script
```shell
CUDA_VISIBLE_DEVICES="0,1" python inference_with_transformers.py \
--model_path path \
--image_paths image1_path \
--question "your question"
```
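If you prefer to call the model from Python instead of the provided script, the snippet below is a minimal sketch using the standard 🤗 Transformers remote-code loading pattern. The model id `Skywork/Skywork-R1V-38B`, the dtype, and the device placement are assumptions for illustration; `inference_with_transformers.py` in the repository remains the authoritative reference for image preprocessing and chat formatting.
```python
# Minimal sketch: load Skywork-R1V with Hugging Face Transformers.
# Assumes the checkpoint ships remote modeling code, as InternVL-style
# vision-language models typically do. The exact chat/inference API may
# differ; see inference_with_transformers.py in the repository.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "Skywork/Skywork-R1V-38B"  # or a local checkpoint directory

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,   # 38B model: bf16 and multiple GPUs are advisable
    trust_remote_code=True,
    device_map="auto",            # spread layers across the visible GPUs
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```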
---
## 5. Citation
If you use Skywork-R1V in your research, please cite:
```bibtex
@article{skywork2025r1v,
  title   = {Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought},
  author  = {Yi Peng and Chris and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
  year    = {2025},
  journal = {https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf},
  url     = {https://huggingface.co/Skywork/Skywork-R1V-38B}
}
```
*This project is released under an open-source license.*