---
pipeline_tag: image-text-to-text
---

# Skywork-R1V
## 🌐 [Homepage](#) | 📖 [Technical Report](https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf) | 💻 [GitHub](https://github.com/SkyworkAI/Skywork-R1V)

---

## 1. Model Introduction

| Model Name | Vision Encoder | Language Model | HF Link |
| --- | --- | --- | --- |
| Skywork-R1V-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | [🤗 Link](#) |
| Skywork-R1V-38B-qwq | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | - |

## 2. Features

- **Visual Chain-of-Thought**: Performs multi-step logical reasoning on visual inputs, breaking complex image-based problems into manageable steps.
- **Mathematical & Scientific Analysis**: Solves visual math problems and interprets scientific and medical imagery.
- **Cross-Modal Understanding**: Integrates text and images for context-aware comprehension.

## 3. Evaluation

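**Comparison with Larger-Scale Open-Source and Closed-Source Models**

| Category | Benchmark | QwQ-32B-Preview (LLM) | InternVL-2.5-38B (VLM) | VILA 1.5-40B (VLM) | InternVL2-40B (VLM) | Skywork-R1V-38B (VLM) |
| --- | --- | --- | --- | --- | --- | --- |
| Reasoning | MATH-500 | 90.6 | - | - | - | 94.0 |
| Reasoning | AIME 2024 | 50.0 | - | - | - | 72.0 |
| Reasoning | GPQA | 54.5 | - | - | - | 61.6 |
| Vision | MathVista(mini) | - | 71.9 | 49.5 | 63.7 | 67.5 |
| Vision | MMMU(Val) | - | 63.9 | 55.1 | 55.2 | 69.0 |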


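**Evaluation results of state-of-the-art LLMs and VLMs**

All scores are pass@1. MATH-500, AIME 2024, and GPQA are reasoning benchmarks; MathVista(mini) and MMMU(Val) are vision benchmarks.

| Model | MATH-500 | AIME 2024 | GPQA | MathVista(mini) | MMMU(Val) |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-72B-Instruct | 80.0 | 23.3 | 49.0 | - | - |
| Deepseek V3 | 90.2 | 39.2 | 59.1 | - | - |
| Deepseek R1 | 97.3 | 79.8 | 71.5 | - | - |
| Claude 3.5 Sonnet | 78.3 | 16.0 | 65.0 | 65.3 | 66.4 |
| GPT-4o | 74.6 | 9.3 | 49.9 | 63.8 | 69.1 |
| Kimi k1.5 | 96.2 | 77.5 | - | 74.9 | 70.0 |
| Qwen2.5-VL-72B-Instruct | - | - | - | 74.8 | 70.2 |
| LLaVA-Onevision-72B | - | - | - | 67.5 | 56.8 |
| InternVL2-Llama3-76B | - | - | - | 65.5 | 62.7 |
| InternVL2.5-78B | - | - | - | 72.3 | 70.1 |
| Skywork-R1V-38B | 94.0 | 72.0 | 61.6 | 67.5 | 69.0 |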
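
For readers unfamiliar with the pass@1 metric reported here, the sketch below shows the common single-sample definition: the percentage of problems whose first generated answer is graded correct. It is an illustration only. The function name `pass_at_1` and the sample grading results are hypothetical, and this README does not state how the evaluation was actually run (for example, whether scores were averaged over several samples).

```python
def pass_at_1(first_attempt_correct):
    """Return pass@1 as a percentage.

    `first_attempt_correct` holds one boolean per problem: True if the
    first sampled answer for that problem was graded correct.
    """
    if not first_attempt_correct:
        raise ValueError("need at least one graded problem")
    # True counts as 1 and False as 0, so sum() is the number solved.
    return 100.0 * sum(first_attempt_correct) / len(first_attempt_correct)


# Hypothetical grading results for five problems (not real benchmark data).
graded = [True, True, False, True, False]
print(f"pass@1 = {pass_at_1(graded):.1f}")
```

Under this definition each problem contributes equally, so a score on a benchmark with few problems moves in coarse steps.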
---

## 4. Usage

### 1. Clone the Repository

```shell
git clone https://github.com/SkyworkAI/Skywork-R1V.git
cd Skywork-R1V/inference
```

### 2. Set Up the Environment

```shell
conda create -n r1-v python=3.10
conda activate r1-v
bash setup.sh
```

### 3. Run the Inference Script

```shell
CUDA_VISIBLE_DEVICES="0,1" python inference_with_transformers.py \
    --model_path path \
    --image_paths image1_path \
    --question "your question"
```

---

## 5. Citation

If you use Skywork-R1V in your research, please cite:

```
@article{skywork2025r1v,
  title   = {Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought},
  author  = {Yi Peng and Chris and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
  year    = {2025},
  journal = {https://github.com/SkyworkAI/Skywork-R1V/blob/main/Skywork_R1V.pdf},
  url     = {https://huggingface.co/Skywork/Skywork-R1V-38B}
}
```

*This project is released under an open-source license.*