ZeroGUI: Automating Online GUI Learning at Zero Human Cost
AI & ML interests: Computer Vision
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Paper • 2303.16727 • Published
OpenGVLab/VideoMAEv2-Base
Video Classification • 0.1B • Updated • 15.8k • 7
OpenGVLab/VideoMAEv2-Large
Video Classification • 0.3B • Updated • 5.74k • 1
OpenGVLab/VideoMAEv2-Huge
Video Classification • 0.6B • Updated • 77 • 1
Better than InternVL 2.0

InternVL • 484
⚡ Chat with an AI that understands text and images
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Paper • 2412.05271 • Published • 160
OpenGVLab/InternVL2_5-78B
Image-Text-to-Text • 78B • Updated • 1.17k • 192
OpenGVLab/InternVL2_5-78B-AWQ
Image-Text-to-Text • Updated • 97 • 14
Expanding Performance Boundaries of Open-Source MLLM

Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Paper • 2312.14238 • Published • 20
OpenGVLab/InternViT-6B-224px
Image Feature Extraction • Updated • 180 • 24
OpenGVLab/InternVL-14B-224px
Image Feature Extraction • 14B • Updated • 807 • 35
OpenGVLab/InternVL-Chat-V1-2-Plus
Image-Text-to-Text • 40B • Updated • 50 • 34
Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

InternVideo2

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Paper • 2403.15377 • Published • 27
OpenGVLab/InternVideo2-Chat-8B
Video-Text-to-Text • 8B • Updated • 400 • 23
OpenGVLab/InternVideo2_chat_8B_HD
Video-Text-to-Text • 8B • Updated • 141 • 18
OpenGVLab/InternVideo2_Chat_8B_InternLM2_5
Video-Text-to-Text • 9B • Updated • 49 • 7
State Space Model for Efficient Video Understanding

A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Paper • 2211.05778 • Published
OpenGVLab/internimage_t_1k_224
Image Classification • 0.0B • Updated • 88 • 1
OpenGVLab/internimage_s_1k_224
Image Classification • 0.1B • Updated • 38 • 1
OpenGVLab/internimage_b_1k_224
Image Classification • 0.1B • Updated • 635 • 1

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper • 2504.10479 • Published • 276
OpenGVLab/InternVL3-1B
Image-Text-to-Text • 0.9B • Updated • 69.9k • 65
OpenGVLab/InternVL3-2B
Image-Text-to-Text • 2B • Updated • 85.6k • 27
OpenGVLab/InternVL3-8B
Image-Text-to-Text • 8B • Updated • 338k • 78
[NeurIPS 2024 Spotlight] Parameter-Inverted Image Pyramid Networks

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
Paper • 2501.07783 • Published • 7
OpenGVLab/PIIP
Object Detection • Updated • 5
OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B
Image-Text-to-Text • 7B • Updated • 13
OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B
Image-Text-to-Text • 7B • Updated • 16

OpenGVLab/InternVideo2_5_Chat_8B
Video-Text-to-Text • 8B • Updated • 9.79k • 71
OpenGVLab/InternVL_2_5_HiCo_R16
Video-Text-to-Text • 8B • Updated • 4.86k • 4
OpenGVLab/InternVL_2_5_HiCo_R64
Video-Text-to-Text • 8B • Updated • 557 • 3
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Paper • 2501.12386 • Published • 1
Faster and more powerful VideoChat.

OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448
Video-Text-to-Text • 2B • Updated • 1.46k • 22
OpenGVLab/VideoChat-Flash-Qwen2-7B_res224
Video-Text-to-Text • 8B • Updated • 70 • 6
OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
Video-Text-to-Text • 8B • Updated • 2.34k • 12
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Paper • 2501.00574 • Published • 6
Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 80
OpenGVLab/InternVL2_5-78B-MPO
Image-Text-to-Text • 78B • Updated • 397 • 54
OpenGVLab/InternVL2_5-38B-MPO
Image-Text-to-Text • 38B • Updated • 16k • 20
OpenGVLab/InternVL2_5-26B-MPO
Image-Text-to-Text • 26B • Updated • 518 • 14
A Pioneering Open-Source Alternative to GPT-4V

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 58
OpenGVLab/InternVL-Chat-V1-5
Image-Text-to-Text • 26B • Updated • 6.13k • 410
OpenGVLab/InternViT-6B-448px-V1-5
Image Feature Extraction • 6B • Updated • 613 • 78
OpenGVLab/InternViT-300M-448px
Image Feature Extraction • 0.3B • Updated • 73.8k • 55
A Pioneering Monolithic MLLM

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Paper • 2410.08202 • Published • 4
OpenGVLab/Mono-InternVL-2B
Image-Text-to-Text • 3B • Updated • 26.8k • 33
OpenGVLab/Mono-InternVL-2B-S1-1
Image-Text-to-Text • 3B • Updated • 12
OpenGVLab/Mono-InternVL-2B-S1-2
Image-Text-to-Text • 3B • Updated • 15
Adaptation Models for Specific Domains

OpenGVLab/Mini-InternVL2-4B-DA-DriveLM
Image-Text-to-Text • 4B • Updated • 29 • 3
OpenGVLab/Mini-InternVL2-4B-DA-Medical
Image-Text-to-Text • 4B • Updated • 53 • 5
OpenGVLab/Mini-InternVL2-4B-DA-BDD
Image-Text-to-Text • 4B • Updated • 50
OpenGVLab/Mini-InternVL2-2B-DA-DriveLM
Image-Text-to-Text • 2B • Updated • 36
Chat-Centric Video Understanding
A Large-Scale Video-Text Dataset
Improved Baselines with Pyramid Vision Transformer