dipta007's Collections
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Paper • 2401.10208 • Published • 1
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Paper • 2305.11172 • Published • 1
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Paper • 2302.00402 • Published
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 7
Unified Model for Image, Video, Audio and Language Tasks
Paper • 2307.16184 • Published • 14
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Paper • 2307.13721 • Published
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Paper • 2309.03895 • Published • 13
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Paper • 2312.14238 • Published • 18
MMBench: Is Your Multi-modal Model an All-around Player?
Paper • 2307.06281 • Published • 5
GPT4All: An Ecosystem of Open Source Compressed Language Models
Paper • 2311.04931 • Published • 20
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 39
nvidia/NVLM-D-72B
Image-Text-to-Text • Updated • 9.88k • 758
Qwen/Qwen2-VL-72B-Instruct-AWQ
Image-Text-to-Text • Updated • 48.2k • 40
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • Updated • 2.44M • 981
Qwen/Qwen2-VL-72B-Instruct
Image-Text-to-Text • Updated • 90.3k • 233
HuggingFaceM4/Idefics3-8B-Llama3
Image-Text-to-Text • Updated • 18.7k • 257
mistralai/Pixtral-12B-2409
Image-Text-to-Text • Updated • 552
OpenGVLab/InternVL2-8B
Image-Text-to-Text • Updated • 38.7k • 161
OpenGVLab/InternVL2-4B
Image-Text-to-Text • Updated • 34.2k • 48
OpenGVLab/InternVL2-Llama3-76B
Image-Text-to-Text • Updated • 66.8k • 208
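The model entries above are all tagged Image-Text-to-Text on the Hub. As a minimal sketch of how one of the listed checkpoints might be used (assuming a recent transformers release with Qwen2-VL support and torch installed; the local image path and prompt below are placeholders, not part of the collection):

```python
# Minimal sketch: querying one of the listed Image-Text-to-Text checkpoints.
# Assumes transformers >= 4.45 (Qwen2-VL support) and torch are installed.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # taken from the model list above
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Chat-style prompt with one image slot; "photo.jpg" is a placeholder file.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image = Image.open("photo.jpg")

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```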