OpenGVLab/InternVL_2_5_HiCo_R16


lixinhao committed 15 days ago

Commit fc5bc4e · verified · 1 parent: bb71951

Update README.md

Files changed (1)

README.md +1 -1


@@ -70,7 +70,7 @@ model-index:
 [\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386)
 <!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->
- InternVideo2.5 is a video multimodal large language model (MLLM, built upoon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and capture long-form temporal structures. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is a variant of InternVideo2.5's ablation experiment, built on HiCo technology only (R16 means 16 tokens per frame).

+ InternVideo2.5 is a video multimodal large language model (MLLM, built upoon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and capture long-form temporal structures. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is a variant of InternVideo2.5's ablation experiment, built on HiCo technology only (**R16 means 16 tokens per frame**).