---
library_name: transformers
tags: []
---
# Model Card for Video-LLaVA
## Model Details
**Model type:**
Video-LLaVA is an open-source multimodal model trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.
**Base LLM:** [lmsys/vicuna-13b-v1.5](https://huggingface.co/lmsys/vicuna-13b-v1.5)
**Model Description:**
The model can reason jointly over interleaved images and videos, despite the absence of image-video pairs in the training dataset. Video-LLaVA uses an encoder trained for a unified visual representation, aligning image and video features before projecting them into the language model.
Extensive experiments demonstrate the complementarity of the two modalities, with Video-LLaVA showing clear advantages over models designed specifically for either images or videos.
*Video-LLaVA example. Taken from the original paper.*
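The split between the unified visual encoder and the language backbone is visible directly in the configuration of the Hugging Face checkpoint used below. This is a minimal inspection sketch, not part of the original card; field names follow the `transformers` `VideoLlavaConfig`, and printed values may vary with the library version.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

# The unified visual encoder shared by images and videos
print(config.vision_config.model_type, config.vision_config.hidden_size)
# The Vicuna (LLaMA-based) language backbone that the aligned visual features are projected into
print(config.text_config.model_type, config.text_config.hidden_size)
```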
**Paper or resources for more information:**
https://github.com/PKU-YuanGroup/Video-LLaVA
## Training Dataset
- The image pretraining dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA).
- The image instruction-tuning dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA).
- The video pretraining dataset is from [Valley](https://github.com/RupertLuo/Valley).
- The video instruction-tuning dataset is from [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT).
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from PIL import Image
import requests
import numpy as np
import av
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.
    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
prompt = "USER: