---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- THUdyh/Oryx-SFT-Data
language:
- en
- zh
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: oryx
---
# Oryx-1.5-7B
## Model Summary
The Oryx-1.5 models are 7B- and 32B-parameter models trained on [Oryx-SFT-Data](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data). They are built on the Qwen2.5 language model and have a context window of 32K tokens.
Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.
- **Repository:** https://github.com/Oryx-mllm/Oryx
- **Project Page:** https://oryx-mllm.github.io
- **Languages:** English, Chinese
- **Paper:** https://arxiv.org/abs/2409.12961
## Use
We provide a simple generation example for using the model. For more details, please refer to our [GitHub repo](https://github.com/liuzuyan/oryx).
```python
import copy
import sys
import warnings

import numpy as np
import requests
import torch
from decord import VideoReader, cpu
from PIL import Image

from oryx.model.builder import load_pretrained_model
from oryx.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from oryx.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from oryx.conversation import conv_templates, SeparatorStyle


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Decode frames and return (frames, timestamp string, duration in seconds)."""
    if max_frames_num == 0:
        # No frames requested: return one blank frame plus empty timing info
        # so that callers can still unpack three values.
        return np.zeros((1, 336, 336, 3)), "", 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Step between kept frames so that roughly `fps` frames per second are sampled.
    stride = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, stride))
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or sampling is forced): pick max_frames_num frames uniformly.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Timestamp of each sampled frame in seconds, e.g. "0.00s,1.00s,2.00s".
    frame_time = ",".join(f"{i / avg_fps:.2f}s" for i in frame_idx)
    frames = vr.get_batch(frame_idx).asnumpy()
    return frames, frame_time, video_time


# Hugging Face checkpoint ID; replace with the checkpoint you want to load.
pretrained = "THUdyh/Oryx-7B"
model_name = "oryx_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()

video_path = ""  # set this to the path of your video file
max_frames_num = 64  # must be an integer, not a string
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]
video_data = (video, video)  # (images, high-resolution images)
input_data = (video_data, (384, 384), "video")

conv_template = "qwen_1_5"
question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

output_ids = model.generate(
    inputs=input_ids,
    images=input_data[0][0],
    images_highres=input_data[0][1],
    modalities=input_data[2],  # the modality tag is stored in input_data, not video_data
    do_sample=False,
    temperature=0,
    max_new_tokens=128,
    use_cache=True,
)

text_outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(text_outputs)
```
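The frame selection rule used by `load_video` in the example above can be checked without decoding a video. The following is a minimal sketch that depends only on NumPy; the helper name `sample_frame_indices` and its parameter names are hypothetical and are not part of the Oryx codebase. It assumes timestamps are derived from the average frame rate.

```python
import numpy as np


def sample_frame_indices(total_frames, avg_fps, target_fps=1, max_frames=64, force_sample=False):
    """Hypothetical helper: return (frame indices, timestamps in seconds)."""
    # Keep roughly `target_fps` frames per second of video.
    stride = max(1, round(avg_fps / target_fps))
    frame_idx = list(range(0, total_frames, stride))
    if len(frame_idx) > max_frames or force_sample:
        # Fall back to max_frames indices spread uniformly over the whole video.
        frame_idx = np.linspace(0, total_frames - 1, max_frames, dtype=int).tolist()
    frame_time = [i / avg_fps for i in frame_idx]
    return frame_idx, frame_time


# A 10-second clip at 30 fps, capped at 4 frames.
idx, times = sample_frame_indices(total_frames=300, avg_fps=30.0, max_frames=4)
print(idx)
print(",".join(f"{t:.2f}s" for t in times))
```

When the one-frame-per-second pass already fits within the frame budget and `force_sample` is off, the strided indices are returned unchanged; otherwise the uniform fallback guarantees exactly `max_frames` frames regardless of video length.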
### Results
#### General Video Benchmark
<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/hKfOK0u3OXly_u4hgGLDB.png" alt="image/png" style="zoom: 33%;" />
#### Long-Form Video Understanding
<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/Xweq9f4OWkqeVc_FZIMuO.png" alt="image/png" style="zoom:33%;" />
#### Common Image Benchmark
<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/ybfroSA9WaKXtJbP_9cLR.png" alt="image/png" style="zoom:33%;" />
#### 3D Spatial Understanding
<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/5v8ACRzAoKS0FbcVBXZhT.png" alt="image/png" style="zoom:33%;" />
### Model Architecture
- **Architecture:** Pre-trained [Oryx-ViT](https://huggingface.co/THUdyh/Oryx-ViT) + Qwen2.5-7B
- **Data:** A mixture of 1.2M image and video samples
- **Precision:** BFloat16
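
As a rough guide to what BFloat16 precision implies for memory, the sketch below estimates the size of the weights alone from a nominal parameter count at 2 bytes per parameter. The function name `weight_memory_gib` is hypothetical, and the figures use the nominal 7B and 32B counts; the exact totals (including the vision encoder), activations, and the KV cache are not accounted for, so treat the result as a lower bound.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate weight storage in GiB; BFloat16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 2**30


# Nominal parameter counts taken from the model names (assumption: exact counts differ).
for name, params in [("7B", 7e9), ("32B", 32e9)]:
    print(f"{name}: about {weight_memory_gib(params):.1f} GiB of weights")
```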
#### Hardware & Software
- **Hardware:** 64 × NVIDIA Tesla A100
- **Orchestration:** Hugging Face Trainer
- **Code:** PyTorch
## Citation
```bibtex
@article{liu2024oryx,
title={Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution},
author={Liu, Zuyan and Dong, Yuhao and Liu, Ziwei and Hu, Winston and Lu, Jiwen and Rao, Yongming},
journal={arXiv preprint arXiv:2409.12961},
year={2024}
}
```