---
license: apache-2.0
datasets:
- THUdyh/Oryx-SFT-Data
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
language:
- en
- zh
---
# Oryx-7B

## Model Summary

The Oryx models are 7B/34B-parameter models trained on [Oryx-SFT-Data](https://huggingface.co/datasets/THUdyh/Oryx-SFT-Data), built on the Qwen2.5 language model with a 32K-token context window.

Oryx offers an on-demand solution for seamlessly and efficiently processing visual inputs of arbitrary spatial size and temporal length.

- **Repository:** https://github.com/Oryx-mllm/Oryx
- **Languages:** English, Chinese
- **Paper:** https://arxiv.org/abs/2409.12961
## Use

We provide a simple example of the generation process below. For more details, please refer to our [GitHub repo](https://github.com/liuzuyan/oryx).
```python
from oryx.model.builder import load_pretrained_model
from oryx.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from oryx.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from oryx.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np

def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    # Sample frames at the requested fps; fall back to uniform sampling when the
    # clip would yield more than max_frames_num frames (or when force_sample is set).
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time

# Load the pretrained Oryx model.
pretrained = "THUdyh/Oryx-7B"
model_name = "oryx_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()

# Sample up to 64 frames from the input video and preprocess them.
video_path = ""
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]
video_data = (video, video)
input_data = (video_data, (384, 384), "video")

# Build the prompt with the image token and the chat template.
conv_template = "qwen_1_5"
question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

# Generate and decode the response.
output_ids = model.generate(
    inputs=input_ids,
    images=input_data[0][0],
    images_highres=input_data[0][1],
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=128,
    use_cache=True,
)
text_outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(text_outputs)
```

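The snippet above targets video input. A single image can be fed through essentially the same flow; the sketch below is an illustration only and reuses the names already defined in the example above (`image_processor`, `conv_templates`, `tokenizer_image_token`, `model.generate`). The image preprocessing call and the `images`/`images_highres` arguments are assumptions rather than the confirmed image API, so please check the [GitHub repo](https://github.com/liuzuyan/oryx) for the current image pipeline.

```python
# Hedged sketch: single-image inference, mirroring the video example above.
# The preprocessing call and the images/images_highres arguments are assumptions;
# consult the Oryx GitHub repo for the exact image pipeline.
from PIL import Image

image = Image.open("example.jpg").convert("RGB")  # hypothetical local file
image_tensor = image_processor.preprocess(image, return_tensors="pt")["pixel_values"].cuda().bfloat16()

question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this image in detail."
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

output_ids = model.generate(
    inputs=input_ids,
    images=[image_tensor],           # assumed to mirror the video call above
    images_highres=[image_tensor],   # assumption: reuse the same tensor for the high-res branch
    modalities=["image"],
    do_sample=False,
    temperature=0,
    max_new_tokens=128,
    use_cache=True,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```
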
### Results

#### General Video Benchmark

<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/hKfOK0u3OXly_u4hgGLDB.png" alt="image/png" style="zoom: 33%;" />

#### Long-Form Video Understanding

<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/Xweq9f4OWkqeVc_FZIMuO.png" alt="image/png" style="zoom:33%;" />

#### Common Image Benchmark

<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/ybfroSA9WaKXtJbP_9cLR.png" alt="image/png" style="zoom:33%;" />

#### 3D Spatial Understanding

<img src="https://cdn-uploads.huggingface.co/production/uploads/652965773a416e1f2173443b/5v8ACRzAoKS0FbcVBXZhT.png" alt="image/png" style="zoom:33%;" />

### Model Architecture

- **Architecture:** Pre-trained [Oryx-ViT](https://huggingface.co/THUdyh/Oryx-ViT) + Qwen2.5-7B
- **Data:** a mixture of 1.2M image/video samples
- **Precision:** BFloat16

#### Hardware & Software

- **Hardware:** 64 × NVIDIA Tesla A100 GPUs
- **Orchestration:** HuggingFace Trainer
- **Code:** PyTorch

## Citation