kimyoungjune committed on
Commit f3bb353
1 Parent(s): 6f7fb38

Upload 20 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,223 @@
1
  ---
2
+ language:
3
+ - en
4
+ - ko
5
  license: cc-by-nc-4.0
6
+ tags:
7
+ - multimodal
8
+ - conversational
9
+ - ncsoft
10
+ - varco
11
+ base_model:
12
+ - Qwen/Qwen2.5-14B-Instruct
13
+ - google/siglip-so400m-patch14-384
14
+ library_name: transformers
15
  ---
16
+
17
+ # VARCO-VISION-14B
18
+
19
+ ## About the Model
20
+
21
+ **VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM) developed through four distinct training phases, culminating in a final preference-optimization stage. Designed to excel at both multimodal and text-only tasks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and accompanying text as input and generates text as output. It supports grounding (identifying the locations of objects within an image), referring (answering questions about a specified region of an image), and OCR (recognizing text within images).
22
+
23
+ - **Developed by:** NC Research, Multimodal Generation Team
24
+ - **Technical Report:** *Coming soon*
25
+ - **Languages:** Korean, English
26
+ - **License:** CC BY-NC 4.0
27
+ - **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
28
+ - **Base Model:**
29
+ - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
30
+ - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
31
+
32
+
33
+
34
+ ## Uses
35
+
36
+ ### Direct Use
37
+
38
+ To load VARCO-VISION-14B, start by cloning and installing **LLaVA-NeXT**:
39
+
40
+ ```bash
41
+ git clone https://github.com/LLaVA-VL/LLaVA-NeXT
42
+ cd LLaVA-NeXT
43
+ pip install -e ".[train]"
44
+ ```
45
+
46
+ After installing **LLaVA-NeXT**, you can load VARCO-VISION-14B with the following code. Note that the example passes `attn_implementation="flash_attention_2"`, which requires the `flash-attn` package; if it is not installed, drop that argument to fall back to the default attention implementation.
47
+
48
+ ```python
49
+ import torch
50
+ from transformers import AutoTokenizer
51
+ from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
52
+ from llava.mm_utils import tokenizer_image_token, process_images
53
+
54
+ model_name = "NCSOFT/VARCO-VISION-14B"
55
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
56
+ model = LlavaQwenForCausalLM.from_pretrained(
57
+ model_name,
58
+ torch_dtype=torch.float16,
59
+ attn_implementation="flash_attention_2",
60
+ low_cpu_mem_usage=True,
61
+ device_map="auto"
62
+ )
63
+
64
+ vision_tower = model.get_vision_tower()
65
+ image_processor = vision_tower.image_processor
66
+ ```
67
+
68
+ Prepare the image and text input by preprocessing the image and tokenizing the text. Pass the processed inputs to the model to generate predictions.
69
+
70
+ ```python
71
+ import requests
72
+ from PIL import Image
73
+
74
+ # Define a chat history and use `apply_chat_template` to get correctly formatted prompt
75
+ # Each value in "content" has to be a list of dicts with types ("text", "image")
76
+ conversation = [
77
+ {
78
+ "role": "user",
79
+ "content": [
80
+ {"type": "text", "text": "Describe this image."},
81
+ {"type": "image"},
82
+ ],
83
+ },
84
+ ]
85
+ prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
86
+
87
+ IMAGE_TOKEN_INDEX = -200
88
+ EOS_TOKEN = "<|im_end|>"
89
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
90
+ input_ids = input_ids.unsqueeze(0).to(model.device)
91
+
92
+ image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
93
+ raw_image = Image.open(requests.get(image_url, stream=True).raw)
94
+ image_tensors = process_images([raw_image], image_processor, model.config)
95
+ image_tensors = [image_tensor.half().to(model.device) for image_tensor in image_tensors]
96
+ image_sizes = [raw_image.size]
97
+
98
+ with torch.inference_mode():
99
+     output_ids = model.generate(
100
+ input_ids,
101
+ images=image_tensors,
102
+ image_sizes=image_sizes,
103
+ do_sample=False,
104
+ max_new_tokens=1024,
105
+ use_cache=True,
106
+ )
107
+
108
+ outputs = tokenizer.batch_decode(output_ids)[0]
109
+ if outputs.endswith(EOS_TOKEN):
110
+     outputs = outputs[: -len(EOS_TOKEN)]
111
+
112
+ outputs = outputs.strip()
113
+ print(outputs)
114
+ ```
115
+
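+ The repository also ships Hugging Face-native `llava_onevision` configuration files (`config.json`, `processor_config.json`, and `chat_template.json`, written with `transformers` 4.47.0.dev0), so the checkpoint may additionally be loadable with plain `transformers`, without LLaVA-NeXT. The snippet below is a minimal, untested sketch of that route rather than an official instruction; the class and processor names are taken from the uploaded `config.json`:
+
+ ```python
+ # Sketch only: assumes the uploaded llava_onevision configs load directly
+ # with a recent transformers release.
+ import requests
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
+
+ model_name = "NCSOFT/VARCO-VISION-14B"
+ processor = AutoProcessor.from_pretrained(model_name)
+ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
+     model_name,
+     torch_dtype=torch.float16,
+     device_map="auto",
+ )
+
+ conversation = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "Describe this image."},
+             {"type": "image"},
+         ],
+     },
+ ]
+ prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
+
+ image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ raw_image = Image.open(requests.get(image_url, stream=True).raw)
+ inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
+
+ with torch.inference_mode():
+     output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
+
+ # Note: the decoded string includes the prompt text as well as the response.
+ print(processor.decode(output_ids[0], skip_special_tokens=True))
+ ```
+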
116
+ ### Specialized Features
117
+
118
+ To ask questions or receive answers that involve bounding boxes (e.g., grounding, referring, and OCR tasks), include the corresponding special tokens in the input text.
119
+
120
+ The following special tokens are used to define specific tasks, inputs and outputs for the model:
121
+
122
+ - `<gro>`: Indicates that the model's response should include bounding box information.
123
+ - `<ocr>`: Specifies OCR tasks for recognizing text within an image.
124
+ - `<char>` and `</char>`: Used to mark a text phrase.
125
+ - `<obj>` and `</obj>`: Used to indicate an object.
126
+ - `<bbox>` and `</bbox>`: Used to represent a bounding box.
127
+ - `<delim>`: Represents multiple location points for a single object or text.
128
+
129
+ #### Grounding
130
+
131
+ Grounding refers to the task where the model identifies specific locations within an image to provide an answer. To perform grounding, prepend the special token `<gro>` to the question.
132
+
133
+ ```python
134
+ conversation = [
135
+ {
136
+ "role": "user",
137
+ "content": [
138
+ {"type": "text", "text": "<gro>\nDescribe the image in detail."},
139
+ {"type": "image"},
140
+ ],
141
+ },
142
+ ]
143
+ ```
144
+
145
+ **Expected Output Example:**
146
+ ```html
147
+ The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
148
+ ```
149
+
150
+ <img src="assets/grounding.png" alt="Grounding Example" width="400"/>
151
+
152
+ #### Referring
153
+
154
+ VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, structure the conversation so that the object of interest is wrapped in `<obj>` and `</obj>` tags and its location is given between `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. In the example below, the user asks in Korean how the referenced object is used.
155
+
156
+ ```python
157
+ conversation = [
158
+ {
159
+ "role": "user",
160
+ "content": [
161
+ {
162
+ "type": "text",
163
+ "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
164
+ },
165
+ {"type": "image"},
166
+ ],
167
+ },
168
+ ]
169
+ ```
170
+
171
+ **Expected Output Example** (in Korean; the model identifies the object as a remote control and explains how it is used):
172
+ ```
173
+ **이 물건**은 리모컨으로, 주로 텔레비전이나 다른 전자 기기를 원격으로 조작하는 데 사용됩니다. 버튼을 누르면 채널 변경, 볼륨 조절, 전원 켜기/끄기 등의 기능을 수행할 수 있습니다. 리모컨의 버튼에는 일반적으로 숫자, 메뉴, 설정, 재생/일시정지 등의 기능이 포함되어 있으며, 사용자는 이를 통해 손쉽게 기기를 제어할 수 있습니다.
174
+ ```
175
+
176
+ #### OCR
177
+
178
+ To perform Optical Character Recognition (OCR), use the `<ocr>` token.
179
+
180
+ ```python
181
+ image_file = "./assets/ocr_1.png"
182
+
183
+ conversation = [
184
+ {
185
+ "role": "user",
186
+ "content": [
187
+ {"type": "text", "text": "<ocr>"},
188
+ {"type": "image"},
189
+ ],
190
+ },
191
+ ]
192
+ ```
193
+
194
+ **Expected Output Example:**
195
+
196
+ ```
197
+ <char>백범로</char><bbox>0.172, 0.265, 0.328, 0.34</bbox>
198
+ <char>124번길</char><bbox>0.349, 0.265, 0.512, 0.34</bbox>
199
+ <char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox>
200
+ <char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
201
+ <char>만수주공아파트</char><bbox>0.109, 0.528, 0.335, 0.594</bbox>
202
+ <char>시흥</char><bbox>0.443, 0.516, 0.522, 0.578</bbox>
203
+ <char>시청</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
204
+ <char>Mansu</char><bbox>0.103, 0.601, 0.181, 0.647</bbox>
205
+ <char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
206
+ <char>Apt</char><bbox>0.281, 0.601, 0.327, 0.651</bbox>
207
+ <char>42</char><bbox>0.377, 0.601, 0.416, 0.647</bbox>
208
+ <char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.623</bbox>
209
+ <char>인천대공원</char><bbox>0.431, 0.623, 0.609, 0.684</bbox>
210
+ <char>모래내시장역</char><bbox>0.651, 0.591, 0.873, 0.664</bbox>
211
+ <char>IncheonGrand</char><bbox>0.433, 0.684, 0.561, 0.723</bbox>
212
+ <char>Park</char><bbox>0.564, 0.684, 0.611, 0.723</bbox>
213
+ ```
214
+
215
+ <img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
216
+
217
+ ## Citing the Model
218
+
219
+ If you use VARCO-VISION-14B in your research, please cite the following (*BibTeX entry coming soon*):
220
+
221
+ ```bibtex
222
+
223
+ ```
added_tokens.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "</bbox>": 151673,
3
+ "</char>": 151669,
4
+ "</obj>": 151671,
5
+ "</tool_call>": 151658,
6
+ "<bbox>": 151672,
7
+ "<char>": 151668,
8
+ "<delim>": 151674,
9
+ "<gro>": 151666,
10
+ "<image>": 151675,
11
+ "<obj>": 151670,
12
+ "<ocr>": 151667,
13
+ "<tool_call>": 151657,
14
+ "<|box_end|>": 151649,
15
+ "<|box_start|>": 151648,
16
+ "<|endoftext|>": 151643,
17
+ "<|file_sep|>": 151664,
18
+ "<|fim_middle|>": 151660,
19
+ "<|fim_pad|>": 151662,
20
+ "<|fim_prefix|>": 151659,
21
+ "<|fim_suffix|>": 151661,
22
+ "<|im_end|>": 151645,
23
+ "<|im_start|>": 151644,
24
+ "<|image_pad|>": 151655,
25
+ "<|object_ref_end|>": 151647,
26
+ "<|object_ref_start|>": 151646,
27
+ "<|quad_end|>": 151651,
28
+ "<|quad_start|>": 151650,
29
+ "<|repo_name|>": 151663,
30
+ "<|video_pad|>": 151656,
31
+ "<|vision_end|>": 151653,
32
+ "<|vision_pad|>": 151654,
33
+ "<|vision_start|>": 151652,
34
+ "[UNK]": 151665
35
+ }
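
The token-to-ID mapping above can be sanity-checked once the tokenizer is downloaded; a brief illustrative check (assuming `transformers` is installed and the repository is accessible):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NCSOFT/VARCO-VISION-14B")
print(tok.convert_tokens_to_ids(["<gro>", "<ocr>", "<bbox>", "<image>"]))
# Expected, per added_tokens.json: [151666, 151667, 151672, 151675]
```
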
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% if messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% else %}{{ '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{% endif %}{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' }}{# Render all images first #}{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}{{ '<image>\n' }}{% endfor %}{# Render all text next #}{% if message['role'] != 'assistant' %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{{ content['text'] }}{% endfor %}{% else %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{% endif %}{{ '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
3
+ }
config.json ADDED
@@ -0,0 +1,190 @@
1
+ {
2
+ "architectures": [
3
+ "LlavaOnevisionForConditionalGeneration"
4
+ ],
5
+ "image_grid_pinpoints": [
6
+ [
7
+ 384,
8
+ 384
9
+ ],
10
+ [
11
+ 384,
12
+ 768
13
+ ],
14
+ [
15
+ 384,
16
+ 1152
17
+ ],
18
+ [
19
+ 384,
20
+ 1536
21
+ ],
22
+ [
23
+ 384,
24
+ 1920
25
+ ],
26
+ [
27
+ 384,
28
+ 2304
29
+ ],
30
+ [
31
+ 768,
32
+ 384
33
+ ],
34
+ [
35
+ 768,
36
+ 768
37
+ ],
38
+ [
39
+ 768,
40
+ 1152
41
+ ],
42
+ [
43
+ 768,
44
+ 1536
45
+ ],
46
+ [
47
+ 768,
48
+ 1920
49
+ ],
50
+ [
51
+ 768,
52
+ 2304
53
+ ],
54
+ [
55
+ 1152,
56
+ 384
57
+ ],
58
+ [
59
+ 1152,
60
+ 768
61
+ ],
62
+ [
63
+ 1152,
64
+ 1152
65
+ ],
66
+ [
67
+ 1152,
68
+ 1536
69
+ ],
70
+ [
71
+ 1152,
72
+ 1920
73
+ ],
74
+ [
75
+ 1152,
76
+ 2304
77
+ ],
78
+ [
79
+ 1536,
80
+ 384
81
+ ],
82
+ [
83
+ 1536,
84
+ 768
85
+ ],
86
+ [
87
+ 1536,
88
+ 1152
89
+ ],
90
+ [
91
+ 1536,
92
+ 1536
93
+ ],
94
+ [
95
+ 1536,
96
+ 1920
97
+ ],
98
+ [
99
+ 1536,
100
+ 2304
101
+ ],
102
+ [
103
+ 1920,
104
+ 384
105
+ ],
106
+ [
107
+ 1920,
108
+ 768
109
+ ],
110
+ [
111
+ 1920,
112
+ 1152
113
+ ],
114
+ [
115
+ 1920,
116
+ 1536
117
+ ],
118
+ [
119
+ 1920,
120
+ 1920
121
+ ],
122
+ [
123
+ 1920,
124
+ 2304
125
+ ],
126
+ [
127
+ 2304,
128
+ 384
129
+ ],
130
+ [
131
+ 2304,
132
+ 768
133
+ ],
134
+ [
135
+ 2304,
136
+ 1152
137
+ ],
138
+ [
139
+ 2304,
140
+ 1536
141
+ ],
142
+ [
143
+ 2304,
144
+ 1920
145
+ ],
146
+ [
147
+ 2304,
148
+ 2304
149
+ ]
150
+ ],
151
+ "image_token_index": 151675,
152
+ "model_type": "llava_onevision",
153
+ "projector_hidden_act": "gelu",
154
+ "text_config": {
155
+ "_name_or_path": "Qwen/Qwen2.5-14B-Instruct",
156
+ "architectures": [
157
+ "Qwen2ForCausalLM"
158
+ ],
159
+ "bos_token_id": 151643,
160
+ "eos_token_id": 151645,
161
+ "hidden_size": 5120,
162
+ "intermediate_size": 13824,
163
+ "max_window_layers": 70,
164
+ "model_type": "qwen2",
165
+ "num_attention_heads": 40,
166
+ "num_hidden_layers": 48,
167
+ "num_key_value_heads": 8,
168
+ "rope_theta": 1000000.0,
169
+ "torch_dtype": "bfloat16",
170
+ "vocab_size": 151680
171
+ },
172
+ "tie_word_embeddings": false,
173
+ "torch_dtype": "float16",
174
+ "transformers_version": "4.47.0.dev0",
175
+ "use_image_newline_parameter": true,
176
+ "video_token_index": 151647,
177
+ "vision_aspect_ratio": "anyres_max_9",
178
+ "vision_config": {
179
+ "hidden_size": 1152,
180
+ "image_size": 384,
181
+ "intermediate_size": 4304,
182
+ "model_type": "siglip_vision_model",
183
+ "num_attention_heads": 16,
184
+ "num_hidden_layers": 26,
185
+ "patch_size": 14,
186
+ "vision_use_head": false
187
+ },
188
+ "vision_feature_layer": -1,
189
+ "vision_feature_select_strategy": "full"
190
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 151643,
4
+ "eos_token_id": 151645,
5
+ "transformers_version": "4.47.0.dev0"
6
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c56966c66fabd98c18f4dbf7aa39d52d39e3f82440c6aafc9179e27eabb21a70
3
+ size 4882570368
model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:98d9a2b11900d13fa7af14b59a7d3da82c6604fcd4d858b1d994ce9f9eaf059b
3
+ size 4954848840
model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3827bd36339c072a3a86b7aee5a47bbd48f5728fc3e24c148c325a64f46f7e59
3
+ size 4954848904
model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:daa9cd2e696234f1bd02c0a3c874065202bc2c896aebf3574539412c4ec2a24b
3
+ size 4954848904
model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a4790292eeb4e156e5f22a34b45499cdd60cbd00885e47dde644cd772009f6e4
3
+ size 4954848904
model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b6f0414ce51a8db1080653fdef8faee9a1bf987ae7f3c75f2bb8e9b3f4b08061
3
+ size 4136918232
model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:085b7df07c04471a0dc8d8014a82585624b71f2c66b4d0d815a64437b67e964f
3
+ size 1553203344
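
Taken together, the seven shards above total roughly 30.4 GB, which is consistent with a model of about 15 billion parameters stored in float16 (2 bytes per parameter, per `"torch_dtype": "float16"` in `config.json`). A quick back-of-the-envelope check:

```python
# Shard sizes in bytes, as listed in the LFS pointers above.
shard_sizes = [
    4882570368, 4954848840, 4954848904, 4954848904,
    4954848904, 4136918232, 1553203344,
]
total_bytes = sum(shard_sizes)             # 30,392,087,496 bytes (~30.4 GB)
print(total_bytes, total_bytes / 2 / 1e9)  # ~15.2e9 fp16 parameters
```
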
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,171 @@
1
+ {
2
+ "do_convert_rgb": true,
3
+ "do_normalize": true,
4
+ "do_pad": true,
5
+ "do_rescale": true,
6
+ "do_resize": true,
7
+ "image_grid_pinpoints": [
8
+ [
9
+ 384,
10
+ 384
11
+ ],
12
+ [
13
+ 384,
14
+ 768
15
+ ],
16
+ [
17
+ 384,
18
+ 1152
19
+ ],
20
+ [
21
+ 384,
22
+ 1536
23
+ ],
24
+ [
25
+ 384,
26
+ 1920
27
+ ],
28
+ [
29
+ 384,
30
+ 2304
31
+ ],
32
+ [
33
+ 768,
34
+ 384
35
+ ],
36
+ [
37
+ 768,
38
+ 768
39
+ ],
40
+ [
41
+ 768,
42
+ 1152
43
+ ],
44
+ [
45
+ 768,
46
+ 1536
47
+ ],
48
+ [
49
+ 768,
50
+ 1920
51
+ ],
52
+ [
53
+ 768,
54
+ 2304
55
+ ],
56
+ [
57
+ 1152,
58
+ 384
59
+ ],
60
+ [
61
+ 1152,
62
+ 768
63
+ ],
64
+ [
65
+ 1152,
66
+ 1152
67
+ ],
68
+ [
69
+ 1152,
70
+ 1536
71
+ ],
72
+ [
73
+ 1152,
74
+ 1920
75
+ ],
76
+ [
77
+ 1152,
78
+ 2304
79
+ ],
80
+ [
81
+ 1536,
82
+ 384
83
+ ],
84
+ [
85
+ 1536,
86
+ 768
87
+ ],
88
+ [
89
+ 1536,
90
+ 1152
91
+ ],
92
+ [
93
+ 1536,
94
+ 1536
95
+ ],
96
+ [
97
+ 1536,
98
+ 1920
99
+ ],
100
+ [
101
+ 1536,
102
+ 2304
103
+ ],
104
+ [
105
+ 1920,
106
+ 384
107
+ ],
108
+ [
109
+ 1920,
110
+ 768
111
+ ],
112
+ [
113
+ 1920,
114
+ 1152
115
+ ],
116
+ [
117
+ 1920,
118
+ 1536
119
+ ],
120
+ [
121
+ 1920,
122
+ 1920
123
+ ],
124
+ [
125
+ 1920,
126
+ 2304
127
+ ],
128
+ [
129
+ 2304,
130
+ 384
131
+ ],
132
+ [
133
+ 2304,
134
+ 768
135
+ ],
136
+ [
137
+ 2304,
138
+ 1152
139
+ ],
140
+ [
141
+ 2304,
142
+ 1536
143
+ ],
144
+ [
145
+ 2304,
146
+ 1920
147
+ ],
148
+ [
149
+ 2304,
150
+ 2304
151
+ ]
152
+ ],
153
+ "image_mean": [
154
+ 0.5,
155
+ 0.5,
156
+ 0.5
157
+ ],
158
+ "image_processor_type": "LlavaOnevisionImageProcessor",
159
+ "image_std": [
160
+ 0.5,
161
+ 0.5,
162
+ 0.5
163
+ ],
164
+ "processor_class": "LlavaOnevisionProcessor",
165
+ "resample": 3,
166
+ "rescale_factor": 0.00392156862745098,
167
+ "size": {
168
+ "height": 384,
169
+ "width": 384
170
+ }
171
+ }
processor_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "image_token": "<image>",
3
+ "num_image_tokens": 729,
4
+ "processor_class": "LlavaOnevisionProcessor",
5
+ "video_token": "<video>",
6
+ "vision_feature_select_strategy": "full"
7
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,88 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<gro>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ {
11
+ "content": "<ocr>",
12
+ "lstrip": false,
13
+ "normalized": false,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ },
17
+ {
18
+ "content": "<char>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ {
25
+ "content": "</char>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ },
31
+ {
32
+ "content": "<obj>",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false
37
+ },
38
+ {
39
+ "content": "</obj>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ },
45
+ {
46
+ "content": "<bbox>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false
51
+ },
52
+ {
53
+ "content": "</bbox>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false
58
+ },
59
+ {
60
+ "content": "<delim>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false
65
+ }
66
+ ],
67
+ "eos_token": {
68
+ "content": "<|im_end|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false
73
+ },
74
+ "pad_token": {
75
+ "content": "<|endoftext|>",
76
+ "lstrip": false,
77
+ "normalized": false,
78
+ "rstrip": false,
79
+ "single_word": false
80
+ },
81
+ "unk_token": {
82
+ "content": "[UNK]",
83
+ "lstrip": false,
84
+ "normalized": false,
85
+ "rstrip": false,
86
+ "single_word": false
87
+ }
88
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e74a2d4fb77a18dd57a5880efa6689e2691246f5424642359b731bc1a9e41657
3
+ size 11423909
tokenizer_config.json ADDED
@@ -0,0 +1,293 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "[UNK]",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ },
189
+ "151666": {
190
+ "content": "<gro>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": true
196
+ },
197
+ "151667": {
198
+ "content": "<ocr>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": true
204
+ },
205
+ "151668": {
206
+ "content": "<char>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": true
212
+ },
213
+ "151669": {
214
+ "content": "</char>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151670": {
222
+ "content": "<obj>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151671": {
230
+ "content": "</obj>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151672": {
238
+ "content": "<bbox>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151673": {
246
+ "content": "</bbox>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "151674": {
254
+ "content": "<delim>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "151675": {
262
+ "content": "<image>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ }
269
+ },
270
+ "additional_special_tokens": [
271
+ "<gro>",
272
+ "<ocr>",
273
+ "<char>",
274
+ "</char>",
275
+ "<obj>",
276
+ "</obj>",
277
+ "<bbox>",
278
+ "</bbox>",
279
+ "<delim>"
280
+ ],
281
+ "bos_token": null,
282
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
283
+ "clean_up_tokenization_spaces": false,
284
+ "eos_token": "<|im_end|>",
285
+ "errors": "replace",
286
+ "extra_special_tokens": {},
287
+ "model_max_length": 131072,
288
+ "pad_token": "<|endoftext|>",
289
+ "processor_class": "LlavaOnevisionProcessor",
290
+ "split_special_tokens": false,
291
+ "tokenizer_class": "Qwen2Tokenizer",
292
+ "unk_token": "[UNK]"
293
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff