kimyoungjune committed on
Commit f3bb353
1 Parent(s): 6f7fb38

Upload 20 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,223 @@
1
  ---
2
+ language:
3
+ - en
4
+ - ko
5
  license: cc-by-nc-4.0
6
+ tags:
7
+ - multimodal
8
+ - conversational
9
+ - ncsoft
10
+ - varco
11
+ base_model:
12
+ - Qwen/Qwen2.5-14B-Instruct
13
+ - google/siglip-so400m-patch14-384
14
+ library_name: transformers
15
  ---
16
+
17
+ # VARCO-VISION-14B
18
+
19
+ ## About the Model
20
+
21
+ **VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM) developed through four distinct training phases, culminating in a final preference-optimization stage. Designed to excel at both multimodal and text-only tasks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and accompanying text as input and generates text as output. It supports grounding (identifying the locations of objects within an image), referring (answering questions about a specified region of an image), and OCR (recognizing text within images).
22
+
23
+ - **Developed by:** NC Research, Multimodal Generation Team
24
+ - **Technical Report:** *Coming soon*
25
+ - **Languages:** Korean, English
26
+ - **License:** CC BY-NC 4.0
27
+ - **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
28
+ - **Base Model:**
29
+ - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
30
+ - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
31
+
32
+
33
+
34
+ ## Uses
35
+
36
+ ### Direct Use
37
+
38
+ To load VARCO-VISION-14B, start by cloning and installing **LLaVA-NeXT**:
39
+
40
+ ```bash
41
+ git clone https://github.com/LLaVA-VL/LLaVA-NeXT
42
+ cd LLaVA-NeXT
43
+ pip install -e ".[train]"
44
+ ```
45
+
46
+ After installing **LLaVA-NeXT**, you can load VARCO-VISION-14B with the following code. Note that the example passes `attn_implementation="flash_attention_2"`, which requires the `flash-attn` package; if it is not installed, drop that argument to fall back to the default attention implementation.
47
+
48
+ ```python
49
+ import torch
50
+ from transformers import AutoTokenizer
51
+ from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
52
+ from llava.mm_utils import tokenizer_image_token, process_images
53
+
54
+ model_name = "NCSOFT/VARCO-VISION-14B"
55
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
56
+ model = LlavaQwenForCausalLM.from_pretrained(
57
+ model_name,
58
+ torch_dtype=torch.float16,
59
+ attn_implementation="flash_attention_2",
60
+ low_cpu_mem_usage=True,
61
+ device_map="auto"
62
+ )
63
+
64
+ vision_tower = model.get_vision_tower()
65
+ image_processor = vision_tower.image_processor
66
+ ```
67
+
68
+ Prepare the image and text input by preprocessing the image and tokenizing the text. Pass the processed inputs to the model to generate predictions.
69
+
70
+ ```python
71
+ import requests
72
+ from PIL import Image
73
+
74
+ # Define a chat history and use `apply_chat_template` to get correctly formatted prompt
75
+ # Each value in "content" has to be a list of dicts with types ("text", "image")
76
+ conversation = [
77
+ {
78
+ "role": "user",
79
+ "content": [
80
+ {"type": "text", "text": "Describe this image."},
81
+ {"type": "image"},
82
+ ],
83
+ },
84
+ ]
85
+ prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
86
+
87
+ IMAGE_TOKEN_INDEX = -200
88
+ EOS_TOKEN = "<|im_end|>"
89
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
90
+ input_ids = input_ids.unsqueeze(0).to(model.device)
91
+
92
+ image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
93
+ raw_image = Image.open(requests.get(image_url, stream=True).raw)
94
+ image_tensors = process_images([raw_image], image_processor, model.config)
95
+ image_tensors = [image_tensor.half().to(model.device) for image_tensor in image_tensors]
96
+ image_sizes = [raw_image.size]
97
+
98
+ with torch.inference_mode():
99
+     output_ids = model.generate(
100
+ input_ids,
101
+ images=image_tensors,
102
+ image_sizes=image_sizes,
103
+ do_sample=False,
104
+ max_new_tokens=1024,
105
+ use_cache=True,
106
+ )
107
+
108
+ outputs = tokenizer.batch_decode(output_ids)[0]
109
+ if outputs.endswith(EOS_TOKEN):
110
+     outputs = outputs[: -len(EOS_TOKEN)]
111
+
112
+ outputs = outputs.strip()
113
+ print(outputs)
114
+ ```
115
+
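+ The repository also ships Hugging Face-native `llava_onevision` configuration files (`config.json`, `processor_config.json`, and `chat_template.json`, written with `transformers` 4.47.0.dev0), so the checkpoint may additionally be loadable with plain `transformers`, without LLaVA-NeXT. The snippet below is a minimal, untested sketch of that route rather than an official instruction; the class and processor names are taken from the uploaded `config.json`:
+
+ ```python
+ # Sketch only: assumes the uploaded llava_onevision configs load directly
+ # with a recent transformers release.
+ import requests
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
+
+ model_name = "NCSOFT/VARCO-VISION-14B"
+ processor = AutoProcessor.from_pretrained(model_name)
+ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
+     model_name,
+     torch_dtype=torch.float16,
+     device_map="auto",
+ )
+
+ conversation = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "Describe this image."},
+             {"type": "image"},
+         ],
+     },
+ ]
+ prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
+
+ image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ raw_image = Image.open(requests.get(image_url, stream=True).raw)
+ inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
+
+ with torch.inference_mode():
+     output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
+
+ # Note: the decoded string includes the prompt text as well as the response.
+ print(processor.decode(output_ids[0], skip_special_tokens=True))
+ ```
+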
116
+ ### Specialized Features
117
+
118
+ To ask questions or receive answers that involve bounding boxes (e.g., grounding, referring, and OCR tasks), include the corresponding special tokens in the input text.
119
+
120
+ The following special tokens are used to define specific tasks, inputs and outputs for the model:
121
+
122
+ - `<gro>`: Indicates that the model's response should include bounding box information.
123
+ - `<ocr>`: Specifies OCR tasks for recognizing text within an image.
124
+ - `<char>` and `</char>`: Used to mark a text phrase.
125
+ - `<obj>` and `</obj>`: Used to indicate an object.
126
+ - `<bbox>` and `</bbox>`: Used to represent a bounding box.
127
+ - `<delim>`: Represents multiple location points for a single object or text.
128
+
129
+ #### Grounding
130
+
131
+ Grounding refers to the task where the model identifies specific locations within an image to provide an answer. To perform grounding, prepend the special token `<gro>` to the question.
132
+
133
+ ```python
134
+ conversation = [
135
+ {
136
+ "role": "user",
137
+ "content": [
138
+ {"type": "text", "text": "<gro>\nDescribe the image in detail."},
139
+ {"type": "image"},
140
+ ],
141
+ },
142
+ ]
143
+ ```
144
+
145
+ **Expected Output Example:**
146
+ ```html
147
+ The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
148
+ ```
149
+
150
+ <img src="assets/grounding.png" alt="Grounding Example" width="400"/>
151
+
152
+ #### Referring
153
+
154
+ VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, structure the conversation so that the object of interest is wrapped in `<obj>` and `</obj>` tags and its location is given between `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. In the example below, the user asks in Korean how the referenced object is used.
155
+
156
+ ```python
157
+ conversation = [
158
+ {
159
+ "role": "user",
160
+ "content": [
161
+ {
162
+ "type": "text",
163
+ "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
164
+ },
165
+ {"type": "image"},
166
+ ],
167
+ },
168
+ ]
169
+ ```
170
+
171
+ **Expected Output Example** (in Korean; the model identifies the object as a remote control and explains how it is used):
172
+ ```
173
+ **이 물건**은 리모컨으로, 주로 텔레비전이나 다른 전자 기기를 원격으로 조작하는 데 사용됩니다. 버튼을 누르면 채널 변경, 볼륨 조절, 전원 켜기/끄기 등의 기능을 수행할 수 있습니다. 리모컨의 버튼에는 일반적으로 숫자, 메뉴, 설정, 재생/일시정지 등의 기능이 포함되어 있으며, 사용자는 이를 통해 손쉽게 기기를 제어할 수 있습니다.
174
+ ```
175
+
176
+ #### OCR
177
+
178
+ To perform Optical Character Recognition (OCR), use the `<ocr>` token.
179
+
180
+ ```python
181
+ image_file = "./assets/ocr_1.png"
182
+
183
+ conversation = [
184
+ {
185
+ "role": "user",
186
+ "content": [
187
+ {"type": "text", "text": "<ocr>"},
188
+ {"type": "image"},
189
+ ],
190
+ },
191
+ ]
192
+ ```
193
+
194
+ **Expected Output Example:**
195
+
196
+ ```
197
+ <char>백범로</char><bbox>0.172, 0.265, 0.328, 0.34</bbox>
198
+ <char>124번길</char><bbox>0.349, 0.265, 0.512, 0.34</bbox>
199
+ <char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox>
200
+ <char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
201
+ <char>만수주공아파트</char><bbox>0.109, 0.528, 0.335, 0.594</bbox>
202
+ <char>시흥</char><bbox>0.443, 0.516, 0.522, 0.578</bbox>
203
+ <char>시청</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
204
+ <char>Mansu</char><bbox>0.103, 0.601, 0.181, 0.647</bbox>
205
+ <char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
206
+ <char>Apt</char><bbox>0.281, 0.601, 0.327, 0.651</bbox>
207
+ <char>42</char><bbox>0.377, 0.601, 0.416, 0.647</bbox>
208
+ <char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.623</bbox>
209
+ <char>인천대공원</char><bbox>0.431, 0.623, 0.609, 0.684</bbox>
210
+ <char>모래내시장역</char><bbox>0.651, 0.591, 0.873, 0.664</bbox>
211
+ <char>IncheonGrand</char><bbox>0.433, 0.684, 0.561, 0.723</bbox>
212
+ <char>Park</char><bbox>0.564, 0.684, 0.611, 0.723</bbox>
213
+ ```
214
+
215
+ <img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
216
+
217
+ ## Citing the Model
218
+
219
+ If you use VARCO-VISION-14B in your research, please cite the following (*BibTeX entry coming soon*):
220
+
221
+ ```bibtex
222
+
223
+ ```
added_tokens.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "</bbox>": 151673,
3
+ "</char>": 151669,
4
+ "</obj>": 151671,
5
+ "</tool_call>": 151658,
6
+ "<bbox>": 151672,
7
+ "<char>": 151668,
8
+ "<delim>": 151674,
9
+ "<gro>": 151666,
10
+ "<image>": 151675,
11
+ "<obj>": 151670,
12
+ "<ocr>": 151667,
13
+ "<tool_call>": 151657,
14
+ "<|box_end|>": 151649,
15
+ "<|box_start|>": 151648,
16
+ "<|endoftext|>": 151643,
17
+ "<|file_sep|>": 151664,
18
+ "<|fim_middle|>": 151660,
19
+ "<|fim_pad|>": 151662,
20
+ "<|fim_prefix|>": 151659,
21
+ "<|fim_suffix|>": 151661,
22
+ "<|im_end|>": 151645,
23
+ "<|im_start|>": 151644,
24
+ "<|image_pad|>": 151655,
25
+ "<|object_ref_end|>": 151647,
26
+ "<|object_ref_start|>": 151646,
27
+ "<|quad_end|>": 151651,
28
+ "<|quad_start|>": 151650,
29
+ "<|repo_name|>": 151663,
30
+ "<|video_pad|>": 151656,
31
+ "<|vision_end|>": 151653,
32
+ "<|vision_pad|>": 151654,
33
+ "<|vision_start|>": 151652,
34
+ "[UNK]": 151665
35
+ }
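
The token-to-ID mapping above can be sanity-checked once the tokenizer is downloaded; a brief illustrative check (assuming `transformers` is installed and the repository is accessible):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NCSOFT/VARCO-VISION-14B")
print(tok.convert_tokens_to_ids(["<gro>", "<ocr>", "<bbox>", "<image>"]))
# Expected, per added_tokens.json: [151666, 151667, 151672, 151675]
```
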
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% if messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% else %}{{ '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{% endif %}{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' }}{# Render all images first #}{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}{{ '<image>\n' }}{% endfor %}{# Render all text next #}{% if message['role'] != 'assistant' %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{{ content['text'] }}{% endfor %}{% else %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{% endif %}{{ '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
3
+ }
config.json ADDED
@@ -0,0 +1,190 @@
1
+ {
2
+ "architectures": [
3
+ "LlavaOnevisionForConditionalGeneration"
4
+ ],
5
+ "image_grid_pinpoints": [
6
+ [
7
+ 384,
8
+ 384
9
+ ],
10
+ [
11
+ 384,
12
+ 768
13
+ ],
14
+ [
15
+ 384,
16
+ 1152
17
+ ],
18
+ [
19
+ 384,
20
+ 1536
21
+ ],
22
+ [
23
+ 384,
24
+ 1920
25
+ ],
26
+ [
27
+ 384,
28
+ 2304
29
+ ],
30
+ [
31
+ 768,
32
+ 384
33
+ ],
34
+ [
35
+ 768,
36
+ 768
37
+ ],
38
+ [
39
+ 768,
40
+ 1152
41
+ ],
42
+ [
43
+ 768,
44
+ 1536
45
+ ],
46
+ [
47
+ 768,
48
+ 1920
49
+ ],
50
+ [
51
+ 768,
52
+ 2304
53
+ ],
54
+ [
55
+ 1152,
56
+ 384
57
+ ],
58
+ [
59
+ 1152,
60
+ 768
61
+ ],
62
+ [
63
+ 1152,
64
+ 1152
65
+ ],
66
+ [
67
+ 1152,
68
+ 1536
69
+ ],
70
+ [
71
+ 1152,
72
+ 1920
73
+ ],
74
+ [
75
+ 1152,
76
+ 2304
77
+ ],
78
+ [
79
+ 1536,
80
+ 384
81
+ ],
82
+ [
83
+ 1536,
84
+ 768
85
+ ],
86
+ [
87
+ 1536,
88
+ 1152
89
+ ],
90
+ [
91
+ 1536,
92
+ 1536
93
+ ],
94
+ [
95
+ 1536,
96
+ 1920
97
+ ],
98
+ [
99
+ 1536,
100
+ 2304
101
+ ],
102
+ [
103
+ 1920,
104
+ 384
105
+ ],
106
+ [
107
+ 1920,
108
+ 768
109
+ ],
110
+ [
111
+ 1920,
112
+ 1152
113
+ ],
114
+ [
115
+ 1920,
116
+ 1536
117
+ ],
118
+ [
119
+ 1920,
120
+ 1920
121
+ ],
122
+ [
123
+ 1920,
124
+ 2304
125
+ ],
126
+ [
127
+ 2304,
128
+ 384
129
+ ],
130
+ [
131
+ 2304,
132
+ 768
133
+ ],
134
+ [
135
+ 2304,
136
+ 1152
137
+ ],
138
+ [
139
+ 2304,
140
+ 1536
141
+ ],
142
+ [
143
+ 2304,
144
+ 1920
145
+ ],
146
+ [
147
+ 2304,
148
+ 2304
149
+ ]
150
+ ],
151
+ "image_token_index": 151675,
152
+ "model_type": "llava_onevision",
153
+ "projector_hidden_act": "gelu",
154
+ "text_config": {
155
+ "_name_or_path": "Qwen/Qwen2.5-14B-Instruct",
156
+ "architectures": [
157
+ "Qwen2ForCausalLM"
158
+ ],
159
+ "bos_token_id": 151643,
160
+ "eos_token_id": 151645,
161
+ "hidden_size": 5120,
162
+ "intermediate_size": 13824,
163
+ "max_window_layers": 70,
164
+ "model_type": "qwen2",
165
+ "num_attention_heads": 40,
166
+ "num_hidden_layers": 48,
167
+ "num_key_value_heads": 8,
168
+ "rope_theta": 1000000.0,
169
+ "torch_dtype": "bfloat16",
170
+ "vocab_size": 151680
171
+ },
172
+ "tie_word_embeddings": false,
173
+ "torch_dtype": "float16",
174
+ "transformers_version": "4.47.0.dev0",
175
+ "use_image_newline_parameter": true,
176
+ "video_token_index": 151647,
177
+ "vision_aspect_ratio": "anyres_max_9",
178
+ "vision_config": {
179
+ "hidden_size": 1152,
180
+ "image_size": 384,
181
+ "intermediate_size": 4304,
182
+ "model_type": "siglip_vision_model",
183
+ "num_attention_heads": 16,
184
+ "num_hidden_layers": 26,
185
+ "patch_size": 14,
186
+ "vision_use_head": false
187
+ },
188
+ "vision_feature_layer": -1,
189
+ "vision_feature_select_strategy": "full"
190
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 151643,
4
+ "eos_token_id": 151645,
5
+ "transformers_version": "4.47.0.dev0"
6
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c56966c66fabd98c18f4dbf7aa39d52d39e3f82440c6aafc9179e27eabb21a70
3
+ size 4882570368
model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:98d9a2b11900d13fa7af14b59a7d3da82c6604fcd4d858b1d994ce9f9eaf059b
3
+ size 4954848840
model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3827bd36339c072a3a86b7aee5a47bbd48f5728fc3e24c148c325a64f46f7e59
3
+ size 4954848904
model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:daa9cd2e696234f1bd02c0a3c874065202bc2c896aebf3574539412c4ec2a24b
3
+ size 4954848904
model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a4790292eeb4e156e5f22a34b45499cdd60cbd00885e47dde644cd772009f6e4
3
+ size 4954848904
model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b6f0414ce51a8db1080653fdef8faee9a1bf987ae7f3c75f2bb8e9b3f4b08061
3
+ size 4136918232
model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:085b7df07c04471a0dc8d8014a82585624b71f2c66b4d0d815a64437b67e964f
3
+ size 1553203344
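
Taken together, the seven shards above total roughly 30.4 GB, which is consistent with a model of about 15 billion parameters stored in float16 (2 bytes per parameter, per `"torch_dtype": "float16"` in `config.json`). A quick back-of-the-envelope check:

```python
# Shard sizes in bytes, as listed in the LFS pointers above.
shard_sizes = [
    4882570368, 4954848840, 4954848904, 4954848904,
    4954848904, 4136918232, 1553203344,
]
total_bytes = sum(shard_sizes)             # 30,392,087,496 bytes (~30.4 GB)
print(total_bytes, total_bytes / 2 / 1e9)  # ~15.2e9 fp16 parameters
```
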
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,171 @@
1
+ {
2
+ "do_convert_rgb": true,
3
+ "do_normalize": true,
4
+ "do_pad": true,
5
+ "do_rescale": true,
6
+ "do_resize": true,
7
+ "image_grid_pinpoints": [
8
+ [
9
+ 384,
10
+ 384
11
+ ],
12
+ [
13
+ 384,
14
+ 768
15
+ ],
16
+ [
17
+ 384,
18
+ 1152
19
+ ],
20
+ [
21
+ 384,
22
+ 1536
23
+ ],
24
+ [
25
+ 384,
26
+ 1920
27
+ ],
28
+ [
29
+ 384,
30
+ 2304
31
+ ],
32
+ [
33
+ 768,
34
+ 384
35
+ ],
36
+ [
37
+ 768,
38
+ 768
39
+ ],
40
+ [
41
+ 768,
42
+ 1152
43
+ ],
44
+ [
45
+ 768,
46
+ 1536
47
+ ],
48
+ [
49
+ 768,
50
+ 1920
51
+ ],
52
+ [
53
+ 768,
54
+ 2304
55
+ ],
56
+ [
57
+ 1152,
58
+ 384
59
+ ],
60
+ [
61
+ 1152,
62
+ 768
63
+ ],
64
+ [
65
+ 1152,
66
+ 1152
67
+ ],
68
+ [
69
+ 1152,
70
+ 1536
71
+ ],
72
+ [
73
+ 1152,
74
+ 1920
75
+ ],
76
+ [
77
+ 1152,
78
+ 2304
79
+ ],
80
+ [
81
+ 1536,
82
+ 384
83
+ ],
84
+ [
85
+ 1536,
86
+ 768
87
+ ],
88
+ [
89
+ 1536,
90
+ 1152
91
+ ],
92
+ [
93
+ 1536,
94
+ 1536
95
+ ],
96
+ [
97
+ 1536,
98
+ 1920
99
+ ],
100
+ [
101
+ 1536,
102
+ 2304
103
+ ],
104
+ [
105
+ 1920,
106
+ 384
107
+ ],
108
+ [
109
+ 1920,
110
+ 768
111
+ ],
112
+ [
113
+ 1920,
114
+ 1152
115
+ ],
116
+ [
117
+ 1920,
118
+ 1536
119
+ ],
120
+ [
121
+ 1920,
122
+ 1920
123
+ ],
124
+ [
125
+ 1920,
126
+ 2304
127
+ ],
128
+ [
129
+ 2304,
130
+ 384
131
+ ],
132
+ [
133
+ 2304,
134
+ 768
135
+ ],
136
+ [
137
+ 2304,
138
+ 1152
139
+ ],
140
+ [
141
+ 2304,
142
+ 1536
143
+ ],
144
+ [
145
+ 2304,
146
+ 1920
147
+ ],
148
+ [
149
+ 2304,
150
+ 2304
151
+ ]
152
+ ],
153
+ "image_mean": [
154
+ 0.5,
155
+ 0.5,
156
+ 0.5
157
+ ],
158
+ "image_processor_type": "LlavaOnevisionImageProcessor",
159
+ "image_std": [
160
+ 0.5,
161
+ 0.5,
162
+ 0.5
163
+ ],
164
+ "processor_class": "LlavaOnevisionProcessor",
165
+ "resample": 3,
166
+ "rescale_factor": 0.00392156862745098,
167
+ "size": {
168
+ "height": 384,
169
+ "width": 384
170
+ }
171
+ }
processor_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "image_token": "<image>",
3
+ "num_image_tokens": 729,
4
+ "processor_class": "LlavaOnevisionProcessor",
5
+ "video_token": "<video>",
6
+ "vision_feature_select_strategy": "full"
7
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,88 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<gro>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ {
11
+ "content": "<ocr>",
12
+ "lstrip": false,
13
+ "normalized": false,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ },
17
+ {
18
+ "content": "<char>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ {
25
+ "content": "</char>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ },
31
+ {
32
+ "content": "<obj>",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false
37
+ },
38
+ {
39
+ "content": "</obj>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ },
45
+ {
46
+ "content": "<bbox>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false
51
+ },
52
+ {
53
+ "content": "</bbox>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false
58
+ },
59
+ {
60
+ "content": "<delim>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false
65
+ }
66
+ ],
67
+ "eos_token": {
68
+ "content": "<|im_end|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false
73
+ },
74
+ "pad_token": {
75
+ "content": "<|endoftext|>",
76
+ "lstrip": false,
77
+ "normalized": false,
78
+ "rstrip": false,
79
+ "single_word": false
80
+ },
81
+ "unk_token": {
82
+ "content": "[UNK]",
83
+ "lstrip": false,
84
+ "normalized": false,
85
+ "rstrip": false,
86
+ "single_word": false
87
+ }
88
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e74a2d4fb77a18dd57a5880efa6689e2691246f5424642359b731bc1a9e41657
3
+ size 11423909
tokenizer_config.json ADDED
@@ -0,0 +1,293 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "[UNK]",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ },
189
+ "151666": {
190
+ "content": "<gro>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": true
196
+ },
197
+ "151667": {
198
+ "content": "<ocr>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": true
204
+ },
205
+ "151668": {
206
+ "content": "<char>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": true
212
+ },
213
+ "151669": {
214
+ "content": "</char>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151670": {
222
+ "content": "<obj>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151671": {
230
+ "content": "</obj>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151672": {
238
+ "content": "<bbox>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151673": {
246
+ "content": "</bbox>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "151674": {
254
+ "content": "<delim>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "151675": {
262
+ "content": "<image>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ }
269
+ },
270
+ "additional_special_tokens": [
271
+ "<gro>",
272
+ "<ocr>",
273
+ "<char>",
274
+ "</char>",
275
+ "<obj>",
276
+ "</obj>",
277
+ "<bbox>",
278
+ "</bbox>",
279
+ "<delim>"
280
+ ],
281
+ "bos_token": null,
282
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
283
+ "clean_up_tokenization_spaces": false,
284
+ "eos_token": "<|im_end|>",
285
+ "errors": "replace",
286
+ "extra_special_tokens": {},
287
+ "model_max_length": 131072,
288
+ "pad_token": "<|endoftext|>",
289
+ "processor_class": "LlavaOnevisionProcessor",
290
+ "split_special_tokens": false,
291
+ "tokenizer_class": "Qwen2Tokenizer",
292
+ "unk_token": "[UNK]"
293
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff