kimyoungjune committed on
Commit 99d39db · verified · 1 Parent(s): 0c7df25

Update README.md

Files changed (1)
  1. README.md +232 -226
README.md CHANGED
---
language:
- en
- ko
license: cc-by-nc-4.0
tags:
- multimodal
- conversational
- ncsoft
- varco
base_model:
- Qwen/Qwen2.5-14B-Instruct
- google/siglip-so400m-patch14-384
library_name: transformers
---

# VARCO-VISION-14B

## About the Model

**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM) developed through four distinct training phases, culminating in a final preference optimization stage. Designed to excel in both multimodal and text-only tasks, VARCO-VISION-14B not only surpasses other models of similar size in performance but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and accompanying text as input and generates text as output. It supports grounding (identifying the locations of objects within an image) as well as OCR (Optical Character Recognition) for recognizing text within images.

- **Developed by:** NC Research, Multimodal Generation Team
- **Technical Report:** [Coming Soon]()
- **Demo Page:** [Coming Soon]()
- **Languages:** Korean, English
- **License:** CC BY-NC 4.0
- **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
- **Base Model:**
  - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
  - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Hugging Face Version Model:** [NCSOFT/VARCO-VISION-14B-HF](https://huggingface.co/NCSOFT/VARCO-VISION-14B-HF)
- **Korean VLM Test Sets:**
  - [NCSOFT/K-MMBench](https://huggingface.co/datasets/NCSOFT/K-MMBench)
  - [NCSOFT/K-SEED](https://huggingface.co/datasets/NCSOFT/K-SEED)
  - [NCSOFT/K-MMStar](https://huggingface.co/datasets/NCSOFT/K-MMStar)
  - [NCSOFT/K-DTCBench](https://huggingface.co/datasets/NCSOFT/K-DTCBench)
  - [NCSOFT/K-LLaVA-W](https://huggingface.co/datasets/NCSOFT/K-LLaVA-W)
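
The Korean VLM test sets above are hosted as Hugging Face datasets, so they can presumably be loaded with the `datasets` library. The snippet below is a sketch assuming the default configuration; check each dataset card for the actual splits and columns.

```python
from datasets import load_dataset

# Load one of the Korean VLM test sets listed above and inspect its splits.
# Split and column names are not assumed here; print the DatasetDict to see them.
k_mmbench = load_dataset("NCSOFT/K-MMBench")
print(k_mmbench)
```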

## Uses

### Direct Use

To load VARCO-VISION-14B, start by cloning and installing **LLaVA-NeXT**:

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e ".[train]"
```
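
The loading code below passes `attn_implementation="flash_attention_2"`, which requires FlashAttention-2 to be available in your environment; it can typically be installed with `pip install flash-attn --no-build-isolation`, or you can drop that argument to fall back to the default attention implementation.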

After installing **LLaVA-NeXT**, you can load VARCO-VISION-14B using the following code:

```python
import torch
from transformers import AutoTokenizer
from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
from llava.mm_utils import tokenizer_image_token, process_images

model_name = "NCSOFT/VARCO-VISION-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlavaQwenForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    device_map="auto"
)

vision_tower = model.get_vision_tower()
image_processor = vision_tower.image_processor
```

Prepare the image and text input by preprocessing the image and tokenizing the text, then pass the processed inputs to the model to generate predictions.

```python
import requests
from PIL import Image

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt.
# Each value in "content" has to be a list of dicts with types ("text", "image").
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]
prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

IMAGE_TOKEN_INDEX = -200
EOS_TOKEN = "<|im_end|>"
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw)
image_tensors = process_images([raw_image], image_processor, model.config)
image_tensors = [image_tensor.half().to(model.device) for image_tensor in image_tensors]
image_sizes = [raw_image.size]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=1024,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids)[0]
if outputs.endswith(EOS_TOKEN):
    outputs = outputs[: -len(EOS_TOKEN)]

outputs = outputs.strip()
print(outputs)
```
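
The grounding, referring, and OCR examples below only redefine `conversation` (and, for OCR, the input image) and reuse the same preprocessing and generation steps shown above. As a convenience, those steps can be wrapped in a helper; the sketch below simply repeats the calls above, and `generate_answer` is an illustrative name rather than part of the released code.

```python
def generate_answer(conversation, raw_image):
    """Run one conversation/image pair through the pipeline above (illustrative helper)."""
    prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    input_ids = input_ids.unsqueeze(0).to(model.device)

    image_tensors = process_images([raw_image], image_processor, model.config)
    image_tensors = [t.half().to(model.device) for t in image_tensors]

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensors,
            image_sizes=[raw_image.size],
            do_sample=False,
            max_new_tokens=1024,
            use_cache=True,
        )
    outputs = tokenizer.batch_decode(output_ids)[0]
    if outputs.endswith(EOS_TOKEN):
        outputs = outputs[: -len(EOS_TOKEN)]
    return outputs.strip()
```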

### Specialized Features

To pose questions or receive answers that involve bounding boxes (e.g., for grounding, referring, and OCR tasks), include special tokens in the input text.

The following special tokens define specific tasks, inputs, and outputs for the model:

- `<gro>`: Indicates that the model's response should include bounding box information.
- `<ocr>`: Specifies OCR tasks for recognizing text within an image.
- `<char>` and `</char>`: Used to mark a text phrase.
- `<obj>` and `</obj>`: Used to indicate an object.
- `<bbox>` and `</bbox>`: Used to represent a bounding box.
- `<delim>`: Represents multiple location points for a single object or text.

#### Grounding

Grounding refers to the task where the model identifies specific locations within an image to provide an answer. To perform grounding, prepend the special token `<gro>` to the question.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```html
The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
```

<img src="assets/grounding.png" alt="Grounding Example" width="400"/>
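
The `<bbox>` values appear to be corner coordinates normalized to the image width and height. Below is a minimal sketch for turning a `<gro>` response into pixel-space boxes, assuming the four values are `x_min, y_min, x_max, y_max` in [0, 1] and that `<delim>` separates multiple boxes for one phrase; `parse_grounding` is an illustrative helper, not part of the released code.

```python
import re

BBOX_PATTERN = re.compile(r"<obj>(.*?)</obj><bbox>(.*?)</bbox>", re.DOTALL)

def parse_grounding(output: str, image_width: int, image_height: int):
    """Extract (phrase, [pixel boxes]) pairs from a <gro> response.

    Assumes each <bbox> holds "x_min, y_min, x_max, y_max" in [0, 1],
    with <delim> separating multiple boxes for the same phrase.
    """
    results = []
    for phrase, bbox_text in BBOX_PATTERN.findall(output):
        boxes = []
        for box in bbox_text.split("<delim>"):
            x1, y1, x2, y2 = (float(v) for v in box.split(","))
            boxes.append((
                int(x1 * image_width), int(y1 * image_height),
                int(x2 * image_width), int(y2 * image_height),
            ))
        results.append((phrase.strip(), boxes))
    return results
```

Applied to the example output above, this would yield "two cats" and "two remote controls" with two boxes each, plus "a pink blanket" with one.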

#### Referring

VARCO-VISION-14B can handle location-specific questions using bounding boxes. To perform referring tasks, structure the conversation by including the object of interest within `<obj>` and `</obj>` tags and specifying its location with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location.

```python
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                # Korean: "How is this object used?"
                "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
            },
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**
```
**이 물건**은 리모컨으로, 주로 텔레비전이나 다른 전자 기기를 원격으로 조작하는 데 사용됩니다. 버튼을 누르면 채널 변경, 볼륨 조절, 전원 켜기/끄기 등의 기능을 수행할 수 있습니다. 리모컨의 버튼에는 일반적으로 숫자, 메뉴, 설정, 재생/일시정지 등의 기능이 포함되어 있으며, 사용자는 이를 통해 손쉽게 기기를 제어할 수 있습니다.
```
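
The Korean answer above identifies the boxed object as a remote control used mainly to operate a television or other electronic devices (changing channels, adjusting the volume, powering on and off). When a region of interest is known in pixel coordinates, a small helper can format it into the `<obj>`/`<bbox>` snippet used in the prompt; `to_bbox_tag` below is an illustrative sketch that assumes the same normalized corner-coordinate convention as the examples.

```python
def to_bbox_tag(label: str, box, image_width: int, image_height: int) -> str:
    """Format a pixel-space (x1, y1, x2, y2) box as an <obj>/<bbox> prompt snippet,
    assuming coordinates are normalized by image width and height."""
    x1, y1, x2, y2 = box
    coords = ", ".join(
        str(round(v, 3))
        for v in (x1 / image_width, y1 / image_height, x2 / image_width, y2 / image_height)
    )
    return f"<obj>{label}</obj><bbox>{coords}</bbox>"

# Example: build a referring question about a 640x480 image.
question = to_bbox_tag("이 물건", (25, 66, 181, 123), 640, 480) + "은 어떻게 쓰는거야?"
```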

#### OCR

To perform Optical Character Recognition (OCR), use the `<ocr>` token.

```python
image_file = "./assets/ocr_1.png"
raw_image = Image.open(image_file)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]
```

**Expected Output Example:**

```
<char>백범로</char><bbox>0.172, 0.265, 0.328, 0.34</bbox>
<char>124번길</char><bbox>0.349, 0.265, 0.512, 0.34</bbox>
<char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox>
<char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
<char>만수주공아파트</char><bbox>0.109, 0.528, 0.335, 0.594</bbox>
<char>시흥</char><bbox>0.443, 0.516, 0.522, 0.578</bbox>
<char>시청</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
<char>Mansu</char><bbox>0.103, 0.601, 0.181, 0.647</bbox>
<char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
<char>Apt</char><bbox>0.281, 0.601, 0.327, 0.651</bbox>
<char>42</char><bbox>0.377, 0.601, 0.416, 0.647</bbox>
<char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.623</bbox>
<char>인천대공원</char><bbox>0.431, 0.623, 0.609, 0.684</bbox>
<char>모래내시장역</char><bbox>0.651, 0.591, 0.873, 0.664</bbox>
<char>IncheonGrand</char><bbox>0.433, 0.684, 0.561, 0.723</bbox>
<char>Park</char><bbox>0.564, 0.684, 0.611, 0.723</bbox>
```

<img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
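
Each line of the OCR output pairs a recognized string with a normalized box, so the result can be drawn back onto the source image for inspection. A short visualization sketch, assuming the same normalized corner-coordinate convention as in the grounding example (`draw_ocr_boxes` is an illustrative helper, not part of the released code):

```python
import re
from PIL import Image, ImageDraw

CHAR_PATTERN = re.compile(r"<char>(.*?)</char><bbox>(.*?)</bbox>")

def draw_ocr_boxes(image: Image.Image, output: str) -> Image.Image:
    """Overlay the OCR boxes on a copy of the image.

    Assumes each <bbox> holds normalized x_min, y_min, x_max, y_max values.
    """
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    width, height = annotated.size
    for text, bbox_text in CHAR_PATTERN.findall(output):
        x1, y1, x2, y2 = (float(v) for v in bbox_text.split(","))
        draw.rectangle(
            (x1 * width, y1 * height, x2 * width, y2 * height),
            outline="red", width=2,
        )
    return annotated
```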

## Citing the Model

(*BibTeX will be updated soon.*) If you use VARCO-VISION-14B in your research, please cite the following:

```bibtex

```