---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
language:
- en
- zh
metrics:
- accuracy
library_name: transformers
tags:
- multimodal

model-index:
- name: llava-onevision-qwen-0.5b-ov
  results:
  - task:
      type: multimodal
    dataset:
      type: ai2d
      name: AI2D
    metrics:
    - name: accuracy
      type: accuracy
      value: 57.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: chartqa
      name: ChartQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 61.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: docvqa
      name: DocVQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 73.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: infovqa
      name: InfoVQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 46.3
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mathverse
      name: MathVerse
    metrics:
    - name: accuracy
      type: accuracy
      value: 17.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mathvista
      name: MathVista
    metrics:
    - name: accuracy
      type: accuracy
      value: 34.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmbench
      name: MMBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 52.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mme-perception
      name: MME-Perception
    metrics:
    - name: score
      type: score
      value: 1238
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mme-cognition
      name: MME-Cognition
    metrics:
    - name: score
      type: score
      value: 240
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmmu
      name: MMMU
    metrics:
    - name: accuracy
      type: accuracy
      value: 31.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmvet
      name: MMVet
    metrics:
    - name: accuracy
      type: accuracy
      value: 29.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmstar
      name: MMStar
    metrics:
    - name: accuracy
      type: accuracy
      value: 37.5
      verified: true
  - task:
      type: multimodal
    dataset:
      type: seed-bench
      name: Seed-Bench
    metrics:
    - name: accuracy
      type: accuracy
      value: 65.5
      verified: true
  - task:
      type: multimodal
    dataset:
      type: science-qa
      name: Science-QA
    metrics:
    - name: accuracy
      type: accuracy
      value: 67.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: imagedc
      name: ImageDC
    metrics:
    - name: accuracy
      type: accuracy
      value: 83.3
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmlbench
      name: MMLBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 49.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: realworldqa
      name: RealWorldQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 55.6
      verified: true
  - task:
      type: multimodal
    dataset:
      type: vibe-eval
      name: Vibe-Eval
    metrics:
    - name: accuracy
      type: accuracy
      value: 33.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: llava-w
      name: LLaVA-W
    metrics:
    - name: accuracy
      type: accuracy
      value: 74.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: l-wilder
      name: L-Wilder
    metrics:
    - name: accuracy
      type: accuracy
      value: 55.0
      verified: true
  - task:
      type: multimodal
    dataset:
      type: actnet-qa
      name: ActNet-QA
    metrics:
    - name: accuracy
      type: accuracy
      value: 50.5
      verified: true
  - task:
      type: multimodal
    dataset:
      type: egoschema
      name: EgoSchema
    metrics:
    - name: accuracy
      type: accuracy
      value: 26.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mlvu
      name: MLVU
    metrics:
    - name: accuracy
      type: accuracy
      value: 50.3
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mvbench
      name: MVBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 45.5
      verified: true
  - task:
      type: multimodal
    dataset:
      type: nextqa
      name: NextQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 57.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: percepTest
      name: PercepTest
    metrics:
    - name: accuracy
      type: accuracy
      value: 49.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: seedbench
      name: SeedBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 44.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: videochatgpt
      name: VideoChatGPT
    metrics:
    - name: score
      type: score
      value: 3.12
      verified: true
  - task:
      type: multimodal
    dataset:
      type: videodc
      name: VideoDC
    metrics:
    - name: score
      type: score
      value: 3.55
      verified: true
  - task:
      type: multimodal
    dataset:
      type: videomme
      name: VideoMME
    metrics:
    - name: accuracy
      type: accuracy
      value: 44.0
      verified: true
  - task:
      type: multimodal
    dataset:
      type: iei
      name: Image Edit Instruction
    metrics:
    - name: accuracy
      type: accuracy
      value: 17.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mi-vqa
      name: MI-VQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 48.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: nlvr2
      name: NLVR2
    metrics:
    - name: accuracy
      type: accuracy
      value: 63.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: puzzle
      name: Puzzle
    metrics:
    - name: accuracy
      type: accuracy
      value: 35.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: q-bench
      name: Q-Bench
    metrics:
    - name: accuracy
      type: accuracy
      value: 48.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: spot-diff
      name: Spot-Diff
    metrics:
    - name: accuracy
      type: accuracy
      value: 36.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: tr-vqa
      name: TR-VQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 65.0
      verified: true
  - task:
      type: multimodal
    dataset:
      type: vst
      name: VST
    metrics:
    - name: accuracy
      type: accuracy
      value: 29.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: scannet-chat
      name: ScanNet-Chat
    metrics:
    - name: accuracy
      type: accuracy
      value: 60.00
      verified: true
  - task:
      type: multimodal
    dataset:
      type: scannet-td
      name: ScanNet-TD
    metrics:
    - name: accuracy
      type: accuracy
      value: 48.00
      verified: true
  - task:
      type: multimodal
    dataset:
      type: scanqa
      name: ScanQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 29.40
      verified: true
  - task:
      type: multimodal
    dataset:
      type: alfred
      name: ALFRED
    metrics:
    - name: accuracy
      type: accuracy
      value: 62.20
      verified: true
  - task:
      type: multimodal
    dataset:
      type: nuscenesvqa
      name: nuScenesVQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 70.50
      verified: true
  - task:
      type: multimodal
    dataset:
      type: blink
      name: BLINK
    metrics:
    - name: accuracy
      type: accuracy
      value: 52.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mantis
      name: Mantis
    metrics:
    - name: accuracy
      type: accuracy
      value: 39.6
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mathverse-mv
      name: MathVerse-mv
    metrics:
    - name: accuracy
      type: accuracy
      value: 60.0
      verified: true
  - task:
      type: multimodal
    dataset:
      type: muirbench
      name: MuirBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 25.5
      verified: true
  - task:
      type: multimodal
    dataset:
      type: sciverse-mv
      name: SciVerse-mv
    metrics:
    - name: accuracy
      type: accuracy
      value: 29.1
      verified: true
---

# LLaVA-OneVision

![banner](https://i.postimg.cc/pL17YtG4/WX20240508-220230-2x.png)

Play with the model on the [LLaVA OneVision Chat](https://llava-onevision.lmms-lab.com/).

## Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

## Model Summary

The LLaVA-OneVision models are 0.5B/7B/72B-parameter models trained on the [LLaVA-OneVision dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), based on the Qwen2 language model with a context window of 32K tokens.

- **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
- **Project Website:** [llava-onevision.lmms-lab.com](https://llava-onevision.lmms-lab.com)
- **Paper:** [LLaVA-OneVision]()
- **Point of Contact:** [Bo Li](mailto:[email protected])
- **Languages:** English, Chinese

## Use

### Intended use

The model was trained on the [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) and can interact with single images, multi-image inputs, and videos.

**Feel free to share your generations in the Community tab!**

### Generation
```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch

import warnings

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"  # the checkpoint this card describes
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other llava_model_args you need

model.eval()

# Load an example image
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

# Build the prompt with the image placeholder token
conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
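
The same calls extend to multi-image prompts: pass several images to `process_images` and put one `DEFAULT_IMAGE_TOKEN` per image in the question. Below is a minimal, hedged sketch that reuses only the APIs from the example above; the image URLs are placeholders you should swap for your own inputs.

```python
# Multi-image sketch (illustrative only): reuses the same llava APIs as the
# single-image example above. The URLs below are placeholders; use your own images.
import copy
import torch
import requests
from PIL import Image

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"  # assumed to match this card
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, "llava_qwen", device_map="auto")
model.eval()

urls = [
    "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true",
    "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true",  # placeholder second image
]
images = [Image.open(requests.get(u, stream=True).raw).convert("RGB") for u in urls]
image_tensors = process_images(images, image_processor, model.config)
image_tensors = [t.to(dtype=torch.float16, device="cuda") for t in image_tensors]

# One <image> placeholder per image, in the same order as `images`.
question = f"{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_IMAGE_TOKEN}\nWhat is the difference between these two images?"
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)

input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to("cuda")
output = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=[img.size for img in images],
    do_sample=False,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```

Video prompts follow the same pattern with sampled frames; see the [LLaVA-NeXT repository](https://github.com/LLaVA-VL/LLaVA-NeXT) for the video-specific helpers.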

## Training

### Model

- **Architecture:** SO400M + Qwen2
- **Pretraining Stage:** LCS-558K, 1 epoch, projector only
- **Mid Stage:** A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
- **Final-Image Stage:** A mixture of 3.6M single-image data, 1 epoch, full model
- **OneVision Stage:** A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
- **Precision:** bfloat16

### Hardware & Software

- **GPUs:** 256 * NVIDIA A100 (for training the whole model series)
- **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)

## Citation

```bibtex
@article{li2024llavaonevision,
  title={LLaVA-OneVision},
}
```