[email protected] committed on
Commit 32d8bf4 · 1 Parent(s): 1171108

Update README

Files changed (1)
  1. README.md +89 -7
README.md CHANGED
@@ -17,15 +17,24 @@ library_name: transformers
17
  </p>
18
 
19
 
 
20
 
21
- ## Model Details
22
 
23
  Today (September 17th, 2024), we introduce [NVLM 1.0](https://arxiv.org/abs/2409.11402), a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.
24
 
25
- In this repo, we are open-sourcing NVLM-1.0-D-72B (decoder-only architecture), the decoder-only model weights and code for the community.
 
 
26
 
27
- ## Other Resources
28
- [Inference Code (HF)](https://huggingface.co/nvidia/NVLM-D-72B/tree/main) &ensp; [Training Code (Coming soon)]() &ensp; [Website](https://research.nvidia.com/labs/adlr/NVLM-1/) &ensp; [Paper](https://arxiv.org/abs/2409.11402)
29
 
30
  ## Benchmark Results
31
  We train our model with legacy [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/legacy) and adapt the codebase to Huggingface for model hosting, reproducibility, and inference.
@@ -73,6 +82,22 @@ Results (as of September 17th, 2024) in the multimodal benchmarks are as follows
73
  | NVLM-D 1.0 72B (Huggingface) | (b) | 81.7 | 93.2 | 73.1 | 89.0 | 🥳 84.3 (+4.5) |
74
 
75
 
76
  ## How to use
77
 
78
  When converting the Megatron checkpoint to Huggingface, we adapt the [InternVL codebase](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) to support model loading and multi-GPU inference in HF.
@@ -291,6 +316,66 @@ response = model.chat(tokenizer, pixel_values, question, generation_config)
291
  print(f'User: {question}\nAssistant: {response}')
292
  ```
293
 
294
 
295
  ## Correspondence to
296
  Wenliang Dai* ([email protected]), Nayeon Lee* ([email protected]), Boxin Wang* ([email protected]), Zhuolin Yang* ([email protected]), Wei Ping* ([email protected])
@@ -307,9 +392,6 @@ Wenliang Dai* ([email protected]), Nayeon Lee* ([email protected]), Boxin Wang* (
307
  </pre>
308
 
309
 
310
- ## License
311
- The use of this model is governed by the [cc-by-nc-4.0](https://spdx.org/licenses/CC-BY-NC-4.0)
312
-
313
  ## Ethical Considerations
314
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
315
 
 
17
  </p>
18
 
19
 
20
+ # Model Overview
21
 
22
+ ## Description
23
+ This family of models performs vision-language and text-only tasks including optical character recognition, multimodal reasoning, localization, common sense reasoning, world knowledge utilization, and coding.
24
+
25
+ ## License/Terms of Use
26
+ [Creative Commons Attribution-NonCommercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) <br>
27
+
28
+ # Model Details
29
 
30
  Today (September 17th, 2024), we introduce [NVLM 1.0](https://arxiv.org/abs/2409.11402), a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.
31
 
32
+ In this repo, we are open-sourcing the model weights and code of NVLM-1.0-D-72B (decoder-only architecture) for the community.
33
+
34
+
35
 
36
+ ## Reference(s)
37
+ [Paper](https://arxiv.org/abs/2409.11402) &ensp; [Inference Code (HF)](https://huggingface.co/nvidia/NVLM-D-72B/tree/main) &ensp; [Training Code (Coming soon)]() &ensp; [Website](https://research.nvidia.com/labs/adlr/NVLM-1/)
38
 
39
  ## Benchmark Results
40
  We train our model with legacy [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/legacy) and adapt the codebase to Huggingface for model hosting, reproducibility, and inference.
 
82
  | NVLM-D 1.0 72B (Huggingface) | (b) | 81.7 | 93.2 | 73.1 | 89.0 | 🥳 84.3 (+4.5) |
83
 
84
 
85
+ ## Model Architectures
86
+
87
+ **Network Architecture:** Decoder-Only Transformer
88
+
89
+ ### Input
90
+ **Input Type(s):** Text, Image <br>
91
+ **Input Format(s):** String, [Pillow Library-Supported Formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) <br>
92
+ **Input Dimensions:** One-Dimensional (1D), Two-Dimensional (2D) <br>
93
+ **Other Properties Related to Input:** Maximum Token Length = 128K Tokens <br>
94
+
95
+ ### Output
96
+ **Output Type(s):** Text <br>
97
+ **Output Format:** String <br>
98
+ **Model Output:** 1D <br>
99
+ **Other Properties Related to Output:** None <br>
100
+
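The input and output properties above can be turned into a small pre-flight check before calling the model. The following is a minimal sketch, not part of the released code: the helper name `check_inputs`, the reading of "128K" as 128 * 1024 tokens, and the short list of image extensions (a subset of the formats Pillow reads) are all assumptions made for illustration. The token count is expected to come from the model's own tokenizer.

```python
from pathlib import Path

# Assumption: "128K" is read as 128 * 1024 tokens; the model card does not spell this out.
MAX_TOKENS = 128 * 1024

# Illustrative subset of file extensions that Pillow can read; not the full list.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tiff", ".webp"}


def check_inputs(prompt, image_path, token_count):
    """Hypothetical helper: return a list of problems (empty if inputs look acceptable)."""
    problems = []
    # Text input must be a plain string.
    if not isinstance(prompt, str):
        problems.append("prompt must be a string")
    # The image is optional (text-only tasks); if given, check its extension.
    if image_path is not None:
        suffix = Path(image_path).suffix.lower()
        if suffix not in IMAGE_EXTENSIONS:
            problems.append(f"unrecognized image extension: {suffix}")
    # token_count is supplied by the caller, e.g. from the model's tokenizer.
    if token_count > MAX_TOKENS:
        problems.append(f"{token_count} tokens exceeds the limit of {MAX_TOKENS}")
    return problems
```

An empty list means the request fits the documented input properties; the check looks only at the file extension and does not open or decode the image.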
101
  ## How to use
102
 
103
  When converting the Megatron checkpoint to Huggingface, we adapt the [InternVL codebase](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) to support model loading and multi-GPU inference in HF.
 
316
  print(f'User: {question}\nAssistant: {response}')
317
  ```
318
 
319
+ ## Software Integration
320
+ **Runtime Engine(s)**
321
+ * PyTorch <br>
322
+
323
+ **Supported Hardware Microarchitecture Compatibility:** <br>
324
+ * NVIDIA Hopper <br>
325
+
326
+ **Supported Operating System(s):** <br>
327
+ * Linux <br>
328
+
329
+ ## Inference
330
+ **Engine:** PyTorch <br>
331
+ **Test Hardware:** <br>
332
+ * H100 <br>
333
+
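To see whether a machine matches the configuration listed above (Linux, PyTorch, NVIDIA Hopper), a check along the following lines may help. This is a sketch under stated assumptions: the function names are hypothetical, and it relies on Hopper GPUs such as the H100 reporting CUDA compute capability 9.x. It describes the tested setup only and does not imply that other hardware cannot run the model.

```python
from typing import Optional, Tuple

# Assumption: Hopper-generation GPUs (e.g. H100) report compute capability 9.x.
HOPPER_MAJOR = 9


def matches_tested_setup(system: str, capability: Optional[Tuple[int, int]]) -> bool:
    """Hypothetical helper: True if OS and GPU match the listed Linux + Hopper setup."""
    if system != "Linux":
        return False
    if capability is None:  # no CUDA device, or PyTorch unavailable
        return False
    return capability[0] == HOPPER_MAJOR


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) compute capability of GPU 0, or None if it cannot be determined."""
    try:
        import torch  # imported lazily so the pure check above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

A typical call would be `matches_tested_setup(platform.system(), detect_capability())` after `import platform`.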
334
+ ## Model Version(s)
335
+ * v1.0-D (NVLM-D)
336
+
337
+ ## Training, Testing, and Evaluation Datasets
338
+
339
+ ### Pre-Training Dataset
340
+
341
+ **Link** <br>
342
+ * [See Table 4](https://arxiv.org/abs/2409.11402) <br>
343
+
344
+ **Data Collection Method by dataset** <br>
345
+ * Hybrid: Automated, Human, Synthetic, Unknown <br>
346
+
347
+ **Labeling Method by dataset** <br>
348
+ * Hybrid: Automated, Human, Synthetic, Unknown <br>
349
+
350
+ **Properties**
351
+ * Trained on image captions, image-text pairs, natural images, charts, documents, scene descriptions, and mathematical reasoning. <br>
352
+
353
+ ### Supervised Fine-Tuning Dataset
354
+ **Link** <br>
355
+ * [See Table 6](https://arxiv.org/abs/2409.11402) <br>
356
+
357
+ **Data Collection Method by dataset** <br>
358
+ * Hybrid: Automated, Human, Synthetic, Unknown <br>
359
+
360
+ **Labeling Method by dataset** <br>
361
+ * Hybrid: Automated, Human, Synthetic, Unknown <br>
362
+
363
+ **Properties**
364
+ * Trained on image captions; general knowledge; image-text pairs; natural images; charts; diagrams; documents; scene descriptions; science diagrams, lessons, textbook data, and question-answer pairs; visual instruction tuning; and mathematical reasoning. <br>
365
+
366
+ ### Evaluation Dataset
367
+ **Link** <br>
368
+ * [See Section 6.1, "Benchmark"](https://arxiv.org/abs/2409.11402) <br>
369
+
370
+ **Data Collection Method by dataset** <br>
371
+ * Human <br>
372
+
373
+ **Labeling Method by dataset** <br>
374
+ * Human <br>
375
+
376
+ **Properties** <br>
377
+ * Evaluated on general knowledge, visual question answering, chart understanding, table understanding, optical character recognition, and mathematical reasoning. <br>
378
+
379
 
380
  ## Correspondence to
381
  Wenliang Dai* ([email protected]), Nayeon Lee* ([email protected]), Boxin Wang* ([email protected]), Zhuolin Yang* ([email protected]), Wei Ping* ([email protected])
 
392
  </pre>
393
 
394
 
395
  ## Ethical Considerations
396
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
397