zouhsab committed on
Commit d9f11c5 · verified · 1 Parent(s): b902ca2

Delete README.md

Files changed (1)
  1. README.md +0 -311
README.md DELETED
@@ -1,311 +0,0 @@
<h2 align="center"> <a href="https://arxiv.org/abs/2402.14289">TinyLLaVA: A Framework of Small-scale Large Multimodal Models</a> </h2>

<h5 align="center">

[![hf_space](https://img.shields.io/badge/🤗-%20Open%20In%20HF-blue.svg)](https://huggingface.co/bczhou/TinyLLaVA-3.1B) [![arXiv](https://img.shields.io/badge/Arxiv-2402.14289-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2402.14289) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/PKU-YuanGroup/MoE-LLaVA/blob/main/LICENSE)

</h5>

## &#x1F389; News
* **[2024.03.10]** Base recipe out!
* **[2024.03.10]** Finetune scripts out!
* **[2024.02.25]** Updated evaluation scripts and docs!
* **[2024.02.25]** Data descriptions out. Released TinyLLaVA-1.5B and TinyLLaVA-2.0B!
* **[2024.02.24]** Example code on inference and model loading added!
* **[2024.02.23]** Evaluation code and scripts released!
* **[2024.02.21]** Created the [TinyLLaVABench](https://github.com/DLCV-BUAA/TinyLLavaBench) repository on GitHub!
* **[2024.02.21]** Our paper, [TinyLLaVA: A Framework of Small-scale Large Multimodal Models](https://arxiv.org/abs/2402.14289), is out!
* **[2024.01.11]** Our first model, [TinyLLaVA-1.4B](https://huggingface.co/bczhou/tiny-llava-v1-hf), is out!

## &#x231B; TODO
- [ ] Add support for Ollama and llama.cpp.
- [x] Developers' guide / how to build the demo locally.
- [x] Training and custom finetuning docs.
- [x] Model Zoo descriptions.
- [x] Examples and inference.
- [x] Release code for training.
- [x] Add descriptions for evaluation.
- [x] Add descriptions for data preparation.
- [x] Release TinyLLaVA-1.5B and TinyLLaVA-2.0B.
- [x] Release TinyLLaVA-3.1B.
- [x] Release the evaluation code and weights (2024.2.23).

### &#x1F525; High performance, but with fewer parameters

- Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.

## Contents

- [Install](#x1f527-requirements-and-installation)
- [Model Zoo](#x1f433-model-zoo)
- [Demo](#demo)
- [Quick Start](#x1f527-quick-start)
- [Run Inference](#x1f527-run-inference)
- [Evaluation](#evaluation)
- [Data](#data-preparation)
- [Train](#train)
- [Custom Finetune](#custom-finetune)

## &#x1F527; Requirements and Installation

We recommend the following setup.

1. Clone this repository and navigate to the TinyLLaVABench folder
```bash
git clone https://github.com/DLCV-BUAA/TinyLLaVABench.git
cd TinyLLaVABench
```

2. Install the package
```Shell
conda create -n tinyllava python=3.10 -y
conda activate tinyllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training
```Shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

### Upgrade to the latest code base

```Shell
git pull
pip install -e .

# if you see import errors after upgrading, try running the command below (without the leading #)
# pip install flash-attn --no-build-isolation --no-cache-dir
```

## &#x1F433; Model Zoo

### Legacy Model
- [tiny-llava-hf](https://huggingface.co/bczhou/tiny-llava-v1-hf)

### Pretrained Models
- [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B)
- [TinyLLaVA-2.0B](https://huggingface.co/bczhou/TinyLLaVA-2.0B)
- [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B)

### Model Details
| Name | LLM | Checkpoint | LLaVA-Bench-Wild | MME | MMBench | MM-Vet | SQA-image | VQA-v2 | GQA | TextVQA |
|----------------|-----------------|----------------------------------------------------------------|------------------|--------|---------|--------|-----------|--------|------|---------|
| TinyLLaVA-3.1B | Phi-2           | [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B) | 75.8 | 1464.9 | 66.9 | 32.0 | 69.1 | 79.9 | 62.0 | 59.1 |
| TinyLLaVA-2.0B | StableLM-2-1.6B | [TinyLLaVA-2.0B](https://huggingface.co/bczhou/TinyLLaVA-2.0B) | 66.4 | 1433.8 | 63.3 | 32.6 | 64.7 | 78.9 | 61.9 | 56.4 |
| TinyLLaVA-1.5B | TinyLlama       | [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B) | 60.8 | 1276.5 | 55.2 | 25.8 | 60.3 | 76.9 | 60.3 | 51.7 |

## Demo

### Gradio Web Demo

Launch a local web demo by running:
```shell
python tinyllava/serve/app.py --model-path bczhou/TinyLLaVA-3.1B --model-name TinyLLaVA-3.1B
```

### CLI Inference

We also support running inference from the command line. To use our model, run:
```shell
python -m tinyllava.serve.cli \
    --model-path bczhou/TinyLLaVA-3.1B \
    --image-file "./tinyllava/serve/examples/extreme_ironing.jpg"
```

## &#x1F527; Quick Start

<details>
<summary>Load model</summary>

```Python
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path

model_path = "bczhou/TinyLLaVA-3.1B"

# load the tokenizer, model, and image processor for the given checkpoint
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
```
</details>

## &#x1F527; Run Inference

Here's an example of running inference with [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B).
<details>
<summary>Run Inference</summary>

```Python
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

# bundle the generation settings into a simple namespace object
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": "phi",
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
```
</details>

### Important
We use a different `conv_mode` for different models. Replace the `conv_mode` in `args` according to this table:
| model | conv_mode |
|---------------- |----------- |
| TinyLLaVA-3.1B  | phi       |
| TinyLLaVA-2.0B  | phi       |
| TinyLLaVA-1.5B  | v1        |

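For example, to run the inference snippet above with [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B), switch the checkpoint and set `conv_mode` to `v1`; the remaining arguments stay the same (a minimal adaptation of the earlier sketch):

```Python
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-1.5B"  # TinyLLaVA-1.5B uses the "v1" conversation template

args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "What are the things I should be cautious about when I visit here?",
    "conv_mode": "v1",  # "phi" for TinyLLaVA-3.1B / 2.0B, "v1" for TinyLLaVA-1.5B
    "image_file": "https://llava-vl.github.io/static/images/view.jpg",
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
```
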
## Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding.

See [Evaluation.md](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/docs/Evaluation.md).

## Data Preparation

In our paper, we used two different datasets: the [LLaVA dataset](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#pretrain-feature-alignment) and the [ShareGPT4V dataset](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md), and compared their differences. In this section, we provide information on data preparation.

### Pretraining Images
* LLaVA: The pretraining images of LLaVA are from the 558K subset of the LAION-CC-SBU dataset.
* ShareGPT4V: The pretraining images of ShareGPT4V are a mixture of the 558K LAION-CC-SBU subset, the SAM dataset, and the COCO dataset.

### Pretraining Annotations
* LLaVA: The pretraining annotations of LLaVA are [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
* ShareGPT4V: The pretraining annotations of ShareGPT4V are [here](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json).

### SFT Images & Annotations
The two SFT datasets are largely the same, with the exception that the 23K detailed description data in LLaVA-1.5-SFT is replaced with detailed captions randomly sampled from the [100K ShareGPT4V data](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_instruct_gpt4-vision_cap100k.json).

### Download Data

1. Download relevant images

- LAION-CC-SBU-558K: [images.zip](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip)
- COCO: This dataset is from the [COCO 2017 challenge](https://cocodataset.org/). Download: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- WebData: This dataset is curated by the [ShareGPT4V project](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V). Download: [images](https://drive.google.com/drive/folders/1tCUQ-sq6vdshZVkF0ZeF3K4eztkXJgax?usp=sharing). For academic use only.
- SAM: This dataset is collected by [Meta](https://ai.meta.com/datasets/segment-anything-downloads/). Download: [images](https://ai.meta.com/datasets/segment-anything-downloads/). We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from [here](https://drive.google.com/file/d/1dKumdOKSXtV7lIXdrG7jsIK_z2vZv2gs/view?usp=drive_link).
- GQA: [GQA project page](https://cs.stanford.edu/people/dorarad/gqa/about.html). Download: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- OCR-VQA: [OCR-VQA project page](https://ocr-vqa.github.io/). Download: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing). We save all files as `.jpg`.
- TextVQA: [TextVQA project page](https://textvqa.org/). Download: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
- VisualGenome: [Visual Genome project page](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html). Download: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)

2. Download relevant annotations

- LLaVA's pretraining annotations: [blip_laion_cc_sbu_558k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- LLaVA's SFT annotations: [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json)
- ShareGPT4V's pretraining annotations: [share-captioner_coco_lcs_sam_1246k_1107.json](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json)
- ShareGPT4V's SFT annotations: [sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json)

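If you prefer to script these downloads, the annotation JSONs above can also be fetched with `huggingface_hub`. This is a minimal sketch, not part of the TinyLLaVA codebase: it assumes a recent `huggingface_hub` is installed, the repo IDs and filenames come from the links above, and the target folders follow the layout shown in the next subsection.

```Python
from huggingface_hub import hf_hub_download

# (repo_id, filename, target directory) triples taken from the links above
# and the layout shown under "Organize Data"
ANNOTATIONS = [
    ("liuhaotian/LLaVA-Pretrain", "blip_laion_cc_sbu_558k.json", "data/llava/llava_pretrain"),
    ("liuhaotian/LLaVA-Instruct-150K", "llava_v1_5_mix665k.json", "data/text_files"),
    ("Lin-Chen/ShareGPT4V", "share-captioner_coco_lcs_sam_1246k_1107.json", "data/text_files"),
    ("Lin-Chen/ShareGPT4V", "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json", "data/text_files"),
]

for repo_id, filename, local_dir in ANNOTATIONS:
    path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",  # all of the above live in dataset repos
        local_dir=local_dir,
    )
    print(f"downloaded {filename} -> {path}")
```
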
### Organize Data

Organize the image files and annotation files as follows in `path/to/your/data`:

```none
data
├── llava
│   ├── llava_pretrain
│   │   ├── images
│   │   ├── blip_laion_cc_sbu_558k.json
├── coco
│   ├── train2017
├── sam
│   ├── images
├── gqa
│   ├── images
├── ocr_vqa
│   ├── images
├── textvqa
│   ├── train_images
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
├── share_textvqa
│   ├── images
├── web-celebrity
│   ├── images
├── web-landmark
│   ├── images
├── wikiart
│   ├── images
├── text_files
│   ├── llava_v1_5_mix665k.json
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
```

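Before launching training, it can help to sanity-check that the expected folders and annotation files are in place. The sketch below is purely illustrative: the path list mirrors the tree above, and `DATA_ROOT` is a placeholder for your own `path/to/your/data`.

```Python
from pathlib import Path

DATA_ROOT = Path("path/to/your/data")  # replace with your own data root

# entries mirror the directory tree shown above
EXPECTED = [
    "llava/llava_pretrain/images",
    "llava/llava_pretrain/blip_laion_cc_sbu_558k.json",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "text_files/llava_v1_5_mix665k.json",
    "text_files/share-captioner_coco_lcs_sam_1246k_1107.json",
    "text_files/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]

missing = [rel for rel in EXPECTED if not (DATA_ROOT / rel).exists()]
for rel in missing:
    print(f"missing: {DATA_ROOT / rel}")
print("all expected paths found" if not missing else f"{len(missing)} paths missing")
```
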
## Train

**In this section, we describe the base recipe.**

### Hyperparameters
The hyperparameters used in pretraining and finetuning are provided below.

1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------| ---: | ---: | ---: |-----------:| ---: |
| TinyLLaVA-3.1B | 256 | 1e-3 | 1 | 3072 | 0 |

2. Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------| ---: | ---: | ---: |-----------:| ---: |
| TinyLLaVA-3.1B | 128 | 2e-5 | 1 | 3072 | 0 |

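When adapting the recipe to different hardware, keep in mind that the global batch size is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs (the usual convention for DeepSpeed / Hugging Face Trainer setups). The values below are assumptions chosen for illustration, not the exact settings shipped in the training scripts.

```Python
# Illustrative only: these per-device values are assumptions, not the
# settings from pretrain.sh / finetune.sh.
num_gpus = 8
per_device_batch_size = 16
gradient_accumulation_steps = 2

global_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps
print(global_batch_size)  # 256, matching the pretraining table above
```
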
### Pretrain

**Replace the paths with your own paths.**

Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/scripts/tiny_llava/pretrain.sh).

### Finetune

**Replace the paths with your own paths.**

Training script with DeepSpeed ZeRO-3: [`finetune.sh`](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/scripts/tiny_llava/finetune.sh).

## Custom Finetune

Check out our guide to custom finetuning with LoRA [here](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/dev/docs/CUTOM_FINETUNE.md).

## &#x270F; Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.

```BibTeX
@misc{zhou2024tinyllava,
      title={TinyLLaVA: A Framework of Small-scale Large Multimodal Models},
      author={Baichuan Zhou and Ying Hu and Xi Weng and Junlong Jia and Jie Luo and Xien Liu and Ji Wu and Lei Huang},
      year={2024},
      eprint={2402.14289},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

## ❤️ Community efforts
* Our codebase is built upon the [LLaVA](https://github.com/haotian-liu/LLaVA) project. Great work!
* Our project uses data from the [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V) project. Great work!