VictorSanh committed on
Commit a2ee636 · 1 Parent(s): e434afa

add instruct model card

  1. README.md +89 -8
README.md CHANGED
@@ -60,7 +60,11 @@ The following screenshot is an example of interaction with the instructed model:
60
 
61
  # How to Get Started with the Model
62
 
63
- Use the code below to get started with the model.
64
 
65
  ```python
66
  import torch
@@ -93,10 +97,50 @@ for i, t in enumerate(generated_text):
93
 
94
  To quickly test your software without waiting for the huge model to download/load you can use `HuggingFaceM4/tiny-random-idefics` - it hasn't been trained and has random weights but it is very useful for quick testing.
95
 
96
- This [tutorial](https://github.com/huggingface/notebooks/pull/418/) shows a simple example to fine-tune IDEFICS on custom data. This [colab notebook](TODO) showcases how to do the fine-tuning in 4bits precision. TODO: change to the correct link once it's merged.
97
 
98
  # Training Details
99
 
 
 
100
  We closely follow the training procedure laid out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
101
 
102
  The model is trained on the following data mixture of openly accessible English data:
@@ -123,7 +167,7 @@ Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we
123
  The training objective is the standard next token prediction.
124
 
125
  We use the following hyperparameters and training parameters:
126
- | Parameters | | IDEFICS | IDEFICS-9b |
127
  | -- | -- | -- | -- |
128
  | Perceiver Resampler | Number of Layers | 6 | 6 |
129
  | | Number of Latents | 64 | 64 |
@@ -147,9 +191,46 @@ We use the following hyper and training parameters:
147
  | | Precision | Mixed-precision bf16 | Mixed-precision bf16 |
148
  | | ZeRO Optimization | Stage 3 | Stage 3 |
149
 
150
 
151
  # Evaluation
152
 
 
 
153
  We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.
154
 
155
  We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.
@@ -201,17 +282,17 @@ Fairness Evaluations:
201
  | | 16 | 95.8 | 43.0 | 46.1 |
202
  | | 32 | 96.1 | 35.1 | 44.9 |
203
 
 
204
 
 
205
 
206
- # Technical Specifications
207
 
208
- - **Hardware Type:** 64 nodes of 8x 80GB A100 gpus, EFA network
209
- - **Hours used:** ~672 node hours
210
- - **Cloud Provider:** AWS Sagemaker
211
 
212
  ## Hardware
213
 
214
- The training was performed on an AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network which provides about 340GBps throughput.
215
 
216
  ## Software
217
 
 
60
 
61
  # How to Get Started with the Model
62
 
63
+ This [tutorial](https://github.com/huggingface/notebooks/pull/418/) shows a simple example of fine-tuning IDEFICS on custom data. This [colab notebook](TODO) showcases how to do the fine-tuning in 4-bit precision. TODO: change to the correct link once it's merged.
64
+
65
+ We provide quick-start code for both the base and the instruct models.
66
+
67
+ Use the code below to get started with the base model.
68
 
69
  ```python
70
  import torch
 
97
 
98
  To quickly test your software without waiting for the huge model to download/load you can use `HuggingFaceM4/tiny-random-idefics` - it hasn't been trained and has random weights but it is very useful for quick testing.
99
 
100
+ Use the code below to get started with the instruct model:
101
+ ```python
102
+ import torch
103
+ from transformers import IdeficsForVisionText2Text, AutoProcessor
104
+
105
+ device = "cuda" if torch.cuda.is_available() else "cpu"
106
+
107
+ checkpoint = "HuggingFaceM4/idefics-9b-instruct"
108
+ model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
109
+ processor = AutoProcessor.from_pretrained(checkpoint)
110
+
111
+ # We feed the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
112
+ prompts = [
113
+ [
114
+ "User: What is in this image?",
115
+ "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
116
+ "<end_of_utterance>",
117
+
118
+ "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
119
+
120
+ "\nUser:",
121
+ "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
122
+ "And who is that?<end_of_utterance>",
123
+
124
+ "\nAssistant:",
125
+ ],
126
+ ]
127
+
128
+ # --batched mode
129
+ inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
130
+ # --single sample mode
131
+ # inputs = processor(prompts[0], return_tensors="pt").to(device)
132
+ exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
133
+
134
+ generated_ids = model.generate(**inputs, eos_token_id=exit_condition, max_length=100)
135
+ generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
136
+ for i, t in enumerate(generated_text):
137
+     print(f"{i}:\n{t}\n")
138
+ ```
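The prompt in the example above follows a simple dialogue layout: every turn starts with a role prefix (`User:` or `\nAssistant:`), ends with `<end_of_utterance>`, and the list is left open on `\nAssistant:` for the model to complete. As a minimal sketch, the hypothetical helper `build_prompt` below (it is not part of `transformers`) assembles such a list from structured turns. The `example.com` URLs are placeholders, and merging adjacent text pieces into one string is a design choice of this sketch rather than a requirement stated in the model card.

```python
END_OF_UTTERANCE = "<end_of_utterance>"


def build_prompt(turns, next_role="Assistant"):
    """Build an interleaved prompt list from (role, items) turns.

    Each item is ("text", string) or ("image", url_or_pil_image).
    Adjacent text pieces are merged, so the resulting list alternates
    between text strings and images.
    """
    prompt = []
    last_was_text = False

    def add(piece, is_text):
        nonlocal last_was_text
        if is_text and last_was_text:
            prompt[-1] += piece  # extend the previous text string
        else:
            prompt.append(piece)
        last_was_text = is_text

    for index, (role, items) in enumerate(turns):
        # Every turn after the first starts on a new line.
        add(f"{role}:" if index == 0 else f"\n{role}:", True)
        for kind, value in items:
            if kind == "image":
                add(value, False)
            else:
                # Put a space between this text and the text right before it.
                add(f" {value}" if last_was_text else value, True)
        add(END_OF_UTTERANCE, True)
    # Leave the prompt open so the model generates the next turn.
    add(f"\n{next_role}:", True)
    return prompt


turns = [
    ("User", [("text", "What is in this image?"), ("image", "https://example.com/dog.jpg")]),
    ("Assistant", [("text", "A small white dog.")]),
    ("User", [("image", "https://example.com/other.jpg"), ("text", "And who is that?")]),
]
prompt = build_prompt(turns)
for piece in prompt:
    print(repr(piece))
```

The returned list can be wrapped in an outer list and passed to the processor in place of the hand-written `prompts` above.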
139
 
140
  # Training Details
141
 
142
+ ## IDEFICS base
143
+
144
  We closely follow the training procedure laid out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
145
 
146
  The model is trained on the following data mixture of openly accessible English data:
 
167
  The training objective is the standard next token prediction.
168
 
169
  We use the following hyperparameters and training parameters:
170
+ | Parameters | | IDEFICS-80b | IDEFICS-9b |
171
  | -- | -- | -- | -- |
172
  | Perceiver Resampler | Number of Layers | 6 | 6 |
173
  | | Number of Latents | 64 | 64 |
 
191
  | | Precision | Mixed-precision bf16 | Mixed-precision bf16 |
192
  | | ZeRO Optimization | Stage 3 | Stage 3 |
193
 
194
+ ## IDEFICS-instruct
195
+
196
+ We start from the base IDEFICS models and fine-tune them by unfreezing all the parameters (vision encoder, language model, cross-attentions). The fine-tuning mixture is composed of the following English datasets:
197
+
198
+ | Data Source | Data Description | Number of unrepeated samples | Sampling ratio |
199
+ |-------------|----------------------------------------------|------------------------------|----------------|
200
+ | [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) | Prompted image-text academic datasets | 1.5M | 7.7% |
201
+ | [LRV-Instruction](https://huggingface.co/datasets/VictorSanh/LrvInstruction) | Triplets of image/question/answer | 155K | 1.7% |
202
+ | [LLaVA-Instruct](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | Dialogues of question/answers grounded on an image | 158K | 5.9% |
203
+ | [LLaVAR-Instruct](https://huggingface.co/datasets/SALT-NLP/LLaVAR) | Dialogues of question/answers grounded on an image with a focus on images containing text | 15.5K | 6.3% |
204
+ | [SVIT](https://huggingface.co/datasets/BAAI/SVIT) | Triplets of image/question/answer | 3.2M | 11.4% |
205
+ | [Spot Difference](TODO) | Triplets of image/question/answer | 158K | 2.1% |
206
+ | [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) | Multi-turn text-only dialogue | 1.5M | 29.1% |
207
+
208
+ We note that all these datasets were obtained by using ChatGPT/GPT-4 in one way or another.
209
+
210
+ Additionally, we found it beneficial to include the pre-training data in the fine-tuning with the following sampling ratios: 5.1% of image-text pairs and 31.0% of multimodal web documents.
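To make the mixture concrete, here is a minimal sketch of ratio-weighted source sampling using the figures listed above. It assumes the 31.0 figure for multimodal web documents is a percentage, and it only illustrates how such ratios could drive a sampler; the names `SAMPLING_RATIOS`, `normalized_weights`, and `sample_sources` are hypothetical and this is not the actual training data loader. The listed ratios add up to roughly 100.3%, presumably because of rounding, so the sketch normalizes them.

```python
import random

# Sampling ratios (in %) as listed in the model card.
# Assumption: the 31.0 figure is a percentage like the others.
SAMPLING_RATIOS = {
    "M3IT": 7.7,
    "LRV-Instruction": 1.7,
    "LLaVA-Instruct": 5.9,
    "LLaVAR-Instruct": 6.3,
    "SVIT": 11.4,
    "Spot Difference": 2.1,
    "UltraChat": 29.1,
    "image-text pairs": 5.1,
    "multimodal web documents": 31.0,
}


def normalized_weights(ratios):
    """Rescale the ratios so that they sum to exactly 1."""
    total = sum(ratios.values())
    return {name: value / total for name, value in ratios.items()}


def sample_sources(ratios, num_draws, seed=0):
    """Draw dataset names with probability proportional to their ratio."""
    rng = random.Random(seed)
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=num_draws)


weights = normalized_weights(SAMPLING_RATIOS)
draws = sample_sources(SAMPLING_RATIOS, num_draws=10_000)
print(f"sum of listed ratios: {sum(SAMPLING_RATIOS.values()):.1f}%")
print(f"share of UltraChat draws: {draws.count('UltraChat') / len(draws):.3f}")
```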
211
+
212
+ The training objective is the standard next token prediction. We use the following hyperparameters and training parameters:
213
+ | Parameters | | IDEFICS-80b-instruct | IDEFICS-9b-instruct |
214
+ | -- | -- | -- | -- |
215
+ | Training | Sequence Length | 2048 | 2048 |
216
+ | | Effective Batch Size (# of tokens) | 613K | 205K |
217
+ | | Max Training Steps | 22K | 22K |
218
+ | | Weight Decay | 0.1 | 0.1 |
219
+ | | Optimizer | Adam(0.9, 0.999) | Adam(0.9, 0.999) |
220
+ | | Gradient Clipping | 1.0 | 1.0 |
221
+ | | [Z-loss](https://huggingface.co/papers/2204.02311) weight | 0. | 0. |
222
+ | Learning Rate | Initial Max | 3e-6 | 1e-5 |
223
+ | | Initial Final | 3.6e-7 | 1.2e-6 |
224
+ | | Decay Schedule | Linear | Linear |
225
+ | | Linear warmup Steps | 1K | 1K |
226
+ | Large-scale Optimization | Gradient Checkpointing | True | True |
227
+ | | Precision | Mixed-precision bf16 | Mixed-precision bf16 |
228
+ | | ZeRO Optimization | Stage 3 | Stage 3 |
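
The learning-rate rows of the table describe a linear warmup followed by a linear decay. As a rough sketch of that shape, the hypothetical function `learning_rate` below uses the IDEFICS-9b-instruct values. It assumes that warmup starts from zero and that the final value is reached exactly at the maximum step, neither of which is stated in the table.

```python
def learning_rate(step, max_lr, final_lr, warmup_steps=1_000, max_steps=22_000):
    """Linear warmup to max_lr, then linear decay to final_lr at max_steps."""
    if step < warmup_steps:
        # Assumption: warmup starts from a learning rate of zero.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, capped at 1.
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return max_lr + progress * (final_lr - max_lr)


# Values from the table for IDEFICS-9b-instruct.
for step in (0, 500, 1_000, 8_000, 22_000):
    print(step, learning_rate(step, max_lr=1e-5, final_lr=1.2e-6))
```

If "K" in the table means thousand, an effective batch of 205K tokens at a sequence length of 2048 corresponds to roughly 100 sequences per step, and 613K tokens to roughly 300.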
229
 
230
  # Evaluation
231
 
232
+ ## IDEFICS base
233
+
234
  We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.
235
 
236
  We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.
 
282
  | | 16 | 95.8 | 43.0 | 46.1 |
283
  | | 32 | 96.1 | 35.1 | 44.9 |
284
 
285
+ ## IDEFICS-instruct
286
 
287
+ As with the base IDEFICS models, we performed checkpoint selection to decide when to stop training. Because the M3IT training set contains a handful of the benchmarks we evaluate on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark for checkpoint selection. We select the checkpoint at step 3,000 for IDEFICS-80b-instruct and at step 8,000 for IDEFICS-9b-instruct.
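Checkpoint selection on a held-out benchmark amounts to picking the saved step with the best validation score. A minimal sketch, assuming one scalar score per checkpoint where higher is better; the helper `select_checkpoint` is hypothetical, and the scores below are made-up placeholders, not MMBench results.

```python
def select_checkpoint(validation_scores):
    """Return the (step, score) pair with the highest held-out score.

    Steps are visited in increasing order, so ties go to the earliest step.
    """
    return max(sorted(validation_scores.items()), key=lambda item: item[1])


# Hypothetical placeholder scores keyed by training step (NOT real results).
scores = {100: 0.41, 200: 0.47, 300: 0.52, 400: 0.50, 500: 0.52}
best_step, best_score = select_checkpoint(scores)
print(f"selected step {best_step} with score {best_score}")
```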
288
 
289
+ TODO: tables comparing IDEFICS vs IDEFICS-instruct.
290
 
291
+ # Technical Specifications
 
 
292
 
293
  ## Hardware
294
 
295
+ The IDEFICS models were trained on an AWS SageMaker cluster using at most 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network. IDEFICS-80b was trained for approximately 672 node hours. IDEFICS-80b-instruct was trained for approximately 3 days on 48 nodes.
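The GPU count and the instruct fine-tuning budget follow from simple arithmetic. The sketch below takes "approximately 3 days on 48 nodes" at face value as 72 hours of wall-clock time on all 48 nodes, which is an assumption; the names are illustrative only.

```python
GPUS_PER_NODE = 8


def gpu_hours(nodes, hours):
    """Total GPU hours for a run occupying `nodes` nodes for `hours` hours."""
    return nodes * hours * GPUS_PER_NODE


max_gpus = 64 * GPUS_PER_NODE              # largest configuration used
instruct_node_hours = 48 * 3 * 24          # 48 nodes for about 3 days
instruct_gpu_hours = gpu_hours(48, 3 * 24)

print(f"max GPUs: {max_gpus}")
print(f"IDEFICS-80b-instruct: ~{instruct_node_hours} node hours, ~{instruct_gpu_hours} GPU hours")
```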
296
 
297
  ## Software
298