mjbuehler committed · Commit 13d864f · verified · 1 Parent(s): 919b160

Update README.md


A few edits for improved clarity

Files changed (1)
  1. README.md +19 -15
README.md CHANGED
@@ -45,7 +45,7 @@ This version of Cephalo, lamm-mit/Cephalo-Idefics2-3x8b-beta, is a Mixture-of-Ex
 
 The model has 20b parameters (3 experts, each 8b each, 8b active parameters during inference).
 
- ### Download Idefics-2 MoE Model and Sample inference code
+ ## Download Idefics-2 MoE Model and Sample inference code
 
 ```python
 pip install transformers -U
@@ -74,7 +74,7 @@ moe_model = AutoModelForCausalLM.from_pretrained(
 count_parameters(moe_model)
 ```
 
- Now use downloaded model for inference:
+ Now use the downloaded MoE model for inference:
 
 ```python
 from transformers.image_utils import load_image
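
The inference example that this block opens continues past the hunk boundary. For orientation only, here is a hedged, minimal sketch of an Idefics2-style generation call against the downloaded MoE model; the model ID comes from this card, while `trust_remote_code=True` and the placeholder image/prompt are assumptions rather than the README's own code:

```python
# Hedged sketch only -- the README's full inference block (truncated in this hunk)
# is the authoritative version; trust_remote_code and the sample image are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.image_utils import load_image

DEVICE = "cuda"
model_id = "lamm-mit/Cephalo-Idefics2-3x8b-beta"

moe_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder path; substitute a materials-science image of your own.
image = load_image("./example_image.jpg")

messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What is shown in this image?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = moe_model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```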
@@ -157,9 +157,12 @@ Download models that will form the experts, as well as the base model. As a simp
 2) A chatty version: HuggingFaceM4/idefics2-8b-chatty (model_1) (model_2)
 3) A basic variant: HuggingFaceM4/idefics2-8b (model_3)
 
+ One (or another model) must be used as base model, from which the vision model, connector, self-attention, etc. are used. From the list of models provided as experts, the feed forward layers are used. Each model will become one expert.
+
 ```python
 from transformers import AutoProcessor, Idefics2ForConditionalGeneration , AutoTokenizer
 from transformers import BitsAndBytesConfig
+ from Idefics2_MoE.moe_idefics2 import *
 
 DEVICE='cuda'
 
@@ -210,6 +213,8 @@ model_3.to(DEVICE)
 
 Here we show how a MoE is constructed from the set of expert models loaded earlier. We consider three models, model_1, model_2 and model_3.
 
+ First, we designate the base model (here we use a deep copy of model_1) and the list of experts. We first create a config, then the moe_model. The config is based on the Idefics2 config from model_1, loaded above.
+
 ```python
 dtype = torch.bfloat16 # Desired dtype for new layers
 base_model = copy.deepcopy(model_1) # Your base model
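
The rest of this code block, where the MoE config and `moe_model` are actually created, lies outside the hunk. A hedged sketch of how it might continue; the class names below are illustrative placeholders for whatever `Idefics2_MoE/moe_idefics2.py` (imported above with `import *`) actually exports, so check that file for the real API:

```python
# Hedged continuation sketch; Idefics2ForCausalLMMoEConfig / Idefics2ForCausalLMMoE are
# placeholder names standing in for the classes exported by Idefics2_MoE/moe_idefics2.py.
expert_models = [model_1, model_2, model_3]   # the feed-forward layers of each become one expert

# Build an MoE config from model_1's Idefics2 config, then assemble the MoE model
# from the base model plus the expert list (k = experts routed per token, an assumption).
config = Idefics2ForCausalLMMoEConfig(config=model_1.config, k=1,
                                      num_expert_models=len(expert_models))
moe_model = Idefics2ForCausalLMMoE(config, base_model, expert_models, layer_dtype=dtype)
moe_model.to(DEVICE)

count_parameters(moe_model)   # parameter-count helper defined earlier in the README
```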
@@ -264,14 +269,14 @@ print(generated_texts)
 We train the gating layers by providing sample images/prompts for each of the three experts. Here is a simple example training set:
 
 ```python
- image_1 = Image.open("./VALIDATION/Q15.jpg")
- image_1a = Image.open("./VALIDATION/Q31.jpg")
+ image_1 = Image.open("./Image_1.jpg")
+ image_1a =Image.open("./Image_1b.jpg")
 
- image_2 = Image.open(requests.get("https://media.wired.com/photos/5aa32b912ba43111d1213e0c/master/w_2240,c_limit/akhacouple.jpg", stream=True).raw)
- image_2a = Image.open(requests.get("https://media.wired.com/photos/5aa32b912ba43111d1213e0c/master/w_2240,c_limit/akhacouple.jpg", stream=True).raw)
+ image_2 = Image.open("./Image_2.jpg")
+ image_2a =Image.open("./Image_2b.jpg")
 
- image_3 = Image.open(requests.get("https://i5.walmartimages.com/seo/Amazing-Andrea-Apple-Tree-Seeds-20-Seeds-Grow-Fresh-Apples_ff218043-bcd4-4437-8418-6631d8e97bb3.638ac0120ff05c8913e85ebb74f45f6c.jpeg?odnHeight=640&odnWidth=640&odnBg=FFFFFF", stream=True).raw)
- image_3a = Image.open(requests.get("https://i5.walmartimages.com/seo/Amazing-Andrea-Apple-Tree-Seeds-20-Seeds-Grow-Fresh-Apples_ff218043-bcd4-4437-8418-6631d8e97bb3.638ac0120ff05c8913e85ebb74f45f6c.jpeg?odnHeight=640&odnWidth=640&odnBg=FFFFFF", stream=True).raw)
+ image_3 = Image.open("./Image_3.jpg")
+ image_3a =Image.open("./Image_3b.jpg")
 
 prompts_per_expert = [
    [{"text": "User:<image>What is shown in this image. Explain the importance for materials design.<end_of_utterance>Assistant: The image shows", "image": [image_1]},
@@ -282,21 +287,21 @@ prompts_per_expert = [
    {"text": "User:<image>What is shown in this image, and what does it mean in terms of human history? <end_of_utterance>Assistant: The image shows a historical image of human development.", "image": [image_2a]},
    ],
 
-    [{"text": "User:<image>What is shown in this image. Provide a brief answer. <end_of_utterance>Assistant: This is an apple.", "image": [image_3]},
+    [{"text": "User:<image>What is shown in this image. Provide a brief answer. <end_of_utterance>Assistant: This is an apple, a fruit with good flavor.", "image": [image_3]},
    {"text": "User:<image>What is shown in this image. Brief and concise answer. <end_of_utterance>Assistant: The image shows an apple.", "image": [image_3a]},
    ],
 ]
 
- gating_layer_params = moe_model.train_gating_layer_params_from_hidden_states(processor, prompts_per_expert,
-                                      epochs=1000, loss_steps=100, lr=5e-5, layer_offset=0)
+ gating_layer_params = moe_model.train_gating_layer_params_from_hidden_states(processor,
+                                      prompts_per_expert,
+                                      epochs=1000, loss_steps=100, lr=5e-5, )
 
- # Set parameters for a specific layer
+ # Set parameters for a specific layer
 moe_model.set_gating_layer_params(gating_layer_params)
 ```
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/623ce1c6b66fedf374859fe7/mh4eFDuFsTBOYbjc38PYz.png)
 
-
 Now that the MoE model has been trained, we can try inference. Inference after MoE gating layers are trained:
 
 ```python
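
For intuition about what the trained gating layers do at inference time, here is a generic, self-contained illustration of softmax gating over expert feed-forward blocks. It is not the implementation in `Idefics2_MoE/moe_idefics2.py`, only the underlying idea that a small, trainable gate decides how much each expert contributes:

```python
# Generic illustration of softmax gating over expert feed-forward outputs.
# This is NOT the repository's code; it only mirrors the idea of a trainable gate.
import torch
import torch.nn as nn

class ToyMoEFeedForward(nn.Module):
    def __init__(self, hidden_size: int, ffns: list):
        super().__init__()
        self.experts = nn.ModuleList(ffns)              # one feed-forward block per expert
        self.gate = nn.Linear(hidden_size, len(ffns))   # the trainable gating layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Per-token expert weights: [batch, seq, n_experts]
        weights = torch.softmax(self.gate(hidden_states), dim=-1)
        # Stack expert outputs: [batch, seq, hidden, n_experts]
        expert_out = torch.stack([e(hidden_states) for e in self.experts], dim=-1)
        # Weighted sum over experts
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)

# Tiny smoke test with random feed-forward blocks
ffns = [nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16)) for _ in range(3)]
layer = ToyMoEFeedForward(16, ffns)
print(layer(torch.randn(2, 5, 16)).shape)  # torch.Size([2, 5, 16])
```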
@@ -324,7 +329,7 @@ inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
 generated_ids = moe_model.generate(**inputs, max_new_tokens=500)
 generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
 
- print(generated_texts)
+ print(generated_texts[0])
 ```
 
 ### Push to hub and save locally
@@ -343,7 +348,6 @@ Save locally:
 ```python
 processor.save_pretrained(moe_name, )
 moe_model.save_pretrained(moe_name, )
-
 ```
 
 Loading the model works as done above. Here included again for completeness:
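
The push-to-hub half of this section falls outside the hunk. As a hedged sketch using the standard `push_to_hub` API (the repo ID below is a placeholder; `moe_model`, `processor`, and `moe_name` come from the code above):

```python
# Hedged sketch: publishing the assembled MoE model and processor to the Hub.
# Requires being logged in (e.g. `huggingface-cli login`); the repo ID is a placeholder.
repo_id = "your-username/" + moe_name

moe_model.push_to_hub(repo_id)
processor.push_to_hub(repo_id)
```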
 