---
license: other
license_name: intel-research-use-license
license_link: LICENSE
---
# LLaVA-Llama3 Model Card
_This model card corresponds to the instruction-tuned 8B version of the model with the CLIP-based vision encoder._
## Overview
`llava-llama-3-8b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744) with the 8-billion-parameter [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model as the language backbone.
## Uses
The model has been fine-tuned for multimodal benchmark evaluations but can also be used as a multimodal chatbot.
## Bias, Risks, and Limitations
This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
## Training Details
The `llava-llama-3-8b` model was trained on a 4-node cluster with a total of 32 Gaudi 2 accelerators.
### Training Data
The model was trained on the LLaVA-v1.5 data mixture, which consists of:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
## Evaluation
| Benchmark | Score |
|-----------|----------------------|
| ScienceQA | 72.9797 |
| MMVet | 31.9725 |
| llavaw | 56.9/61.9/73.6/65.7 |
| POPE | Acc 87.33, F1 86.5 |
| GQA | 60.6138 |
| MMVP | 36 |
## License
The weights are released under the Intel Research Use License Agreement (see the LICENSE file).

All usage code is licensed under Apache 2.0.
## Usage
Please note that we provide only the difference between the trained weights and the base weights, not a copy of the base [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model. Any use of these weights requires a separate download of the base model.
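To make the idea of a weight difference concrete, here is a minimal sketch on toy tensors. It is not the release procedure: two small `nn.Linear` layers are hypothetical stand-ins for the base and trained models, and the assumption that the published difference was formed by plain subtraction is ours.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical): the real models are far larger.
base = nn.Linear(4, 2)     # plays the role of the base Llama 3 model
trained = nn.Linear(4, 2)  # plays the role of the trained language model

base_sd = base.state_dict()
trained_sd = trained.state_dict()

# Assumed form of a difference checkpoint: trained minus base, key by key.
delta = {key: trained_sd[key] - base_sd[key] for key in trained_sd}

# What a user does after downloading the base model: add it back.
restored = {key: delta[key] + base_sd[key] for key in delta}

# The round trip recovers the trained weights up to floating-point rounding.
for key in restored:
    assert torch.allclose(restored[key], trained_sd[key], atol=1e-6)
```

The full script below performs the same addition on the real checkpoints.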
```python
# Copyright 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForPreTraining
import transformers


def expand2square(pil_img, background_color):
    """Pad a PIL image to a square canvas, centering it on the shorter axis."""
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result


def add_model_a_to_b(model_a, model_b):
    """Add model_a's weights to model_b's weights in place, key by key."""
    state_dict_a = model_a.state_dict()
    state_dict_b = model_b.state_dict()

    # Ensure keys match before addition
    if set(state_dict_a.keys()) != set(state_dict_b.keys()):
        raise ValueError("Model state dicts do not have the same keys.")

    for key in state_dict_a:
        if state_dict_a[key].shape != state_dict_b[key].shape:
            raise ValueError(f"Shape mismatch for key '{key}': {state_dict_a[key].shape} vs {state_dict_b[key].shape}")
        # Add model_a's weights to model_b's for the matching key
        state_dict_b[key] = state_dict_b[key] + state_dict_a[key]

    # Update model_b with the new weights
    model_b.load_state_dict(state_dict_b)


output_checkpoint = ""  # set to a path to save the merged model and skip merging next time
hf_checkpoint = "Intel/llava-llama-3-8b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(hf_checkpoint)
model = AutoModelForPreTraining.from_pretrained(hf_checkpoint)

# A zero-sum last embedding row is treated as the sign that the base weights have not been added yet
if model.language_model.model.embed_tokens.weight[-1].sum() == 0:
    print("adding llama3 weights")
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="cpu",
    )
    llama3 = pipeline.model
    add_model_a_to_b(llama3, model.language_model)
    if output_checkpoint:
        print("saving weights, so no adding is needed again")
        model.save_pretrained(output_checkpoint)

model.to(device)

prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# The original LLaVA pads with the mean color; HF LLaVA pads with zeros
image = expand2square(image, tuple(int(x*255) for x in processor.image_processor.image_mean))
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate; max_new_tokens limits only the generated text, independent of the prompt length
generate_ids = model.generate(**inputs, max_new_tokens=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```
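
The padding step in the script above can be checked without loading an image. The sketch below is illustrative only: `square_padding` is a hypothetical helper that reproduces the geometry of `expand2square`, and the mean values are the standard CLIP normalization constants, assumed here to match what `processor.image_processor.image_mean` returns for this checkpoint.

```python
def square_padding(width, height):
    """Return the square side and the (x, y) paste offset that centers the image."""
    side = max(width, height)
    return side, ((side - width) // 2, (side - height) // 2)


# Standard CLIP per-channel mean in [0, 1] (assumed; the script reads it from the processor).
CLIP_IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)

# Same conversion as in the script: scale to 0..255 and truncate to int.
background = tuple(int(x * 255) for x in CLIP_IMAGE_MEAN)

print(square_padding(640, 480))  # (640, (0, 80))
print(square_padding(480, 640))  # (640, (80, 0))
print(background)                # (122, 116, 104)
```

Padding with the mean color rather than zeros means the border is close to zero after the processor normalizes the image, which is the behavior the comment in the script attributes to the original LLaVA.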