---
license: other
license_name: intel-research-use-license
license_link: LICENSE
---

# LLaVA-Llama3 Model Card

_This model card corresponds to the instruction tuned 8B version of the model with the CLIP-based vision encoder._


## Overview

`llava-llama-3-8b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744) with the 8-billion parameter [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model as language backbone.

## Uses

The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.

## Bias, Risks, and Limitations

This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.

## Training Details

The `llava-llama-3-8b` model was trained on a 4 node cluster with a total of 32 Gaudi 2 accelerators.

### Training Data

The model was trained using the LLaVA-v1.5 data mixture.

This is listed as follows:

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.

## Evaluation

| Model    | Metrics          |
|----------|------------------|
| ScienceQA| 72.9797          |
| MMVet    | 31.9725          |
| llavaw   | 56.9/61.9/73.6/65.7 |
| Pope Acc | 87.33, F1 86.5   |
| GQA      | 60.6138          |
| MMVP     | 36               |

## License
The weights are released under the Intel Research Use License Agreement (see LICENSE file)  
All usage code is licensed Apache 2.0

## Usage

Please note, we only provide the trained weights difference and do not provide a copy of the base [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model.  Any use of these weights requires a separate download of the base model.

```
# Copyright 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForPreTraining
import transformers

def expand2square(pil_img, background_color):
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result

def add_model_a_to_b(model_a, model_b):
    state_dict_a = model_a.state_dict()
    state_dict_b = model_b.state_dict()

    # Ensure keys match before subtraction
    if set(state_dict_a.keys()) != set(state_dict_b.keys()):
        raise ValueError("Model state dicts do not have the same keys.")

    for key in state_dict_a:
        if state_dict_a[key].shape != state_dict_b[key].shape:
            raise ValueError(f"Shape mismatch for key '{key}': {state_dict_a[key].shape} vs {state_dict_b[key].shape}")
        # Subtract model_a's weights from model_b for the matching key
        state_dict_b[key] = state_dict_b[key] + state_dict_a[key]

    # Update model_b with the new weights
    model_b.load_state_dict(state_dict_b)

output_checkpoint = "" # set if you don't want to merge every time
hf_checkpoint = "Intel/llava-llama-3-8b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(hf_checkpoint)
model = AutoModelForPreTraining.from_pretrained(hf_checkpoint)
if model.language_model.model.embed_tokens.weight[-1].sum() == 0:
    print("adding llama3 weights")
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="cpu",
    )
    llama3 = pipeline.model
    add_model_a_to_b(llama3, model.language_model)
    if output_checkpoint:
        print("saving weights, so no adding is needed again")
        model.save_pretrained(output_checkpoint)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True
)

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

#original llava pads with mean, HF llava pads with zeros
image = expand2square(image, tuple(int(x*255) for x in processor.image_processor.image_mean)) 
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```