Llama-3.2-11B-Vision-Instruct-TagRater

The Llama-3.2-11B-Vision-Instruct-TagRater is a merged multi-modal model designed to rate images based on a provided tagword. By combining visual and language understanding, this model evaluates an image against a rating prompt and produces a concise explanation along with a relevance rating from 0 to 5.

Model Details

  • Base Model: unsloth/Llama-3.2-11B-Vision-Instruct
  • Architecture: FastVisionModel
  • Fine-Tuning: Utilizes LoRA fine-tuning on vision layers, language layers, attention modules, and MLP modules.
  • Quantization: Loaded in 4-bit mode for improved memory efficiency.
  • Merged Model: Fully merged and ready to use without any additional assembly steps.

Training Overview

  • Data Preparation: Training images were resized to 512×512 pixels. Each sample pairs an image with a prompt that directs the model to evaluate how well the image matches a given search term.
  • Fine-Tuning Strategy: Both vision and language components were fine-tuned using a supervised fine-tuning (SFT) approach with LoRA parameters (e.g., r=16, lora_alpha=16).
  • Workflow: Training used a conversational data format that combines textual and visual inputs, with gradient checkpointing and the 8-bit AdamW optimizer to reduce memory use; a configuration sketch follows below.
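
The training script itself is not published. The sketch below shows how such a run is typically assembled with Unsloth's FastVisionModel and TRL's SFTTrainer, assuming the standard vision LoRA recipe; apart from r=16 and lora_alpha=16, every hyperparameter and the dataset variables are illustrative placeholders, not the values actually used.

from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit and attach LoRA adapters to the vision,
# language, attention, and MLP modules, as described above.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
)

# Each training sample is a conversation pairing a 512x512 image with the
# rating prompt and the target answer (field names are assumed).
def to_conversation(sample):
    return {"messages": [
        {"role": "user", "content": [
            {"type": "image", "image": sample["image"]},
            {"type": "text", "text": sample["prompt"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["answer"]},
        ]},
    ]}

converted_dataset = [to_conversation(s) for s in raw_samples]  # raw_samples: your (image, prompt, answer) records

FastVisionModel.for_training(model)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,   # illustrative
        gradient_accumulation_steps=4,   # illustrative
        learning_rate=2e-4,              # illustrative
        optim="adamw_8bit",              # 8-bit AdamW, as noted above
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        output_dir="outputs",
    ),
)
trainer.train()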

Text Instruction for Rating

During training, the following text instruction was used to guide the model in rating images based on the provided tagword:

Evaluate how well this image matches the search term: [tagword] . Provide a concise reason and assign a 0–5 relevance score:

Scoring (0–5):
0 – Not Relevant: No connection.
1 – Barely Relevant: Very weak or vague link.
2 – Minimally Relevant: Hints but lacks clarity.
3 – Moderately Relevant: Noticeable link, not the main focus.
4 – Highly Relevant: Strong, clear representation.
5 – Perfectly Relevant: Ideal example.

Content Relevance: Does it clearly relate?
Context & Setting: Does its overall style fit the theme?
Visual Appeal & User Satisfaction: Would users find this image useful or satisfying based on the search term?
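
Since only the tagword varies, the instruction can be assembled programmatically. A small helper along these lines (the constant and function names are illustrative) keeps the wording identical to the prompt used at training time:

RATING_INSTRUCTION = (
    "Evaluate how well this image matches the search term: {tagword} . "
    "Provide a concise reason and assign a 0–5 relevance score: "
    "Scoring (0–5): "
    "0 – Not Relevant: No connection. "
    "1 – Barely Relevant: Very weak or vague link. "
    "2 – Minimally Relevant: Hints but lacks clarity. "
    "3 – Moderately Relevant: Noticeable link, not the main focus. "
    "4 – Highly Relevant: Strong, clear representation. "
    "5 – Perfectly Relevant: Ideal example. "
    "- Content Relevance: Does it clearly relate? "
    "- Context & Setting: Does its overall style fit the theme? "
    "- Visual Appeal & User Satisfaction: Would users find this image "
    "useful or satisfying based on the search term?"
)

def build_prompt(tagword: str) -> str:
    # Substitute the search term into the fixed rating instruction.
    return RATING_INSTRUCTION.format(tagword=tagword)

# build_prompt("space explorer") reproduces the prompt string used in the
# Usage example below.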

Performance Metrics

On a representative inference run, the following performance was observed:

  • Tokens Generated: 233
  • Tokens per Second: 40.65
  • VRAM Allocated: 7834.95 MB
  • VRAM Reserved: 8862.00 MB
  • RAM Usage: 2281.43 MB

These figures give a rough picture of the model's generation speed and memory footprint when loaded in 4-bit; they will vary with GPU, prompt length, and generation settings.
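
Comparable numbers can be collected with standard torch and psutil calls; the sketch below assumes `model` and `inputs` have already been prepared as in the Usage section that follows.

import os
import time

import psutil
import torch

# Time a generation call and report throughput plus memory usage.
start = time.time()
gen_tokens = model.generate(**inputs, max_new_tokens=100, use_cache=True)
elapsed = time.time() - start

new_tokens = gen_tokens.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Tokens generated:  {new_tokens}")
print(f"Tokens per second: {new_tokens / elapsed:.2f}")
print(f"VRAM allocated:    {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print(f"VRAM reserved:     {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
print(f"RAM usage:         {psutil.Process(os.getpid()).memory_info().rss / 1024**2:.2f} MB")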

Usage

from unsloth import FastVisionModel
import base64
import json
import re
from io import BytesIO
from PIL import Image
import torch

# --- Model Initialization ---
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="Pixuai/Llama-3.2-11B-Vision-Instruct-TagRater",
    load_in_4bit=True,
    max_seq_length=150,
)
FastVisionModel.for_inference(model)

# --- Input Preparation ---
# Replace these variables with your actual prompt and base64 encoded image string.
prompt = "Evaluate how well this image matches the search term: space explorer . Provide a concise reason and assign a 0–5 relevance score: Scoring (0–5): 0 – Not Relevant: No connection. 1 – Barely Relevant: Very weak or vague link. 2 – Minimally Relevant: Hints but lacks clarity. 3 – Moderately Relevant: Noticeable link, not the main focus. 4 – Highly Relevant: Strong, clear representation. 5 – Perfectly Relevant: Ideal example. - Content Relevance: Does it clearly relate? - Context & Setting: Does its overall style fit the theme? - Visual Appeal & User Satisfaction: Would users find this image useful or satisfying based on the search term?"
image_b64 = "base64_string_here"  # Replace with a valid base64 encoded image string

try:
    image_data = base64.b64decode(image_b64)
    image = Image.open(BytesIO(image_data)).convert("RGB")
except Exception as e:
    print(f"Error decoding image: {str(e)}")
    exit()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ],
    }
]

# Render the conversation into the model's prompt format (text with an image placeholder).
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

# --- Inference ---
gen_tokens = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True,
    temperature=0.1,
    min_p=0.1,
)

# --- Decoding Output ---
# The decoded text contains the prompt followed by the model's completion;
# the regex below pulls the JSON rating object out of that text.
output_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)

json_match = re.search(r'({.*})', output_text, re.DOTALL)
if json_match:
    json_str = json_match.group(1)
    try:
        json_obj = json.loads(json_str)
    except json.JSONDecodeError:
        print("Invalid JSON output.")
        exit()
    print("JSON Output:", json_obj)
else:
    print("No JSON object found in the output.")

License

MIT
