Fine-tuning your model
Hello! Great work on your fine-tuned Idefics3 model. I am trying to fine-tune Idefics3 as well using https://github.com/merveenoyan/smol-vision/blob/main/Idefics_FT.ipynb but am running into several errors. If possible, would you be able to share your fine-tuning code so I can compare and see what I am doing wrong?
Hi! After a brief look at your link: you should probably install the dev version of transformers from the official repo instead of the forked one, since the relevant changes have already been merged there with several improvements. Also, that notebook contains some node-specific CUDA device mapping that should be skipped or changed for your hardware, and training of the visual part is disabled in it. Other than that, it should work.
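For example, a minimal sketch of those two fixes (the "vision_model" parameter name is an assumption based on the public Idefics3 implementation; check model.named_parameters() on your side):
# install the dev transformers from the official repo instead of the fork:
#   pip install git+https://github.com/huggingface/transformers
# after loading the model, re-enable training of the visual part:
for name, param in model.named_parameters():
    if "vision_model" in name:
        param.requires_grad = True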
The code I used is actually very simple and built from the examples, except for the custom collator. I run everything in Docker; here are the important parts:
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS base
ARG DEBIAN_FRONTEND=noninteractive
RUN apt update
RUN apt install -y git git-lfs procps libsndfile1-dev tesseract-ocr python3 python3-pip python3-tk python3.10-venv ffmpeg libsm6 libxext6 libgl1 p7zip-full libaio-dev mc bash wget curl
...
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install git+https://github.com/huggingface/transformers
RUN pip3 install accelerate datasets peft bitsandbytes
RUN pip3 install flash-attn --no-build-isolation
...
Training code:
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration, AutoConfig
from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True

import os
token = os.environ['HF_TOKEN']
repo = os.environ['UPLOAD_REPO']
dataset_path = '...'
model_id = "..."

processor = AutoProcessor.from_pretrained(
    model_id
)

with torch.autocast(device_type='cuda'):
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
        device_map="auto",
        use_cache=False
    )

image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")]

from datasets import load_dataset, load_from_disk
import random

train_dataset = load_from_disk(dataset_path)

def collate_fn(examples):
    # custom collator related to my dataset structure
    return batch

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=...,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=10,
    gradient_checkpointing=True,
    warmup_steps=40,
    learning_rate=...,
    weight_decay=0.01,
    logging_steps=100,
    save_strategy="steps",
    lr_scheduler_type="cosine",
    save_steps=...,
    save_total_limit=3,
    optim="adamw_8bit",
    bf16=True,
    push_to_hub=True,
    hub_model_id=repo,
    output_dir="...",
    hub_strategy='all_checkpoints',
    hub_token=token,
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_dataset,
)

trainer.train()
trainer.push_to_hub()
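For context, here is a minimal sketch of what such a collator can look like for Idefics3, along the lines of the smol-vision notebook; the dataset field names ("image", "answer") and the fixed user prompt are assumptions you would replace with your own structure:
def collate_fn(examples):
    texts = []
    images = []
    for example in examples:
        # assumed fields: example["image"] is a PIL image, example["answer"] is the target text
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe the image."},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": example["answer"]},
            ]},
        ]
        texts.append(processor.apply_chat_template(messages, add_generation_prompt=False))
        images.append([example["image"]])

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    labels[labels == image_token_id] = -100                    # ignore image placeholder tokens in the loss
    batch["labels"] = labels
    return batch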
Hi! Thanks so much for replying!
I am running my model on Colab. Unfortunately, Google Colab limits me to 40 GB of GPU memory, so when I tried fine-tuning the way you proposed I ran out of memory, even after adjusting the training arguments. I've also tried running it on Massed Compute, but for some reason flash attention won't work on the A100 GPUs I am selecting.
Because of the GPU limits, I've been trying to fine-tune with QLoRA, which fits within my memory constraints; however, when pushing the model to Hugging Face there's a message saying the PEFT task type is invalid, and I can't figure out why it says that even when I add a task type to my code.
Any suggestions?
Out of curiosity, what VM are you using to run your programs?
This is my code: https://colab.research.google.com/drive/1B4CJ8o8vHSMf0qIib0vKIHFCmj5wQoSN?usp=sharing and these are the models I've made but haven't had much success: https://huggingface.co/justinkarlin
Any input or help would be so appreciated as I've tried scouring the web and haven't had much luck.
Thank you!!
Well, full-weight training of an 8.5B model on a 40 GB GPU may be just about possible, but you would have to rely on lower precision (full bf16 instead of mixed), optimizers with lower memory requirements, and DeepSpeed. The last one needs a lot of RAM. I'm not sure I'm qualified enough to give specific advice here.
PEFT (LoRA) should fit into 40 GB, so it can be an option for you. QLoRA can be trained even on a desktop GPU, but the results will be worse.
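If you go the QLoRA route, this is roughly how the base model is loaded in 4-bit before attaching the adapter (a sketch; the BitsAndBytesConfig settings here are common defaults, not taken from your notebook):
import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,  # weights are loaded in 4-bit, which is what lets QLoRA fit in less memory
    device_map="auto",
)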
Pushing the model to HF isn't related to VRAM or the training code; something is just wrong with the upload parameters, and it can be fixed easily. In your repos I see successfully uploaded adapters, did you manage to solve it?
I use my own hardware and rent H100s when needed. Most providers use a Docker manager, so you can build anything you need from NVIDIA's linux-cuda images, as in the example above.
Here is code for a PEFT adapter that worked on a 48 GB GPU; to make it push to HF you can add the corresponding lines from the code above (see the sketch after the script).
import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration, AutoConfig

model_id = "./model"
USE_QLORA = False  # assumed value; the original script defines this flag elsewhere (False means plain LoRA/DoRA, not QLoRA)

processor = AutoProcessor.from_pretrained(
    model_id
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
    use_dora=False if USE_QLORA else True,
    init_lora_weights="gaussian"
)
lora_config.inference_mode = False

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
    device_map="auto",
    use_cache=False
)

model.add_adapter(lora_config)
model.enable_adapters()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
print(model.get_nb_trainable_parameters())

from datasets import load_dataset, load_from_disk
import random

train_dataset = load_from_disk('./mm_ds')

image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")]

def collate_fn(examples):
    """
    """
    return batch

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    warmup_steps=40,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    lr_scheduler_type="cosine",
    save_steps=100,
    save_total_limit=4,
    optim="paged_adamw_8bit",
    bf16=True,
    output_dir="./idefics3_adapter_test",
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_dataset,
)

trainer.train()
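As noted above, to push this adapter to the Hub you can add the same upload options used in the first script, roughly:
# in TrainingArguments:
#   push_to_hub=True, hub_model_id=repo, hub_strategy='all_checkpoints', hub_token=token
# and after training:
trainer.push_to_hub()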
Hmm, PEFT (LoRA) didn't fit within my memory constraints, so I guess I need to stick with QLoRA.
Thanks for taking a look at my repository. I guess I was thrown off by the message shown when clicking "Use this model" → PEFT, and assumed it was an error.
I guess my real issue is I have no idea what I'm doing.
For running my model, what did you do for the auto-processor?
Did you just create a pre-processor config file?
Could I alternatively use the below processor?
processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3") or was I supposed to have an autoprocessor made from my fine-tuned model? (Again I have no idea what I'm doing)
Below is what I'm doing to see if my fine-tuned model works. (is this even correct?)
Thank you for taking me on as a charity case!!
=====
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")  # or do I need to do something else?
model = AutoModelForVision2Seq.from_pretrained(
    "justinkarlin/idefics3-qlora-faces4",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to(DEVICE)

image = load_image('https://eyewiki.org/w/images/5/5b/Marcus_Marcet_Eyewiki_involutional_blepharoptosis_greater_in_LUL.jpg')  # path to your picture

### Trained options
user_prompt = "Describe the facial features in this photo."
messages = [
    {
        # Important!
        "role": "system",
        "content": [
            {"type": "text", "text": "Describe the facial features in this photo"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt}
        ]
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
caption = generated_texts[0].split('Assistant: ')[1]
print(caption)
You should use the original Idefics3 preprocessor config, i.e. just the name of the original repo or the path to your local copy of the model.
Since you are using a LoRA adapter, it is loaded on top of the original model. First you need to load the main model:
model_path="HuggingFaceM4/Idefics3-8B-Llama3" #or change to local path
processor = AutoProcessor.from_pretrained(model_path)
model = Idefics3ForConditionalGeneration.from_pretrained(
model_path, torch_dtype=torch.bfloat16
).to(DEVICE)
Then apply the LoRA adapter on top of it:
peft_model_id="repo/or/local/path"
model.load_adapter(peft_model_id)
After that you can run inference with it like a regular model. However, there have been some issues with very slow inference when a LoRA is loaded this way, due to the context cache not working correctly (see the linked discussion).
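If you hit that slowdown, one possible workaround (a sketch, assuming the adapter is saved in the standard PEFT format) is to merge the adapter weights into the base model before inference:
from peft import PeftModel

model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload()  # folds the LoRA weights into the base weights, so generation runs at normal speed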
By the way, have you checked out the Hugging Face NLP course? It's quite interesting and will help you get more familiar with ML in general and LLMs/VLMs in particular.
Hmm... is this right? For some reason, when I run this, the generated output is worse than when I feed the image through the Idefics3 playground. Thanks for sharing the Hugging Face NLP course, I definitely need it!
model_path="HuggingFaceM4/Idefics3-8B-Llama3"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path,
torch_dtype=torch.bfloat16,
_attn_implementation="flash_attention_2",
).to(DEVICE)
peft_model_id="justinkarlin/idefics3-qlora-faces4"
model.load_adapter(peft_model_id)
image = load_image('https://eyewiki.org/w/images/5/5b/Marcus_Marcet_Eyewiki_involutional_blepharoptosis_greater_in_LUL.jpg') #path to your picture
###Trained options
user_prompt="Answer briefly."
messages = [
{
#Important!
"role": "system",
"content": [
{"type": "text", "text": "You are an image expert. A patient sends a photo of themselves, seeking oculoplastics aesthetic consultation. What are the facial features in this photo?"}
]
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": user_prompt}
]
}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
caption=generated_texts[0].split('Assistant: ')[1]
print(caption)
===
These are my metrics from the fine-tuning run. When I compare them to other examples online they seem OK, but the training doesn't output the individual per-step training loss, which makes me worry that my poor generated output comes from an error in the fine-tuning.