How to fine-tune this? + Training code

#19
by cekal - opened

I have tried fine-tuning the model with LoRA (peft) using the following target modules: 'lm_head.linear', 'transformer.embd.wte'. This resulted in better responses, but I feel like something is wrong in my training setup: the model often behaves weirdly, and its responses are significantly worse than the ones from Mistral 7B. Considering Microsoft called this the state-of-the-art model below 13B parameters and mentioned that it beats Mistral, it should outperform it, not underperform. I use a high-quality proprietary Q&A dataset, so the dataset quality cannot be the issue.

Just to confirm, am I using the right 'target_modules', or should I use different ones? Here is my training code:

import os
from dataclasses import dataclass, field
from typing import Optional

import torch
from datasets import load_dataset
from datasets import load_from_disk
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
)
from tqdm.notebook import tqdm

from trl import SFTTrainer
from huggingface_hub import interpreter_login

interpreter_login()

compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )
device_map = {"": 0}

#Download model
model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-2", 
        quantization_config=bnb_config, 
        device_map=device_map,
        trust_remote_code=True,
        use_auth_token=True
    )

model.config.pretraining_tp = 1 
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=32,
    target_modules=['lm_head.linear', 'transformer.embd.wte'], # is this correct?
    bias="none",
    task_type="CAUSAL_LM", 
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

training_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=500, #CHANGE THIS IF YOU WANT IT TO SAVE LESS OFTEN. I WOULDN'T SAVE MORE OFTEN BECAUSE OF SPACE
    logging_steps=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    max_grad_norm=.3,
    max_steps=10000,
    warmup_ratio=.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)

model.config.use_cache = False

dataset = load_dataset("json", data_files="your_dataset.json", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

trainer.train()

This might be naive, but if you load the model in fp16, why do you train with bf16=True?

I'm guessing we need additional target modules + higher rank given the model is smaller? If you're only using one gpu the effective batch size is still really small - they trained over a ton of tokens, I'm wondering if the lr might need to be lower as well.

That being said you made it further than I did, I was running into the gradient checkpointing error (there's already a pull request, so I was hoping that would be merged in). So I haven't experimented nearly enough. Thanks for providing your code since at least it runs and you have me beat there...
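
To put a number on the "effective batch size is still really small" remark: with the posted TrainingArguments, one optimizer step sees per_device_train_batch_size × gradient_accumulation_steps × number of GPUs sequences. A minimal sketch of that arithmetic (the helper name is made up for illustration):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


# Values taken from the training script above, assuming a single GPU.
posted = effective_batch_size(per_device_batch=1, grad_accum_steps=4, num_gpus=1)
print(posted)  # 4
```

Raising gradient_accumulation_steps increases this number without increasing peak memory, at the cost of fewer optimizer steps per hour.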

Regarding your question about bf16 & fp16:

When you load a model in fp16 (float16), each value takes 16 bits instead of 32, which cuts memory use and helps with large models. Training, however, is more sensitive to numerical range. bf16 (bfloat16) also uses 16 bits, but it keeps the same exponent range as fp32 at the cost of fewer mantissa bits, so it is less prone to overflow and underflow during training. That gives a good balance between saving memory and stable training.
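
One concrete way to compare the two 16-bit formats is to compute their range and precision from the bit layouts (fp16: 5 exponent bits and 10 mantissa bits; bf16: 8 exponent bits and 7 mantissa bits). A small sketch in plain Python, with a helper name chosen just for illustration:

```python
def float_format_limits(exponent_bits: int, mantissa_bits: int) -> tuple[float, float]:
    """Return (largest finite value, machine epsilon) for an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent
    epsilon = 2.0 ** -mantissa_bits
    return largest, epsilon


fp16_max, fp16_eps = float_format_limits(exponent_bits=5, mantissa_bits=10)
bf16_max, bf16_eps = float_format_limits(exponent_bits=8, mantissa_bits=7)

print(f"fp16: max={fp16_max:.0f}, eps={fp16_eps}")
print(f"bf16: max={bf16_max:.3e}, eps={bf16_eps}")
```

bf16 trades precision (larger epsilon) for a far wider range, which is the usual argument for preferring it in training when the hardware supports it.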

“I'm guessing we need additional target modules + higher rank given the model is smaller?” - Maybe. What I did was run print(model), copy the output, and paste it into GPT-4, which selected the 2 modules specified in my previous message as the ones I should target.

Anyway, I have no idea, but my only hope is that I've missed some modules or messed something up; otherwise the training results are disappointing. If you figure it out, please let me know. I will do the same if I learn anything new.

Regarding LoraConfig: I read the LoRA paper and learned that the adapters should target the self-attention layers.
Below is my LoraConfig:
LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=[
        'Wqkv',
        'out_proj',
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

@navanit thanks for sharing! I will begin another training with these target modules and see if the performance improves or not. Will keep you all updated.

Excellent results! @navanit thank you for confirming the correct target_modules, the model now responds as expected.

Here is an example prompt I gave it: How can advances in artificial intelligence and machine learning contribute to more accurate and timely weather forecasting, and what are the limitations of relying on these technologies for weather predictions?

Screenshot 2023-12-15 at 14.10.45.png

Screenshot 2023-12-15 at 14.21.36.png

Great reasoning capability as well, GPT-3.5-Turbo wasn't able to answer this one correctly:

Screenshot 2023-12-15 at 14.22.59.png

@cekal I am facing the error by using your code.
ValueError: PhiForCausalLM does not support gradient checkpointing.
any walkthrough?

@Navanit-shorthills which GPU are you using? I'm on 1x A100 runpod.io (Jupyter notebook). The error you're encountering is due to the incompatibility of the PhiForCausalLM model with gradient checkpointing. To resolve this, you need to disable gradient checkpointing. This might increase memory usage, but it's necessary for this specific model architecture. You may try passing

model.config.gradient_checkpointing = False

right after loading the model. Replace the following section of the previous script with this one and try running it:

# Configure model and training
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

device_map = {"": 0}
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", 
    quantization_config=bnb_config, 
    device_map=device_map,
    trust_remote_code=True,
    use_auth_token=True
)

# Disable gradient checkpointing
model.config.gradient_checkpointing = False

Let me know if that solves the issue or not.

@cekal thanks for the answer. Currently I am using an NVIDIA GeForce RTX 3090 with 24.5 GB of memory. I will see if I can train on it.

@cekal you were right, I tried the workaround. After disabling gradient_checkpointing, I started hitting a CUDA out-of-memory error. Is there any workaround? With the same GPU I trained Llama 2 7B and Mistral 7B, but I am unable to fine-tune this 2.7B-parameter model.

@Navanit-shorthills it seems like more people are running into this problem. Instead of turning off gradient checkpointing, which is probably not the most effective approach, try adding checkpointing=True to AutoModelForCausalLM.from_pretrained:

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2",  checkpointing=True)

as mentioned here:
Screenshot 2023-12-15 at 20.17.38.png

But again, I cannot verify that this works, as @rungao2001 got a "TypeError: PhiForCausalLM.__init__() got an unexpected keyword argument 'checkpointing'" error when applying it. But try it, it might work.

If that doesn't work, try doing the model.config.gradient_checkpointing = False approach as before but reduce the batch size and try training on a lower max_seq_length (e.g. max_seq_length=2048 ----> max_seq_length=1096). But this can produce a less capable model.

My last suggestion, if everything fails, is either to wait, as it seems like more people are encountering this issue, or to use cloud computing like runpod.io (it cost me $15-$20 to fully fine-tune it).

@cekal thanks, I was able to fine-tune by decreasing max_seq_length to 720.

Also, I had used the below config.

image.png

But the point still stands: I was able to train Mistral 7B and Llama 2 7B with max_seq_length=2048 on my 24 GB GPU.

image.png

image.png

I tried this prompt with different numbers. Both ChatGPT and phi-2 gave the wrong answer?

@Deepakvictor might be because you used a different version of the model. The results I displayed were from my custom fine-tuned version of phi-2, which is currently private.

https://github.com/hiyouga/LLaMA-Factory this repo seems to support Phi-2. Here is my working toy script:

#!/bin/bash

eval "$(conda shell.bash hook)"
conda activate llama_factory

MODEL_NAME=phi-2
STAGE=sft
EPOCH=.01 #3.0
DATA=alpaca_gpt4_zh
SAVE_PATH=./models/$STAGE/$MODEL_NAME-$STAGE-$DATA-$EPOCH
SAVE_PATH_PREDICT=$SAVE_PATH/Predict
MODEL_PATH=./models/$MODEL_NAME
LoRA_TARGET=Wqkv #q_proj,v_proj
TEMPLATE=default
PREDICTION_SAMPLES=20

if [ ! -d "$MODEL_PATH" ]; then
    echo "Model not found: $MODEL_PATH"
    exit 1  # 'return' is only valid inside a function or a sourced script
fi

if [ ! -d $SAVE_PATH ]; then
    mkdir -p $SAVE_PATH
fi

if [ ! -d $SAVE_PATH_PREDICT ]; then
    mkdir -p $SAVE_PATH_PREDICT
fi

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --seed 42 \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --dataset $DATA \
    --val_size .1 \
    --val_max_sample 20 \
    --finetuning_type lora \
    --do_train \
    --lora_target $LoRA_TARGET \
    --output_dir $SAVE_PATH \
    --overwrite_output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs $EPOCH \
    --do_eval \
    --evaluation_strategy steps \
    --per_device_eval_batch_size 1 \
    --prediction_loss_only \
    --plot_loss \
    --quantization_bit 4 \
    |& tee $SAVE_PATH/train_eval_log.txt

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --do_predict \
    --max_samples $PREDICTION_SAMPLES \
    --predict_with_generate \
    --dataset $DATA \
    --template $TEMPLATE \
    --finetuning_type lora \
    --adapter_name_or_path $SAVE_PATH \
    --output_dir $SAVE_PATH_PREDICT \
    --per_device_eval_batch_size 1 \
    |& tee $SAVE_PATH_PREDICT/predict_log.txt

@cekal I am facing the error by using your code.
ValueError: PhiForCausalLM does not support gradient checkpointing.
any walkthrough?

Me too

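import datasets
import torch
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

base_model = "microsoft/phi-2"
new_model = "phi-2-pa"
dataset = datasets.load_from_disk('wiki_pa_train_dataset')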

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token=tokenizer.eos_token
tokenizer.padding_side="right"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    # use_flash_attention_2=True, # Phi does not support yet.
    trust_remote_code=True,
    flash_attn=True,
    flash_rotary=True, 
    fused_dense=True,
    low_cpu_mem_usage=True,
    device_map={"": 0},
    revision="refs/pr/23",
)

model.config.use_cache = False
model.config.pretraining_tp = 1

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2, 
    gradient_accumulation_steps=32, 
    evaluation_strategy="steps",
    eval_steps=2000,
    logging_steps=15,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    save_steps=2000,
    warmup_ratio=0.05,
    weight_decay=0.01,
    report_to="tensorboard",
    max_steps=-1, # if maximum steps=2, it will stop after two steps
)

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules= ["Wqkv", "fc1", "fc2" ] # ["Wqkv", "out_proj", "fc1", "fc2" ], - 41M params
    # modules_to_save=["embed_tokens","lm_head"] 
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['train'], #No separate evaluation dataset, i am using the same dataset
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=690,
    tokenizer=tokenizer,
    args=training_arguments,
)
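
The "41M params" note in the comment above can be checked by hand: LoRA adds r × (in_features + out_features) trainable parameters to each adapted linear layer. The sketch below uses the sizes from the architecture printout posted elsewhere in this thread (hidden size 2560, MLP size 10240, 32 layers) and assumes that Wqkv is the fused query/key/value projection, i.e. 2560 → 7680; that shape is an assumption, not something stated in the thread.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A: r x in, B: out x r)."""
    return r * (in_features + out_features)


HIDDEN, MLP, LAYERS, RANK = 2560, 10240, 32, 32

# (in_features, out_features) per targeted module.
# Assumption: Wqkv is the fused q/k/v projection, so its output is 3 * HIDDEN.
shapes = {
    "Wqkv": (HIDDEN, 3 * HIDDEN),
    "out_proj": (HIDDEN, HIDDEN),
    "fc1": (HIDDEN, MLP),
    "fc2": (MLP, HIDDEN),
}


def total_lora_params(targets: list[str]) -> int:
    """Sum the LoRA parameters over all decoder layers for the chosen targets."""
    return LAYERS * sum(lora_params(*shapes[name], RANK) for name in targets)


print(total_lora_params(["Wqkv", "out_proj", "fc1", "fc2"]))  # 41943040
print(total_lora_params(["Wqkv", "fc1", "fc2"]))              # 36700160
```

Under these assumptions the four-module variant comes to about 41.9M trainable parameters, which matches the comment, and dropping out_proj brings it down to about 36.7M.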

Hi folks, here is my ft result done by llama_factory

https://huggingface.co/microsoft/phi-2/discussions/35#65819d07ca21d74c214cb3f6

@Navanit-shorthills true, I'm also having the same issue

@pbatra if you find any answer kindly reply in this thread.

Excellent results! @navanit thank you for confirming the correct target_modules, the model now responds as expected.

Here is an example prompt I gave it: How can advances in artificial intelligence and machine learning contribute to more accurate and timely weather forecasting, and what are the limitations of relying on these technologies for weather predictions?

Screenshot 2023-12-15 at 14.10.45.png

Can you share the final working code that worked for you?

Hello, is this open-source model already trained? Can it be converted to Chinese? Thanks, I'm a complete beginner.

phi-2 has a bug when speaking Chinese: it spits out gibberish.

@Yhyu13 that's because the base model was trained on an English dataset, as seen in the picture below:

Screenshot 2024-01-05 at 21.58.42.png

Sometimes, all you need is to read the documentation.

@cekal, I have a question about your fine-tuning code:

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

In this code, what does dataset_text_field="text" correspond to? Is it the prompt?

@zoujiulong In the fine-tuning script, the dataset_text_field parameter in the SFTTrainer object specifies the field name from your dataset that contains the text data used for training. This is not necessarily a prompt, but rather the actual textual content that you want the model to learn from.

Your dataset, which the script loads with load_dataset("json", data_files="your_dataset.json", split="train"), is expected to be a collection of records, where each record is a JSON object. The dataset_text_field='text' means that the trainer will look for a field named "text" in each JSON object of your dataset. This "text" field should contain the actual textual data.

For example, if you are training a language model and your dataset consists of sentences or paragraphs, each JSON object in your dataset file might look like this:

{ "text": "Here is a sample sentence for the language model to learn." }

In this case, "text" is the key in each JSON object that points to the actual textual data you want the model to train on. If your dataset uses a different field name to store this textual data, you should change the dataset_text_field parameter accordingly to match that field name.
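
As a rough illustration of the layout described above, the sketch below writes a small file with one JSON object per line, each carrying a "text" field, and reads it back. The Q&A pairs and the "Question: ... Answer: ..." template are made up for the example; the thread does not say which template the original dataset used.

```python
import json
import os
import tempfile

# Hypothetical Q&A pairs; in practice these come from your own data.
pairs = [
    ("What is LoRA?", "A parameter-efficient fine-tuning method."),
    ("What is phi-2?", "A small language model from Microsoft."),
]

# Each record carries the full training string under the "text" key.
records = [{"text": f"Question: {q}\nAnswer: {a}"} for q, a in pairs]

path = os.path.join(tempfile.mkdtemp(), "your_dataset.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSON object per line

# Read the file back and check the field the trainer will look for.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(all("text" in row for row in loaded))
```

A file in this shape is what the load_dataset("json", data_files=..., split="train") call in the script expects to find a "text" column in.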

@cekal thank you, I see. I'm a beginner. I have one more question: your purpose is Q&A, and I remember that models such as BertForQuestionAnswering take both the question and the text as input. Why is only one field used here? Can phi-2 learn just from being given text, and then answer questions?

very bad model. fine tune not working properly. :(
[ 12/100 00:08 < 01:17, 1.13 it/s, Epoch 0.00/1]
Step Training Loss
1 0.000000
2 0.000000
3 0.000000
4 0.000000
5 0.000000
6 0.000000
7 0.000000
8 0.000000
9 0.000000
10 0.000000

@Imran1 model isn't bad, perhaps your code is. 0 loss is obviously wrong. Mind sharing your fine-tuning script?

You can also try this: https://github.com/brevdev/notebooks/blob/e815947d907460c3ed123d49ac6aeab67a9adf22/phi2-finetune-own-data.ipynb

@cekal why is the loss showing zero?

Could you please re-run with the latest update (FP16)? We updated the modeling_phi.py file and disabled the auto-casting on the Attention layer. This is the same fix as the previous code had.

@gugarosa I have performed a full fine-tune of phi-2 on a single RTX A6000, but the loss goes to zero very quickly, within just 10 steps. I have tried with the latest transformers==4.37.0. Can you help me with this? Thanks.

My implementation follows https://github.com/brevdev/notebooks/blob/e815947d907460c3ed123d49ac6aeab67a9adf22/phi2-finetune-own-data.ipynb, but I commented out the quantization and LoRA parts for full fine-tuning.

Hi @cekal, I am trying to fine-tune with target = ["Wqkv", "out_proj"] after exploring a few notebooks, but it throws an error saying that the target modules are not present. I checked the model architecture too, and this is what I see:
PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
)

Can you please suggest what I am missing? I downloaded the model manually due to some network restrictions in my org.

@sbakhtyar Hi, based on the output you provided, try these:

target_modules = [
    "q_proj",  # Targeting query projection in PhiAttention
    "k_proj",  # Targeting key projection in PhiAttention
    "v_proj",  # Targeting value projection in PhiAttention
    "dense",   # Targeting the dense layer in PhiAttention for output transformation, not sure if appropriate, comment out if not necessary
    "fc1",     # Targeting the first fully connected layer in PhiMLP
    "fc2",     # Targeting the second fully connected layer in PhiMLP
]

Let me know how it goes!
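
Since the module names differ between model revisions (Wqkv/out_proj in some posts, q_proj/k_proj/v_proj/dense in the printout above), one way to avoid guessing is to read the candidate names off the print(model) output. The sketch below does that with a regular expression over a shortened copy of the printout; the helper name is hypothetical and the approach is only a convenience, not part of peft.

```python
import re

# Excerpt of a print(model) dump, shortened from the one posted above.
MODEL_REPR = """
(self_attn): PhiAttention(
  (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
  (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
  (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
  (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
  (rotary_emb): PhiRotaryEmbedding()
)
(mlp): PhiMLP(
  (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
  (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
)
(lm_head): Linear(in_features=2560, out_features=51200, bias=True)
"""


def linear_module_names(model_repr: str) -> list[str]:
    """Names of Linear-type submodules (Linear, Linear4bit, ...), in order of first appearance."""
    pattern = re.compile(r"\((\w+)\): Linear\w*\(")
    names: list[str] = []
    for name in pattern.findall(model_repr):
        if name not in names:
            names.append(name)
    return names


candidates = linear_module_names(MODEL_REPR)
print(candidates)  # ['q_proj', 'k_proj', 'v_proj', 'dense', 'fc1', 'fc2', 'lm_head']
```

lm_head shows up as well; whether to adapt it is a separate decision, and the configs reported as working in this thread target only the attention and MLP projections.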

Hey,
how do I know which attention layers to choose as target_modules? Right now I am using target_modules=["Wqkv", "fc1", "fc2"] for fine-tuning phi-2. In the LoRA paper, the authors state that they only applied their approach to the attention module and that more research is needed for the MLP module. Which target_modules should I choose, and why?
I appreciate all answers :)

Hi, @cekal ,

Can you please share your requirements.txt file? I am trying to finetune this model but I am getting an error from the bitsandbytes package:

Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback):

    CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

Thanks,

Sorry, I'm new to finetuning LLMs and my question might be too basic:
I have a DataFrame with two columns. "prompt" and "completion". The prompt is a statement and the completion is an argument in favor of that statement. I want to fine-tune Phi-2 for it.
I don't know if I should keep the two columns and give them separately to the model as input and label (if so, how should I give the label text to SFTTrainer?) or should I merge the two columns as one complete text column and feed that to the model? If so, how should I exactly combine the texts? I mean what special tokens should I put in the middle?

      from peft import LoraConfig, get_peft_model # peft-0.7.1
      import torch
      from transformers import (
          AutoModelForCausalLM,
          AutoTokenizer,
          BitsAndBytesConfig,
          TrainingArguments,
      )
      from trl import SFTTrainer

      bnb_config = BitsAndBytesConfig(
              load_in_4bit=True,
              bnb_4bit_quant_type='nf4',
              bnb_4bit_compute_dtype='float16',
              bnb_4bit_use_double_quant=False
              )

      model_path = "/.../phi-2/"

      # load model

      model = AutoModelForCausalLM.from_pretrained(
              model_path, 
              quantization_config=bnb_config, 
      #         device_map=device_map,
              trust_remote_code=True,
      #         use_auth_token=True
              )
      tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

      peft_config = LoraConfig(
              r=8,
              lora_alpha=8,
              target_modules=['q_proj',
                              'k_proj',
                              'v_proj',
                              'dense',
                              'fc1',
                              'fc2',
                              ],
              bias="none",
              lora_dropout=0.05, # Conventional
              task_type="CAUSAL_LM",
              modules_to_save = ["lm_head", "embed_tokens"]   # because we added new tokens
              )

      # add LoRA adaptor
      model = get_peft_model(model, peft_config)


      from transformers import DataCollatorForSeq2Seq

      # we want to ignore tokenizer pad token in the loss
      label_pad_token_id = -100
      # Data collator
      data_collator = DataCollatorForSeq2Seq(
          tokenizer,
          model=model,
          label_pad_token_id=label_pad_token_id,
          pad_to_multiple_of=8
      )



      from datasets import Dataset, concatenate_datasets

      training_arguments = TrainingArguments(
          output_dir="./results",
          per_device_train_batch_size=1,
          gradient_accumulation_steps=4,
          optim="paged_adamw_32bit",
          save_steps=500, #CHANGE THIS IF YOU WANT IT TO SAVE LESS OFTEN. I WOULDN'T SAVE MORE OFTEN BECAUSE OF SPACE
          logging_steps=10,
          learning_rate=2e-4,
          fp16=False,
          bf16=True,
          max_grad_norm=.3,
          max_steps=10000,
          warmup_ratio=.03,
          group_by_length=True,
          lr_scheduler_type="constant"
          )

      model.config.use_cache = False

      train_dataset_object = Dataset.from_pandas(data_df[['sentence_j',
                                                                'sentence_i']].rename({'sentence_j':'prompt',
                                                                                       'sentence_i':'completion'},axis=1))  # here is where I'm unsure what to do


      # Create a Data Collator for Seq2Seq LM
      data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, pad_token_id=tokenizer.pad_token_id)

      # Prepare the dataset for SFTTrainer
      train_dataset_generator = torch.utils.data.DataLoader(train_dataset_object, batch_size=32, collate_fn=data_collator)

      trainer = SFTTrainer(
          model=model,
          train_dataset=dataset,
          peft_config=peft_config,
          dataset_text_field="text",
          max_seq_length=2048,
          tokenizer=tokenizer,
          args=training_arguments,
          packing=False,
      )

      trainer.train()
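
On the prompt/completion question: one common approach with SFTTrainer and dataset_text_field="text" is to merge the two columns into a single string per row. The sketch below uses an "Instruct: ... Output: ..." template, which follows the QA prompt format shown on the phi-2 model card, and appends an end-of-sequence token so the model can learn where an answer stops. The function name, the sample row, and the default EOS string are assumptions for illustration; in practice tokenizer.eos_token would be passed in.

```python
def build_text(prompt: str, completion: str, eos_token: str = "<|endoftext|>") -> str:
    """Join one prompt/completion pair into a single training string."""
    return f"Instruct: {prompt.strip()}\nOutput: {completion.strip()}{eos_token}"


# Hypothetical row standing in for the two DataFrame columns.
rows = [
    {
        "prompt": "Cities should invest in public transport.",
        "completion": "It reduces congestion and lowers emissions per passenger.",
    },
]

# One dict per row, with the merged string under the "text" key.
texts = [{"text": build_text(row["prompt"], row["completion"])} for row in rows]
print(texts[0]["text"])
```

With a DataFrame, the same function can be applied row-wise to create a "text" column before building the Dataset. No special separator token is required in the middle; a consistent plain-text template is usually enough.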

I want to train the phi-2 model on a CPU. Are the configs the same?

How can we instruction-fine-tune phi-2 so that it follows instructions?

Does generation stop when it should for you?
For me, phi-2 and phi-1.5 always generate until max-length is reached, if defined.
