metadata

base_model: mistralai/Mistral-7B-Instruct-v0.3
datasets:
  - generator
library_name: peft
license: apache-2.0
tags:
  - trl
  - sft
  - generated_from_trainer
model-index:
  - name: Mistral-7B-text-to-sql-flash-attention-2-dataeval
    results: []

Mistral-7B-text-to-sql-flash-attention-2-dataeval

This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.3 on the generator dataset. It achieves the following results on the evaluation set:

Loss: 0.4605
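
For a causal language model trained with token-level cross-entropy, the evaluation loss can be read as a perplexity by exponentiating it. The short sketch below does that conversion for the loss reported above; it assumes the reported value is the mean per-token cross-entropy in nats, which is the usual Trainer behaviour but is not stated explicitly in this card.

```python
import math

# Evaluation loss reported in this card (assumed to be mean token-level
# cross-entropy in nats).
eval_loss = 0.4605

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.3f}")
```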

Model description

Article: https://medium.com/@frankmorales_91352/fine-tuning-the-llm-mistral-7b-instruct-v0-3-249c1814ceaf

Training and evaluation data

Fine Tuning and Evaluation: https://github.com/frank-morales2020/MLxDL/blob/main/FineTuning_LLM_Mistral_7B_Instruct_v0_1_for_text_to_SQL_EVALDATA.ipynb

Evaluation: https://github.com/frank-morales2020/MLxDL/blob/main/Evaluator_Mistral_7B_text_to_sql.ipynb

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 3
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 24
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
lr_scheduler_warmup_ratio: 0.03
lr_scheduler_warmup_steps: 15
num_epochs: 3
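
The batch-size figures above are related by simple arithmetic, sketched below. The device count of 1 is an assumption (the card does not say how many GPUs were used), and the dataset-size figure is only a rough estimate inferred from the step/epoch columns of the training results table, not a number reported by the author.

```python
# Values taken from the hyperparameter list above.
train_batch_size = 3
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single GPU, not stated in the card

# Examples consumed per optimizer update.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print("total_train_batch_size:", total_train_batch_size)

# Rough estimate only: the results table logs step 10 at epoch 0.4020,
# which implies about 10 / 0.4020 optimizer steps per epoch.
steps_per_epoch_estimate = 10 / 0.4020
approx_train_examples = steps_per_epoch_estimate * total_train_batch_size
print("approx. optimizer steps per epoch:", round(steps_per_epoch_estimate, 1))
print("approx. training examples:", round(approx_train_examples))
```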

from transformers import TrainingArguments

# access_token_write is a Hugging Face write token defined earlier in the notebook.
args = TrainingArguments(
    output_dir="Mistral-7B-text-to-sql-flash-attention-2-dataeval",
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=8,          # micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused AdamW optimizer
    logging_steps=10,                       # log every 10 steps
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm, based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio, based on QLoRA paper
    warmup_steps=15,                        # when > 0, takes precedence over warmup_ratio
    weight_decay=0.01,
    lr_scheduler_type="constant",           # constant learning rate (this scheduler applies no warmup)
    push_to_hub=True,                       # push model to hub
    hub_token=access_token_write,           # write token used for push_to_hub
    report_to="tensorboard",                # report metrics to tensorboard
    logging_dir="/content/gdrive/MyDrive/model/Mistral-7B-text-to-sql-flash-attention-2-dataeval/logs",
    evaluation_strategy="steps",            # evaluate every eval_steps
    eval_steps=10,
    save_strategy="steps",                  # must match evaluation_strategy for load_best_model_at_end
    save_steps=10,
    load_best_model_at_end=True,            # reload the checkpoint with the best metric after training
    metric_for_best_model="loss",           # lower validation loss is better
)
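
The configuration above sets warmup_ratio, warmup_steps, and lr_scheduler_type="constant" together. The pure-Python sketch below mirrors, to the best of my understanding of the Transformers scheduler logic, how those settings interact: an explicit warmup_steps overrides warmup_ratio, and the "constant" schedule ignores warmup entirely, whereas "constant_with_warmup" would ramp up linearly. The helper names and the total_steps value are hypothetical and exist only for illustration; this is not the library's code.

```python
import math


def resolve_warmup_steps(warmup_steps: int, warmup_ratio: float, total_steps: int) -> int:
    """An explicit warmup_steps > 0 wins; otherwise derive it from warmup_ratio."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(total_steps * warmup_ratio)


def lr_factor(scheduler: str, step: int, num_warmup_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if scheduler == "constant":
        return 1.0  # no warmup phase at all
    if scheduler == "constant_with_warmup":
        if step < num_warmup_steps:
            return step / max(1, num_warmup_steps)  # linear ramp from 0 to 1
        return 1.0
    raise ValueError(f"unsupported scheduler: {scheduler}")


total_steps = 72  # hypothetical total number of optimizer steps
warmup = resolve_warmup_steps(warmup_steps=15, warmup_ratio=0.03, total_steps=total_steps)
base_lr = 2e-4

for step in (0, 5, 15):
    constant_lr = base_lr * lr_factor("constant", step, warmup)
    warmup_lr = base_lr * lr_factor("constant_with_warmup", step, warmup)
    print(step, constant_lr, warmup_lr)
```

If this reading is right, the warmup settings had no effect on this run, and switching lr_scheduler_type to "constant_with_warmup" would be the way to make them apply.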

Training results

Training Loss	Epoch	Step	Validation Loss
1.8612	0.4020	10	0.6092
0.5849	0.8040	20	0.5307
0.4937	1.2060	30	0.4887
0.4454	1.6080	40	0.4670
0.425	2.0101	50	0.4544
0.3498	2.4121	60	0.4717
0.3439	2.8141	70	0.4605
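
Because training used load_best_model_at_end=True with metric_for_best_model="loss", it can be useful to see which logged checkpoint had the lowest validation loss. The sketch below copies the rows from the table above and picks the minimum. Which checkpoint was actually pushed to the Hub is not stated in this card, so treat the result as a reading of the table rather than a statement about the published weights.

```python
# (training_loss, epoch, step, validation_loss) rows copied from the table above.
rows = [
    (1.8612, 0.4020, 10, 0.6092),
    (0.5849, 0.8040, 20, 0.5307),
    (0.4937, 1.2060, 30, 0.4887),
    (0.4454, 1.6080, 40, 0.4670),
    (0.4250, 2.0101, 50, 0.4544),
    (0.3498, 2.4121, 60, 0.4717),
    (0.3439, 2.8141, 70, 0.4605),
]

# Lower validation loss is better, so take the row with the minimum.
best = min(rows, key=lambda row: row[3])
final = rows[-1]

print("best step:", best[2], "validation loss:", best[3])
print("final step:", final[2], "validation loss:", final[3])
```

In this table the lowest validation loss is at step 50, while the evaluation loss reported at the top of the card matches the last logged evaluation at step 70. Training loss keeps falling after step 50 while validation loss does not, which may point to mild overfitting in the third epoch.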

Framework versions

PEFT 0.11.1
Transformers 4.41.2
Pytorch 2.3.0+cu121
Datasets 2.20.0
Tokenizers 0.19.1