---
base_model: mistralai/Mistral-7B-Instruct-v0.3
datasets:
- generator
library_name: peft
license: apache-2.0
tags:
- trl
- sft
- generated_from_trainer
model-index:
- name: Mistral-7B-text-to-sql-flash-attention-2-dataeval
  results: []
---


# Mistral-7B-text-to-sql-flash-attention-2-dataeval

This model is a fine-tuned version of [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) on the generator dataset.
It achieves the following results on the evaluation set:

- Loss: 0.4605
- Perplexity: 10.40

Perplexity measures how uncertain, or "surprised", the model is about its predictions.
It is derived from the probabilities the model assigns to each token: it is the exponential of the average negative log-likelihood per token, so lower values indicate more confident predictions.

Perplexity references:
- https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- https://medium.com/@AyushmanPranav/perplexity-calculation-in-nlp-0699fbda4594
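
As a minimal sketch of the definition above, perplexity is conventionally obtained by exponentiating the mean per-token cross-entropy loss (in nats). The helper names `perplexity_from_loss` and `loss_from_perplexity` are illustrative and are not taken from the evaluation notebook.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse conversion: the mean loss (in nats) that yields a given perplexity."""
    return math.log(perplexity)


# Values reported in this model card.
eval_loss = 0.4605
reported_perplexity = 10.40

print(f"perplexity implied by eval loss: {perplexity_from_loss(eval_loss):.2f}")
print(f"mean loss implied by reported perplexity: {loss_from_perplexity(reported_perplexity):.2f}")
```

Under this convention, an evaluation loss of 0.4605 corresponds to a perplexity of roughly 1.58, while a perplexity of 10.40 corresponds to a mean loss of roughly 2.34. The reported perplexity was therefore presumably measured with a different procedure or on different data than the Trainer's evaluation loss; the evaluation notebook linked below is the authoritative source.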

The perplexity of 10.40 on the evaluation data suggests that the fine-tuned Mistral-7B model has a reasonable grasp of natural language and SQL syntax.
Perplexity alone does not show whether the generated queries are correct, however, so task-specific metrics are still needed to assess the model's effectiveness in real-world scenarios.
Combining quantitative metrics such as perplexity with qualitative analysis of the generated queries
gives a fuller picture of the model's strengths and weaknesses and can guide further
improvements toward more reliable text-to-SQL translation.
 

Dataset: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)

## Model description

Article: https://medium.com/@frankmorales_91352/fine-tuning-the-llm-mistral-7b-instruct-v0-3-249c1814ceaf

## Training and evaluation data

Fine Tuning and Evaluation: https://github.com/frank-morales2020/MLxDL/blob/main/FineTuning_LLM_Mistral_7B_Instruct_v0_1_for_text_to_SQL_EVALDATA.ipynb

Evaluation: https://github.com/frank-morales2020/MLxDL/blob/main/Evaluator_Mistral_7B_text_to_sql.ipynb

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 3
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 24
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.03
- lr_scheduler_warmup_steps: 15
- num_epochs: 3
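
The relationship between the batch-size values in the list above can be checked with a short calculation. This is a sketch under stated assumptions: `num_devices = 1` is an assumption (it is the value consistent with a total batch size of 24), and the estimate of examples per epoch is inferred from the first logged row of the training results (step 10 at epoch 0.4020), not stated by the source.

```python
# Values taken from the hyperparameter list.
train_batch_size = 3
gradient_accumulation_steps = 8

# Assumption: training ran on a single device.
num_devices = 1

# Effective number of examples contributing to each optimizer update.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print("total_train_batch_size:", total_train_batch_size)

# Rough estimate only: the training log reports step 10 at epoch 0.4020,
# which implies about 10 / 0.4020 optimizer steps per epoch.
steps_per_epoch = 10 / 0.4020
approx_examples_per_epoch = steps_per_epoch * total_train_batch_size
print("approx. optimizer steps per epoch:", round(steps_per_epoch, 1))
print("approx. training examples per epoch:", round(approx_examples_per_epoch))
```

The first result matches the reported `total_train_batch_size` of 24. The second suggests a training subset of roughly 600 examples, which should be read as an estimate rather than a documented figure.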

The hyperparameters above were set through the following `TrainingArguments`:

```python
from transformers import TrainingArguments

# Assumes `access_token_write` (a Hugging Face token with write access) is defined earlier.
args = TrainingArguments(
    output_dir="Mistral-7B-text-to-sql-flash-attention-2-dataeval",
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=8,          # micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused AdamW optimizer
    logging_steps=10,                       # log every ten steps
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    weight_decay=0.01,
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
    hub_token=access_token_write,           # token used when pushing to the hub
    load_best_model_at_end=True,            # restore the best checkpoint when training ends
    logging_dir="/content/gdrive/MyDrive/model/Mistral-7B-text-to-sql-flash-attention-2-dataeval/logs",
    evaluation_strategy="steps",            # evaluate every `eval_steps` steps
    eval_steps=10,
    save_strategy="steps",                  # save on the same schedule as evaluation
    save_steps=10,
    metric_for_best_model="loss",           # best checkpoint is chosen by evaluation loss
    warmup_steps=15,
)
```

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.8612        | 0.4020 | 10   | 0.6092          |
| 0.5849        | 0.8040 | 20   | 0.5307          |
| 0.4937        | 1.2060 | 30   | 0.4887          |
| 0.4454        | 1.6080 | 40   | 0.4670          |
| 0.4250        | 2.0101 | 50   | 0.4544          |
| 0.3498        | 2.4121 | 60   | 0.4717          |
| 0.3439        | 2.8141 | 70   | 0.4605          |
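
To make the trend in the table easier to read, the sketch below picks out the evaluation step with the lowest validation loss. The `rows` list is copied by hand from the table; the variable names are illustrative.

```python
# (step, validation loss) pairs copied from the training results table.
rows = [
    (10, 0.6092),
    (20, 0.5307),
    (30, 0.4887),
    (40, 0.4670),
    (50, 0.4544),
    (60, 0.4717),
    (70, 0.4605),
]

# Lowest validation loss across all evaluation steps.
best_step, best_loss = min(rows, key=lambda row: row[1])

# Last evaluation that was logged.
final_step, final_loss = rows[-1]

print(f"best:  step {best_step}, validation loss {best_loss:.4f}")
print(f"final: step {final_step}, validation loss {final_loss:.4f}")
```

The lowest validation loss in the table is 0.4544 at step 50, while the reported evaluation loss of 0.4605 corresponds to the final logged step (70). The training loss keeps falling after step 50 while the validation loss does not, which may indicate the onset of overfitting during the third epoch.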


### Framework versions

- PEFT 0.11.1
- Transformers 4.41.2
- Pytorch 2.3.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1