Filip committed · Commit 61046e0 · Parent(s): 96c0b50 · update

README.md CHANGED
@@ -11,7 +11,7 @@ pinned: false
 
 This is a space where you can compare two models using the technique "LLM as a Judge". LLM as a Judge uses an LLM itself to judge the responses from two LLMs and compare them on evaluation metrics relevant to the task.
 
-In this space, our default placeholder repos and models compare two LLMs finetuned on the same base model, the [Llama 3.2 3B parameter model](unsloth/Llama-3.2-3B-Instruct). Both of them are finetuned on the FineTome-100k dataset, but on different amounts of data.
+In this space, our default placeholder repos and models compare two LLMs finetuned on the same base model, the [Llama 3.2 3B parameter model](unsloth/Llama-3.2-3B-Instruct). Both of them are finetuned on the [FineTome-100k dataset](https://huggingface.co/datasets/mlabonne/FineTome-100k), but on different amounts of data.
 
 The models were finetuned using [Unsloth](https://unsloth.ai/), a framework which allows finetuning, training, and inference with LLMs 2x faster.

@@ -30,8 +30,8 @@ Quantization method: `float16`
 
 ### Hyperparameters
 
 Both models used the same hyperparameters during training.\
-`lora_alpha=16
-`lora_dropout=0
+`lora_alpha=16`\
+`lora_dropout=0`\
 `per_device_train_batch_size=2`\
 `gradient_accumulation_steps=4`\
 `learning_rate=2e-4`\

@@ -39,6 +39,15 @@ Both models used the same hyperparameters during training.\
 
 `weight_decay=0.01`\
 `lr_scheduler_type="linear"`
 
+These hyperparameters are [suggested as default](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama) when using Unsloth. However, to experiment with them we also finetuned a third model, keeping some of the values above but changing to:
+
+`dropout=0.3`\
+`per_device_train_batch_size=20`\
+`gradient_accumulation_steps=40`\
+`learning_rate=2e-2`\
+
+The effects of this were evident. One step took around 10 minutes due to the increased `gradient_accumulation_steps`, and it required a significant amount of GPU memory due to `per_device_train_batch_size=20`. It also overfitted in just 15 steps, reaching `loss=0`, due to the high learning rate. We wanted to see whether the dropout could prevent overfitting while keeping a high learning rate, but it could not.
+
 Both models have a max sequence length of 2048 tokens. This means that they only process the first 2048 tokens of the input.
 
 We chose float16 as the quantization method because, according to the [Unsloth wiki](https://github.com/unslothai/unsloth/wiki), it has the fastest conversion and retains 100% accuracy. However, it is slow and memory-hungry, which is a disadvantage.
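The step-time blow-up reported for the third model follows directly from the effective batch size (samples consumed per optimizer step). A quick sanity check with plain arithmetic, no Unsloth dependency (the helper function name is ours, not from the repo):

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    """Number of training samples consumed per optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps

# Default Unsloth-suggested values vs. the experimental third model.
default_ebs = effective_batch_size(2, 4)       # 2 * 4 = 8
experiment_ebs = effective_batch_size(20, 40)  # 20 * 40 = 800

print(default_ebs, experiment_ebs, experiment_ebs // default_ebs)  # 8 800 100
```

Each optimizer step of the experimental run thus processes 100x more samples than the default configuration, which accounts for both the ~10-minute step time and the much higher GPU memory usage.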
app.py CHANGED

@@ -15,9 +15,7 @@ def generate_response(model, prompt):
     return response["choices"][0]["text"]
 
 # Evaluate responses using the LoRA evaluation model
-def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b
-    if len(evaluation_criteria) > 3:
-        return "Error: Please select up to 3 evaluation criteria only."
+def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b):
 
     # Load models
     model_a_instance = load_user_model(repo_a, model_a)