Filip committed
Commit 61046e0 · 1 Parent(s): 96c0b50
Files changed (2):
  1. README.md +12 -3
  2. app.py +1 -3
README.md CHANGED
@@ -11,7 +11,7 @@ pinned: false
 
 This is a space where you can compare two models using the technique "LLM as a Judge". LLM as a Judge uses an LLM itself to judge the responses from two LLMs, comparing them on evaluation metrics relevant to the task.
 
-In this space, our default placeholder repos and models compare two LLMs finetuned from the same base model, the [Llama 3.2 3B parameter model](unsloth/Llama-3.2-3B-Instruct). Both are finetuned on the FineTome-100k dataset, but on different amounts of data.
+In this space, our default placeholder repos and models compare two LLMs finetuned from the same base model, the [Llama 3.2 3B parameter model](unsloth/Llama-3.2-3B-Instruct). Both are finetuned on the [FineTome-100k dataset](https://huggingface.co/datasets/mlabonne/FineTome-100k), but on different amounts of data.
 
 The models were finetuned using [Unsloth](https://unsloth.ai/), a framework that makes finetuning, training, and inference with LLMs 2x faster.
 
@@ -30,8 +30,8 @@ Quantization method: `float16`
 ### Hyperparameters
 
 Both models used the same hyperparameters during training.\
-`lora_alpha=16`
-`lora_dropout=0`
+`lora_alpha=16`\
+`lora_dropout=0`\
 `per_device_train_batch_size=2`\
 `gradient_accumulation_steps=4`\
 `learning_rate=2e-4`\
@@ -39,6 +39,15 @@ Both models used the same hyperparameters during training.\
 `weight_decay=0.01`\
 `lr_scheduler_type="linear"`
 
+These hyperparameters are [suggested as defaults](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama) when using Unsloth. However, to experiment with them, we also finetuned a third model, keeping some of the values above but changing to:
+
+`dropout=0.3`\
+`per_device_train_batch_size=20`\
+`gradient_accumulation_steps=40`\
+`learning_rate=2e-2`
+
+The effects were evident. One step took around 10 minutes due to the increased `gradient_accumulation_steps`, and training required a significant amount of GPU memory due to `per_device_train_batch_size=20`. The model also overfitted in just 15 steps, reaching `loss=0`, because of the high learning rate. We wanted to test whether the dropout could prevent overfitting alongside a high learning rate, but it could not.
+
 Both models have a max sequence length of 2048 tokens. This means they only process the first 2048 tokens of the input.
 
 We chose float16 as the quantization method because, according to the [Unsloth wiki](https://github.com/unslothai/unsloth/wiki), it has the fastest conversion and retains 100% accuracy. However, it is slow and memory-hungry, which is a disadvantage.
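For reference, the recipe these hyperparameters come from looks roughly like the following Unsloth + TRL sketch. This is a minimal illustration rather than the Space's actual training script: the LoRA rank `r=16`, the `max_steps` value, the pre-formatted `text` field, and the export call are assumptions based on the linked tutorial.

```python
# A minimal sketch of the finetuning setup described in the README.
# Assumptions (not stated above): LoRA rank r=16, max_steps=60, and a dataset
# already formatted into a "text" field; raw FineTome data needs chat-template
# formatting first.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Base model with the 2048-token max sequence length mentioned above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
)

# Attach the LoRA adapter with the shared hyperparameters.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # assumption: the rank is not stated in the README
    lora_alpha=16,
    lora_dropout=0,
)

dataset = load_dataset("mlabonne/FineTome-100k", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumption: dataset preprocessed to plain text
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        weight_decay=0.01,
        lr_scheduler_type="linear",
        max_steps=60,  # assumption: step count not stated in the README
        output_dir="outputs",
    ),
)
trainer.train()

# Export with float16, the quantization method chosen above.
model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
```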
app.py CHANGED
@@ -15,9 +15,7 @@ def generate_response(model, prompt):
     return response["choices"][0]["text"]
 
 # Evaluate responses using the LoRA evaluation model
-def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_criteria):
-    if len(evaluation_criteria) > 3:
-        return "Error: Please select up to 3 evaluation criteria only."
+def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b):
 
     # Load models
     model_a_instance = load_user_model(repo_a, model_a)
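The simplified signature above drops the criteria-count check. A hypothetical sketch of the judge flow it implies could look like this, with `judge_model` (the LoRA evaluation model) and the prompt wording as illustrative assumptions rather than the Space's actual code:

```python
# Hypothetical sketch of the LLM-as-a-Judge flow behind the new signature.
# load_user_model and generate_response are the helpers from app.py;
# judge_model (the LoRA evaluation model) and the prompt wording are assumptions.
def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b):
    # Load the two candidate models
    model_a_instance = load_user_model(repo_a, model_a)
    model_b_instance = load_user_model(repo_b, model_b)

    # Each candidate model answers the same prompt.
    response_a = generate_response(model_a_instance, prompt)
    response_b = generate_response(model_b_instance, prompt)

    # The judge LLM compares the two answers on task-relevant criteria.
    judge_prompt = (
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Judge which response better answers the prompt and explain briefly."
    )
    return generate_response(judge_model, judge_prompt)
```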