Filip committed · Commit 61046e0 · Parent(s): 96c0b50 · update

README.md CHANGED
@@ -11,7 +11,7 @@ pinned: false
 
 This is a space where you can compare two models using the technique "LLM as a Judge". LLM as a Judge uses an LLM itself to judge the responses from two LLMs and compare them on evaluation metrics relevant to the task.
 
-In this space, our default placeholder repos and models compare two LLMs finetuned on the same base model, the [Llama 3.2 3B parameter model](unsloth/Llama-3.2-3B-Instruct). Both of them are finetuned on the FineTome-100k dataset, but on different amounts of data.
+In this space, our default placeholder repos and models compare two LLMs finetuned on the same base model, the [Llama 3.2 3B parameter model](unsloth/Llama-3.2-3B-Instruct). Both of them are finetuned on the [FineTome-100k dataset](https://huggingface.co/datasets/mlabonne/FineTome-100k), but on different amounts of data.
 
 The models were finetuned using [Unsloth](https://unsloth.ai/), a framework which allows finetuning, training, and inference with LLMs 2x faster.

@@ -30,8 +30,8 @@ Quantization method: `float16`
 
 ### Hyperparameters
 
 Both models used the same hyperparameters during training.\
-`lora_alpha=16
-`lora_dropout=0
+`lora_alpha=16`\
+`lora_dropout=0`\
 `per_device_train_batch_size=2`\
 `gradient_accumulation_steps=4`\
 `learning_rate=2e-4`\

@@ -39,6 +39,15 @@ Both models used the same hyperparameters during training.\
 
 `weight_decay=0.01`\
 `lr_scheduler_type="linear"`
 
+These hyperparameters are [suggested as default](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama) when using Unsloth. However, to experiment with them we also finetuned a third model, keeping some of the values above but changing to:
+
+`dropout=0.3`\
+`per_device_train_batch_size=20`\
+`gradient_accumulation_steps=40`\
+`learning_rate=2e-2`\
+
+The effects of this were evident. One step took around 10 minutes due to the increased `gradient_accumulation_steps`, and it required a significant amount of GPU memory due to `per_device_train_batch_size=20`. It also overfitted in just 15 steps, reaching `loss=0`, due to the high learning rate. We wanted to see whether the dropout could prevent overfitting while keeping a high learning rate, but it could not.
+
 Both models have a max sequence length of 2048 tokens. This means that they only process the first 2048 tokens of the input.
 
 We chose float16 as the quantization method because, according to the [Unsloth wiki](https://github.com/unslothai/unsloth/wiki), it has the fastest conversion and retains 100% accuracy. However, it is slow and memory-hungry, which is a disadvantage.
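The step-time blow-up reported for the third model follows directly from the effective batch size (samples consumed per optimizer step). A quick sanity check with plain arithmetic, no Unsloth dependency (the helper function name is ours, not from the repo):

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    """Number of training samples consumed per optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps

# Default Unsloth-suggested values vs. the experimental third model.
default_ebs = effective_batch_size(2, 4)       # 2 * 4 = 8
experiment_ebs = effective_batch_size(20, 40)  # 20 * 40 = 800

print(default_ebs, experiment_ebs, experiment_ebs // default_ebs)  # 8 800 100
```

Each optimizer step of the experimental run thus processes 100x more samples than the default configuration, which accounts for both the ~10-minute step time and the much higher GPU memory usage.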
app.py CHANGED

@@ -15,9 +15,7 @@ def generate_response(model, prompt):
     return response["choices"][0]["text"]
 
 # Evaluate responses using the LoRA evaluation model
-def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b
-    if len(evaluation_criteria) > 3:
-        return "Error: Please select up to 3 evaluation criteria only."
+def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b):
 
     # Load models
     model_a_instance = load_user_model(repo_a, model_a)