Filip committed on
Commit afc14cf · 1 Parent(s): 048ca6a

update readme

Files changed (2)
  1. README.md +63 -0
  2. app.py +1 -1
README.md CHANGED
@@ -8,3 +8,66 @@ sdk_version: 5.0.1
 app_file: app.py
 pinned: false
 ---
+
+ This is a Space where you can compare two models using the "LLM as a Judge" technique. LLM as a Judge uses an LLM itself to judge the responses from two LLMs, comparing them against evaluation criteria that are relevant for the task.
+
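+ As a rough illustration of the flow (a minimal sketch, not this Space's exact code; it assumes llama-cpp-python for loading the GGUF files, and the prompt and criteria wording are illustrative):
+
+ ```python
+ # Minimal sketch of LLM-as-a-Judge: two candidate models answer the same
+ # prompt, then a judge model compares the answers. Assumes llama-cpp-python;
+ # the prompt and criteria wording here are illustrative.
+ from llama_cpp import Llama
+
+ def ask(repo_id: str, filename: str, prompt: str) -> str:
+     llm = Llama.from_pretrained(repo_id=repo_id, filename=filename, verbose=False)
+     out = llm.create_chat_completion(messages=[{"role": "user", "content": prompt}])
+     return out["choices"][0]["message"]["content"]
+
+ prompt = "Explain how to bake a chocolate cake without using eggs."
+ answer_a = ask("forestav/LoRA-2000", "unsloth.F16.gguf", prompt)
+ answer_b = ask("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf", prompt)
+
+ # The judge sees both answers and the criteria, and picks a winner.
+ verdict = ask(
+     "KolumbusLindh/LoRA-4100",
+     "unsloth.F16.gguf",
+     f"Prompt: {prompt}\n\nResponse A:\n{answer_a}\n\nResponse B:\n{answer_b}\n\n"
+     "Evaluate both responses on clarity and relevance, then declare a winner.",
+ )
+ print(verdict)
+ ```
+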
+ In this Space, the default placeholder repos and models compare two LLMs fine-tuned from the same base model, the [Llama 3.2 3B Instruct model](unsloth/Llama-3.2-3B-Instruct). Both were fine-tuned on the FineTome-100k dataset, but for a different number of training steps, and thus on a different amount of data.
+
+ The models were fine-tuned using [Unsloth](https://unsloth.ai/), a framework that makes fine-tuning, training, and inference with LLMs up to 2x faster.
+
+ ## Default models and their hyperparameters
+
+ ### forestav/LoRA-2000
+
+ - Fine-tuned for 2,000 steps
+ - Quantization method: float16
+
+ ### KolumbusLindh/LoRA-4100
+
+ - Fine-tuned for 4,100 steps
+ - Quantization method: float16
+
+ ### Hyperparameters
+
+ Both models used the same hyperparameters during training:
+
+ - `per_device_train_batch_size=2`
+ - `gradient_accumulation_steps=4`
+ - `learning_rate=2e-4`
+ - `optim="adamw_8bit"`
+ - `weight_decay=0.01`
+ - `lr_scheduler_type="linear"`
+
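+ For reference, a training setup along these lines (a sketch following Unsloth's published notebook pattern; the LoRA rank, the dataset repo id, and the omitted chat-template preprocessing are assumptions, not taken from this repo):
+
+ ```python
+ # Sketch of the fine-tuning setup, assuming Unsloth's usual SFT recipe.
+ # max_steps=2000 corresponds to forestav/LoRA-2000 (4100 for the other model).
+ from unsloth import FastLanguageModel
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+ from datasets import load_dataset
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Llama-3.2-3B-Instruct",
+     load_in_4bit=True,
+ )
+ model = FastLanguageModel.get_peft_model(model, r=16)  # LoRA rank is an assumption
+
+ # FineTome-100k; chat-template formatting is omitted here for brevity.
+ dataset = load_dataset("mlabonne/FineTome-100k", split="train")
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     train_dataset=dataset,
+     args=TrainingArguments(
+         per_device_train_batch_size=2,
+         gradient_accumulation_steps=4,
+         learning_rate=2e-4,
+         max_steps=2000,
+         optim="adamw_8bit",
+         weight_decay=0.01,
+         lr_scheduler_type="linear",
+         output_dir="outputs",
+     ),
+ )
+ trainer.train()
+ ```
+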
+ We chose float16 as the quantization method because it offers the fastest conversion and retains 100% accuracy. The disadvantage is that it is slow and memory-hungry.
+ Source: https://github.com/unslothai/unsloth/wiki
+
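+ Continuing the training sketch above, the export step would look roughly like this (`save_pretrained_gguf` is the interface described in the Unsloth wiki; the output directory name is illustrative):
+
+ ```python
+ # Export the fine-tuned model to GGUF at float16 (fastest conversion,
+ # no accuracy loss, but slow and memory-hungry). Directory name is illustrative.
+ model.save_pretrained_gguf("lora_model", tokenizer, quantization_method="f16")
+ ```
+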
+ ## Judge
+
+ We use the KolumbusLindh/LoRA-4100 model as the judge. For better accuracy, however, one should use a stronger model such as GPT-4, which can evaluate the responses more thoroughly.
+
+ ## Evaluation using GPT-4
+
+ To better evaluate our fine-tuned models, we also let GPT-4 act as the judge when each model answered the following prompts (a sketch of the judging setup follows the list):
+
+ 1. Describe step-by-step how to set up a tent in a windy environment.
+ 2. How-To Guidance: "Explain how to bake a chocolate cake without using eggs."
+ 3. Troubleshooting: "Provide instructions for troubleshooting a laptop that won’t turn on."
+ 4. Educational Explanation: "Teach a beginner how to solve a Rubik’s Cube in simple steps."
+ 5. DIY Project: "Give detailed instructions for building a birdhouse using basic tools."
+ 6. Fitness Routine: "Design a beginner-friendly 15-minute workout routine that requires no equipment."
+ 7. Cooking Tips: "Explain how to properly season and cook a medium-rare steak."
+ 8. Technical Guidance: "Write a step-by-step guide for setting up a local Git repository and pushing code to GitHub."
+ 9. Emergency Response: "Provide instructions for administering first aid to someone with a sprained ankle."
+ 10. Language Learning: "Outline a simple plan for a beginner to learn Spanish in 30 days."
+
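+ A sketch of this judging setup, assuming the OpenAI API (the rubric wording is illustrative, not the exact judging prompt used for these results):
+
+ ```python
+ # Hedged sketch: GPT-4 judges a pair of answers via the OpenAI API.
+ # Assumes answer_a and answer_b were generated beforehand by the two models.
+ from openai import OpenAI
+
+ client = OpenAI()  # reads OPENAI_API_KEY from the environment
+
+ def judge(prompt: str, answer_a: str, answer_b: str) -> str:
+     rubric = (
+         f"Prompt:\n{prompt}\n\n"
+         f"Response A:\n{answer_a}\n\n"
+         f"Response B:\n{answer_b}\n\n"
+         "Score each response from 1-10 on clarity, completeness, and accuracy, "
+         "then state which response is better overall and why."
+     )
+     result = client.chat.completions.create(
+         model="gpt-4",
+         messages=[{"role": "user", "content": rubric}],
+     )
+     return result.choices[0].message.content
+ ```
+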
+ ### Results
+
+ #### Prompt 1: Describe step-by-step how to set up a tent in a windy environment.
app.py CHANGED
@@ -75,7 +75,7 @@ with gr.Blocks(title="LLM as a Judge") as demo:
     gr.Markdown("Welcome to the LLM as a Judge demo! This application uses the LoRA model to evaluate responses generated by two different models based on user-specified criteria. You can select up to 3 evaluation criteria and provide a prompt to generate responses from the models. The LoRA model will then evaluate the responses based on the selected criteria and determine the winner.")
 
     # Model inputs
-    repo_a_input = gr.Textbox(label="Model A Repository", placeholder="Enter the Hugging Face repo name for Model A...", value="forestav/gguf_lora_model")
+    repo_a_input = gr.Textbox(label="Model A Repository", placeholder="Enter the Hugging Face repo name for Model A...", value="forestav/LoRA-2000")
     model_a_input = gr.Textbox(label="Model A File Name", placeholder="Enter the model filename for Model A...", value="unsloth.F16.gguf")
     repo_b_input = gr.Textbox(label="Model B Repository", placeholder="Enter the Hugging Face repo name for Model B...", value="KolumbusLindh/LoRA-4100")
     model_b_input = gr.Textbox(label="Model B File Name", placeholder="Enter the model filename for Model B...", value="unsloth.F16.gguf")