Filip committed · update readme
Commit afc14cf · 1 Parent(s): 048ca6a
README.md
CHANGED
@@ -8,3 +8,66 @@ sdk_version: 5.0.1
 app_file: app.py
 pinned: false
 ---
+
+This is a space where you can compare two models using the technique "LLM as a Judge". LLM as a Judge uses an LLM itself to judge the responses from two LLMs, and compares them on evaluation metrics relevant to the task.
+
+In this space, the default placeholder repos and models compare two LLMs finetuned from the same base model, the [Llama 3.2 3B parameter model](unsloth/Llama-3.2-3B-Instruct). Both are finetuned on the FineTome-100k dataset, but on different amounts of data.
+
+The models were finetuned using [Unsloth](https://unsloth.ai/), a framework that makes finetuning, training, and inference with LLMs up to 2x faster.
+
+## Default models and their hyperparameters
+
+### forestav/LoRA-2000
+
+Finetuned for 2000 steps.
+Quantization method: float16
+
+### KolumbusLindh/LoRA-4100
+
+Finetuned for 4100 steps.
+Quantization method: float16
+
+### Hyperparameters
+
+Both models used the same hyperparameters during training.
+
+`per_device_train_batch_size=2`
+`gradient_accumulation_steps=4`
+`learning_rate=2e-4`
+`optim="adamw_8bit"`
+`weight_decay=0.01`
+`lr_scheduler_type="linear"`
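As a rough sketch of what these settings imply (assuming the standard Hugging Face training semantics; the helper function below is illustrative, not part of the Space's code): the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps`, and a linear scheduler decays the learning rate to zero over the run.

```python
# Illustrative sketch of the training-config semantics above.
# Assumes standard Hugging Face Trainer behavior; not the Space's actual code.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
learning_rate = 2e-4

# Effective batch size per optimizer step (single device)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

def linear_lr(step: int, max_steps: int, base_lr: float = learning_rate) -> float:
    """Linear decay from base_lr to 0 over max_steps (no warmup assumed)."""
    return base_lr * max(0.0, 1.0 - step / max_steps)

print(effective_batch_size)   # 8
print(linear_lr(2050, 4100))  # halfway through a 4100-step run -> 1e-4
```

So gradients from 4 micro-batches of 2 examples are accumulated before each optimizer update, giving an effective batch of 8.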
+
+We chose float16 as the quantization method because it has the fastest conversion and retains full accuracy. However, it is slow and memory-hungry, which is a disadvantage.
+Source: https://github.com/unslothai/unsloth/wiki
+
+## Judge
+
+We use the KolumbusLindh/LoRA-4100 model as the judge. For better accuracy, however, one should use a stronger model such as GPT-4, which can evaluate the responses more thoroughly.
+
+## Evaluation using GPT-4
+
+To better evaluate our finetuned models, we let GPT-4 act as the judge when each model answered the following prompts:
+
+1. Describe step-by-step how to set up a tent in a windy environment.
+2. How-To Guidance: "Explain how to bake a chocolate cake without using eggs."
+3. Troubleshooting: "Provide instructions for troubleshooting a laptop that won’t turn on."
+4. Educational Explanation: "Teach a beginner how to solve a Rubik’s Cube in simple steps."
+5. DIY Project: "Give detailed instructions for building a birdhouse using basic tools."
+6. Fitness Routine: "Design a beginner-friendly 15-minute workout routine that requires no equipment."
+7. Cooking Tips: "Explain how to properly season and cook a medium-rare steak."
+8. Technical Guidance: "Write a step-by-step guide for setting up a local Git repository and pushing code to GitHub."
+9. Emergency Response: "Provide instructions for administering first aid to someone with a sprained ankle."
+10. Language Learning: "Outline a simple plan for a beginner to learn Spanish in 30 days."
+
+### Results
+
+#### Prompt 1: Describe step-by-step how to set up a tent in a windy environment.
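The LLM-as-a-Judge flow the README describes can be sketched as follows. This is a minimal outline with stubbed model calls: `generate` and `judge` are hypothetical placeholders standing in for real LLM inference (the Space's `.gguf` files suggest something like llama-cpp-python), not the Space's actual code.

```python
# Minimal LLM-as-a-Judge sketch with stubbed model calls.
# In the real Space, generation and judging run actual GGUF models.

def generate(model_name: str, prompt: str) -> str:
    # Stub: a real implementation would run inference on the named model.
    return f"[{model_name}] answer to: {prompt}"

def judge(prompt: str, answer_a: str, answer_b: str, criteria: list[str]) -> str:
    # Build the judging prompt a judge model would receive.
    judging_prompt = (
        f"Prompt: {prompt}\n"
        f"Response A: {answer_a}\n"
        f"Response B: {answer_b}\n"
        f"Evaluate both responses on: {', '.join(criteria)}. "
        "Declare a winner (A or B) and justify your choice."
    )
    # A real judge LLM would be called with `judging_prompt`;
    # this stub just picks the longer answer as a placeholder verdict.
    return "A" if len(answer_a) >= len(answer_b) else "B"

prompt = "Describe step-by-step how to set up a tent in a windy environment."
a = generate("forestav/LoRA-2000", prompt)
b = generate("KolumbusLindh/LoRA-4100", prompt)
winner = judge(prompt, a, b, ["clarity", "completeness", "accuracy"])
```

The key design point is that the judge never generates its own answer; it only scores the two candidate responses against the user-selected criteria.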
app.py
CHANGED
@@ -75,7 +75,7 @@ with gr.Blocks(title="LLM as a Judge") as demo:
     gr.Markdown("Welcome to the LLM as a Judge demo! This application uses the LoRA model to evaluate responses generated by two different models based on user-specified criteria. You can select up to 3 evaluation criteria and provide a prompt to generate responses from the models. The LoRA model will then evaluate the responses based on the selected criteria and determine the winner.")
 
     # Model inputs
-    repo_a_input = gr.Textbox(label="Model A Repository", placeholder="Enter the Hugging Face repo name for Model A...", value="forestav/
+    repo_a_input = gr.Textbox(label="Model A Repository", placeholder="Enter the Hugging Face repo name for Model A...", value="forestav/LoRA-2000")
     model_a_input = gr.Textbox(label="Model A File Name", placeholder="Enter the model filename for Model A...", value="unsloth.F16.gguf")
     repo_b_input = gr.Textbox(label="Model B Repository", placeholder="Enter the Hugging Face repo name for Model B...", value="KolumbusLindh/LoRA-4100")
     model_b_input = gr.Textbox(label="Model B File Name", placeholder="Enter the model filename for Model B...", value="unsloth.F16.gguf")