Filip committed
Commit 96c0b50 · 1 Parent(s): 6bd03c4

Files changed (3):
  1. .gitignore +1 -0
  2. README.md +5 -1
  3. app.py +6 -10
.gitignore ADDED
@@ -0,0 +1 @@
+venv
README.md CHANGED
@@ -30,13 +30,17 @@ Quantization method: `float16`
 ### Hyperparameters
 
 Both models used the same hyperparameters during training.\
-`per_device_train_batch_size = 2`\
+`lora_alpha=16`\
+`lora_dropout=0`\
+`per_device_train_batch_size=2`\
 `gradient_accumulation_steps=4`\
 `learning_rate=2e-4`\
 `optim="adamw_8bit"`\
 `weight_decay=0.01`\
 `lr_scheduler_type="linear"`
 
+Both models use a max sequence length of 2048 tokens, meaning they only process the first 2048 tokens of the input.
+
 We chose float16 as the quantization method because, according to the [Unsloth wiki](https://github.com/unslothai/unsloth/wiki), it has the fastest conversion and retains 100% accuracy. However, it is slow and memory-hungry, which is a disadvantage.
 
 ## Judge
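
For context on the hyperparameters above, the following is a minimal sketch (not part of this commit) of how they might be wired together with Unsloth's `FastLanguageModel` and TRL's `SFTTrainer`. The base model name, LoRA rank, target modules, and the toy dataset are placeholder assumptions; only the values explicitly listed in the README are taken from it.

```python
# Sketch only: how the README's hyperparameters could be passed to Unsloth + TRL.
# MODEL_NAME, r, target_modules, and the toy dataset are assumptions, not taken
# from this commit; the numeric values mirror the README.
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

MODEL_NAME = "unsloth/mistral-7b-bnb-4bit"  # placeholder base model
train_dataset = Dataset.from_dict({"text": ["example training sample"]})  # toy data

# Max sequence length from the README: only the first 2048 tokens are processed.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2048,
)

# LoRA settings listed in the README (rank and target modules are assumed).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Training hyperparameters listed in the README.
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    output_dir="outputs",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=training_args,
)
trainer.train()

# float16 GGUF export, the method the README describes as fastest to convert
# and lossless, at the cost of speed and memory during conversion.
model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
```

With these values, the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps = 2 × 4 = 8` sequences per optimizer step on each device.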
app.py CHANGED
@@ -32,14 +32,13 @@ def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b, evaluation_crit
     print(f"Response B: {response_b}")
 
     # Format the evaluation prompt
-    criteria_list = ", ".join(evaluation_criteria)
     evaluation_prompt = f"""
 Prompt: {prompt}
 
 Response A: {response_a}
 Response B: {response_b}
 
-Evaluation Criteria: {criteria_list}
+Evaluation Criteria: Relevance, Coherence and Completeness
 
 Please evaluate the responses based on the selected criteria. For each criterion, rate both responses on a scale from 1 to 4 and provide a justification. Finally, declare the winner (or 'draw' if they are equal).
 """
@@ -53,7 +52,7 @@ Please evaluate the responses based on the selected criterion
 
     # Combine results for display
     final_output = f"""
-    Evaluation Results:\n{evaluation_results}
+    {evaluation_results}
 """
     return final_output, response_a, response_b
 
@@ -82,10 +81,6 @@ with gr.Blocks(title="LLM as a Judge") as demo:
 
     # Prompt and criteria inputs
     prompt_input = gr.Textbox(label="Enter Prompt", placeholder="Enter the prompt here...", lines=3)
-    criteria_dropdown = gr.CheckboxGroup(
-        label="Select Up to 3 Evaluation Criteria",
-        choices=["Clarity", "Completeness", "Accuracy", "Relevance", "User-Friendliness", "Depth", "Creativity"]
-    )
 
     # Button and outputs
     evaluate_button = gr.Button("Evaluate Models")
@@ -94,7 +89,7 @@ with gr.Blocks(title="LLM as a Judge") as demo:
     with gr.Column():
         response_a = gr.Textbox(
             label="Response A",
-            placeholder="The response for Model A will appear here...",
+            placeholder="The response from Model A will appear here...",
             lines=20,
             interactive=False
         )
@@ -102,11 +97,12 @@ with gr.Blocks(title="LLM as a Judge") as demo:
     with gr.Column():
         response_b = gr.Textbox(
             label="Response B",
-            placeholder="The response for Model B will appear here...",
+            placeholder="The response from Model B will appear here...",
             lines=20,
             interactive=False
         )
 
+    gr.Markdown("### The LLMs are evaluated based on the criteria of Relevance, Coherence and Completeness.")
     evaluation_output = gr.Textbox(
         label="Evaluation Results",
         placeholder="The evaluation results will appear here...",
@@ -117,7 +113,7 @@ with gr.Blocks(title="LLM as a Judge") as demo:
 
     # Link evaluation function
     evaluate_button.click(
         fn=evaluate_responses,
-        inputs=[prompt_input, repo_a_input, model_a_input, repo_b_input, model_b_input, criteria_dropdown],
+        inputs=[prompt_input, repo_a_input, model_a_input, repo_b_input, model_b_input],
         outputs=[evaluation_output, response_a, response_b]
     )
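
To make the new wiring concrete, here is a minimal, self-contained sketch (not the actual app.py) of the Gradio layout after this commit: the criteria CheckboxGroup is removed, the criteria are fixed in the prompt, and the click handler passes five inputs. The stub judge function, the repo/model textbox labels, and the output sizes are assumptions for illustration only.

```python
# Sketch of the post-commit UI wiring; the stub below stands in for the real
# model and judge calls in app.py.
import gradio as gr

CRITERIA = "Relevance, Coherence and Completeness"  # now fixed in the prompt

def evaluate_responses(prompt, repo_a, model_a, repo_b, model_b):
    # Placeholder logic: the real app queries both models, then a judge model.
    response_a = f"[stub response from {repo_a}/{model_a}]"
    response_b = f"[stub response from {repo_b}/{model_b}]"
    evaluation_results = (
        f"Evaluation Criteria: {CRITERIA}\n"
        f"Prompt: {prompt}\n"
        "Winner: draw (stub)"
    )
    return evaluation_results, response_a, response_b

with gr.Blocks(title="LLM as a Judge") as demo:
    prompt_input = gr.Textbox(label="Enter Prompt", placeholder="Enter the prompt here...", lines=3)
    repo_a_input = gr.Textbox(label="Repo A")   # label assumed
    model_a_input = gr.Textbox(label="Model A")  # label assumed
    repo_b_input = gr.Textbox(label="Repo B")   # label assumed
    model_b_input = gr.Textbox(label="Model B")  # label assumed

    evaluate_button = gr.Button("Evaluate Models")
    with gr.Row():
        with gr.Column():
            response_a = gr.Textbox(label="Response A", lines=20, interactive=False)
        with gr.Column():
            response_b = gr.Textbox(label="Response B", lines=20, interactive=False)

    gr.Markdown("### The LLMs are evaluated based on the criteria of Relevance, Coherence and Completeness.")
    evaluation_output = gr.Textbox(label="Evaluation Results", lines=10)

    # Five inputs now, matching the updated inputs list in this commit.
    evaluate_button.click(
        fn=evaluate_responses,
        inputs=[prompt_input, repo_a_input, model_a_input, repo_b_input, model_b_input],
        outputs=[evaluation_output, response_a, response_b],
    )

if __name__ == "__main__":
    demo.launch()
```

Note that the hunk header suggests the original `evaluate_responses` signature still declares `evaluation_criteria`; the stub above instead matches the five-element `inputs` list that the click handler now passes.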