---
title: LLM As A Judge
emoji: π
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---
# LLM As A Judge
**LLM As A Judge** is a Gradio-based application for comparing the performance of two LLaMA models saved in GGUF format on a given prompt. It generates responses from both user-specified models, evaluates them against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.
## Features
- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.
## Requirements
- Only supports **LLaMA models** saved in **GGUF format**.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames, as in the sketch below.
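
For reference, a GGUF file hosted on the Hub can be fetched with `huggingface_hub` using exactly these two identifiers. The repository and filename below are taken from the example further down; this is a minimal sketch, not the app's internal code.

```python
# Minimal sketch: fetch a GGUF model by repository name and filename.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="forestav/gguf_lora_model",  # Hugging Face repository name
    filename="finetune_v2.gguf",         # GGUF file inside that repository
)
print(model_path)  # local path to the cached .gguf file
```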
## How It Works
1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt to generate responses.
3. **Select Evaluation Criteria**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate**:
- The app downloads and loads the specified models.
- Responses are generated for the given prompt using both models.
   - The **LoRA-4100 evaluation model** evaluates the responses based on the selected criterion (see the sketch after this list).
5. **View Results**: Ratings, detailed explanations, and the declared winner or draw are displayed.
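
A condensed sketch of steps 1–4 is shown below. It assumes the app loads GGUF files with `llama-cpp-python`; the helper names (`load_model`, `generate`) and the generation parameters are illustrative, not the actual `app.py` internals.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_model(repo_id: str, filename: str) -> Llama:
    """Download a GGUF file from Hugging Face and load it with llama.cpp."""
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=path, n_ctx=2048)

def generate(model: Llama, prompt: str) -> str:
    """Generate a single response for the given prompt."""
    result = model.create_completion(prompt, max_tokens=512, temperature=0.7)
    return result["choices"][0]["text"]

# Model details and prompt as entered in the UI (values from the example below).
model_a = load_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
model_b = load_model("forestav/gguf_lora_model", "finetune_v2.gguf")

prompt = "Explain the significance of the Turing Test in artificial intelligence."
response_a = generate(model_a, prompt)
response_b = generate(model_b, prompt)
```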
## Behind the Scenes
- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to objectively evaluate the responses.
- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically based on user input.
- **Inference**: Both user-specified models generate responses for the prompt, which are then evaluated by the LoRA-4100 model; a judging sketch follows this list.
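
The judging step can be pictured as one more completion call against the LoRA-4100 model, reusing the helpers from the sketch above. The prompt template and verdict handling here are assumptions for illustration; the app's actual evaluation prompt may differ.

```python
# Illustrative judging step; the prompt format below is an assumption.
criterion = "Clarity"
judge = load_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")

evaluation_prompt = f"""You are an impartial judge. Rate each response for {criterion}
on a scale of 1-10, explain your reasoning, and declare a winner or a draw.

Prompt: {prompt}

Response A:
{response_a}

Response B:
{response_b}

Verdict:"""

verdict = judge.create_completion(evaluation_prompt, max_tokens=512, temperature=0.0)
print(verdict["choices"][0]["text"])  # ratings, explanations, winner or draw
```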
## Example
**Input:**
- **Model A Repository**: `KolumbusLindh/LoRA-4100`
- **Model A Filename**: `unsloth.F16.gguf`
- **Model B Repository**: `forestav/gguf_lora_model`
- **Model B Filename**: `finetune_v2.gguf`
- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
- **Evaluation Criterion**: Clarity
**Output:**
- Detailed evaluation results with scores for each model's response.
- Explanations for the scores based on the selected criterion.
- Declaration of the winning model or a draw (a programmatic-access sketch follows).
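
If the Space exposes its endpoints through the Gradio API, the same example could in principle be run programmatically with `gradio_client`. The Space ID, endpoint, and argument order below are placeholders; check the Space's "Use via API" panel for the real signature.

```python
from gradio_client import Client

client = Client("<owner>/<space-name>")  # replace with the actual Space ID
result = client.predict(
    "KolumbusLindh/LoRA-4100",   # Model A repository
    "unsloth.F16.gguf",          # Model A filename
    "forestav/gguf_lora_model",  # Model B repository
    "finetune_v2.gguf",          # Model B filename
    "Explain the significance of the Turing Test in artificial intelligence.",
    "Clarity",                   # evaluation criterion
)
print(result)  # scores, explanations, and the declared winner or draw
```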
## Limitations
- Only works with **LLaMA models in GGUF format**.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.
## Configuration Reference
For detailed information on configuring a Hugging Face Space, visit the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).