---
title: LLM As A Judge
emoji: π
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
short_description: Compare the performance of different models.
---
# LLM As A Judge
**LLM As A Judge** is a Gradio-based application for comparing the performance of two LLaMA models saved in GGUF format on a given prompt. It generates responses from both user-specified models, evaluates them against a user-selected criterion, and declares a winner using a fine-tuned evaluation model.
## Features
- **User-Specified Models**: Compare any two LLaMA models by providing their Hugging Face repository names and model filenames.
- **Custom Prompts**: Test models with any prompt of your choice.
- **Evaluation Criteria**: Select from predefined criteria such as clarity, completeness, accuracy, relevance, user-friendliness, depth, or creativity.
- **Objective Evaluation**: Employs a specialized evaluation model fine-tuned to assess instruction-based responses.
## Requirements
- Only supports **LLaMA models** saved in **GGUF format**.
- Models must be hosted on Hugging Face and accessible via their repository names and filenames, as in the sketch below.
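
For reference, a GGUF file hosted on the Hub can be fetched with `huggingface_hub` using exactly these two identifiers. The repository and filename below are taken from the example further down; this is a minimal sketch, not the app's internal code.

```python
# Minimal sketch: fetch a GGUF model by repository name and filename.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="forestav/gguf_lora_model",  # Hugging Face repository name
    filename="finetune_v2.gguf",         # GGUF file inside that repository
)
print(model_path)  # local path to the cached .gguf file
```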
## How It Works
1. **Input Model Details**: Provide the repository names and filenames for both models.
2. **Input Prompt**: Enter the prompt to generate responses.
3. **Select Evaluation Criteria**: Choose an evaluation criterion (e.g., clarity or relevance).
4. **Generate Responses and Evaluate**:
- The app downloads and loads the specified models.
- Responses are generated for the given prompt using both models.
   - The **LoRA-4100 evaluation model** evaluates the responses based on the selected criterion (see the sketch after this list).
5. **View Results**: Ratings, detailed explanations, and the declared winner or draw are displayed.
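
A condensed sketch of steps 1–4 is shown below. It assumes the app loads GGUF files with `llama-cpp-python`; the helper names (`load_model`, `generate`) and the generation parameters are illustrative, not the actual `app.py` internals.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

def load_model(repo_id: str, filename: str) -> Llama:
    """Download a GGUF file from Hugging Face and load it with llama.cpp."""
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=path, n_ctx=2048)

def generate(model: Llama, prompt: str) -> str:
    """Generate a single response for the given prompt."""
    result = model.create_completion(prompt, max_tokens=512, temperature=0.7)
    return result["choices"][0]["text"]

# Model details and prompt as entered in the UI (values from the example below).
model_a = load_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")
model_b = load_model("forestav/gguf_lora_model", "finetune_v2.gguf")

prompt = "Explain the significance of the Turing Test in artificial intelligence."
response_a = generate(model_a, prompt)
response_b = generate(model_b, prompt)
```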
## Behind the Scenes
- **Evaluation Model**: The app uses the **LoRA-4100** model, a LLaMA 3.2 3B model fine-tuned on an instruction dataset, to objectively evaluate the responses.
- **Dynamic Model Loading**: The app downloads and loads models from Hugging Face dynamically based on user input.
- **Inference**: Both user-specified models generate responses for the prompt, which are then evaluated by the LoRA-4100 model; a judging sketch follows this list.
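
The judging step can be pictured as one more completion call against the LoRA-4100 model, reusing the helpers from the sketch above. The prompt template and verdict handling here are assumptions for illustration; the app's actual evaluation prompt may differ.

```python
# Illustrative judging step; the prompt format below is an assumption.
criterion = "Clarity"
judge = load_model("KolumbusLindh/LoRA-4100", "unsloth.F16.gguf")

evaluation_prompt = f"""You are an impartial judge. Rate each response for {criterion}
on a scale of 1-10, explain your reasoning, and declare a winner or a draw.

Prompt: {prompt}

Response A:
{response_a}

Response B:
{response_b}

Verdict:"""

verdict = judge.create_completion(evaluation_prompt, max_tokens=512, temperature=0.0)
print(verdict["choices"][0]["text"])  # ratings, explanations, winner or draw
```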
## Example
**Input:**
- **Model A Repository**: `KolumbusLindh/LoRA-4100`
- **Model A Filename**: `unsloth.F16.gguf`
- **Model B Repository**: `forestav/gguf_lora_model`
- **Model B Filename**: `finetune_v2.gguf`
- **Prompt**: *"Explain the significance of the Turing Test in artificial intelligence."*
- **Evaluation Criterion**: Clarity
**Output:**
- Detailed evaluation results with scores for each model's response.
- Explanations for the scores based on the selected criterion.
- Declaration of the winning model or a draw (a programmatic-access sketch follows).
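
If the Space exposes its endpoints through the Gradio API, the same example could in principle be run programmatically with `gradio_client`. The Space ID, endpoint, and argument order below are placeholders; check the Space's "Use via API" panel for the real signature.

```python
from gradio_client import Client

client = Client("<owner>/<space-name>")  # replace with the actual Space ID
result = client.predict(
    "KolumbusLindh/LoRA-4100",   # Model A repository
    "unsloth.F16.gguf",          # Model A filename
    "forestav/gguf_lora_model",  # Model B repository
    "finetune_v2.gguf",          # Model B filename
    "Explain the significance of the Turing Test in artificial intelligence.",
    "Clarity",                   # evaluation criterion
)
print(result)  # scores, explanations, and the declared winner or draw
```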
## Limitations
- Only works with **LLaMA models in GGUF format**.
- The evaluation model is optimized for instruction-based responses and may not generalize well to other tasks.
## Configuration Reference
For detailed information on configuring a Hugging Face Space, visit the [Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).