---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---

## **JinaJudge: Proxy Judgement for Russian LLM Arena**

### **Description**

This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. While the focus is on Russian LLM evaluation, the model can also be used for English-centric models.

---

### **Model Details**

This is an iterative update of the [kaleinaNyan/jina-v3-rullmarena-judge-300924](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge-300924) model:
- Increased the amount of training data (modestly, approximately 1.5×).
- Updated the data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones.
- Updated the validation set as well, to exclude such errors.
- The test set did not change (it contained no such erroneous judgements).

---

### **Evaluation**

Validation was based on **existing judgements** from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training.

NOTE: values in parentheses show the change relative to the previous model.

**Models evaluated**:
- **gemma-2-9b-it-sppo-iter3**
- **glm-4-9b-chat**
- **gpt-3.5-turbo-1106**
- **mistral-7b-instruct-v0.3**
- **storm-7b**

**Validation Performance (old validation set)**:
- **Accuracy**: 79.97% (-0.78)
- **Precision**: 78.25% (-0.31)
- **Recall**: 78.25% (-1.23)
- **F1-score**: 78.25% (-0.75)

NOTE: the cause of the drop (the subset of fixed judgements or something else) will be reported later.

**Validation Performance (new validation set)**:
- **Accuracy**: 83.59% (+2.48)
- **Precision**: 80.97% (+2.14)
- **Recall**: 80.97% (+1.22)
- **F1-score**: 80.97% (+1.77)

For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model.

**Test Performance**:
- **Accuracy**: 85.09% (+2.37)
- **Precision**: 83.20% (+3.09)
- **Recall**: 83.20% (+0.78)
- **F1-score**: 83.20% (+2.02)

---

### **Usage Example**

```python
from transformers import AutoModel

# The judge uses custom model code, hence trust_remote_code=True
jina = AutoModel.from_pretrained(
    "kaleinaNyan/jina-v3-rullmarena-judge-041024",
    trust_remote_code=True,
)

prompt_template = """
{user_prompt}
{assistant_a}
{assistant_b}
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model returns class scores per input; argmax picks the verdict
judgement = jina([example])[0].argmax()

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}

print(judgement_map[judgement])
```
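Since the model accepts a list of formatted strings, many pairs can be judged in batches, which matters when scoring a whole battle set. The sketch below is illustrative rather than part of the released code: the `battles` list and batch size are assumptions, and it reuses `jina` and `prompt_template` from the example above.

```python
import torch

# Hypothetical battle set: (user prompt, model A response, model B response)
battles = [
    ("What is the radius of Earth?", "model A answer", "model B answer"),
    # ...
]

batch_size = 32  # illustrative; tune to your GPU
verdicts = []
for i in range(0, len(battles), batch_size):
    batch = [
        prompt_template.format(user_prompt=p, assistant_a=a, assistant_b=b)
        for p, a, b in battles[i:i + batch_size]
    ]
    with torch.no_grad():
        scores = jina(batch)  # one row of class scores per pair, as above
    verdicts.extend(int(row.argmax()) for row in scores)

# 0 = A wins, 1 = tie, 2 = B wins (same mapping as judgement_map above)
```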
---

### **Generated ranking**

The ranking was obtained using a modified [Russian LLM Arena code](https://github.com/oKatanaaa/ru_llm_arena). All judgements were regenerated using the jina-judge model. Regenerating the whole board takes about 16 minutes on an RTX 3090 (roughly 23 seconds per model).

| Model                                            | Score | 95% CI      | Average #Tokens |
|--------------------------------------------------|-------|-------------|-----------------|
| gpt-4-1106-preview                               | 82.8  | (-2.2, 2.3) | 541             |
| gpt-4o-mini                                      | 75.3  | (-2.5, 2.9) | 448             |
| qwen-2.5-72b-it                                  | 73.1  | (-3.4, 3.1) | 557             |
| gemma-2-9b-it-sppo-iter3                         | 70.6  | (-3.9, 2.8) | 509             |
| gemma-2-27b-it                                   | 68.7  | (-2.8, 3.8) | 472             |
| t-lite-instruct-0.1                              | 67.5  | (-3.8, 3.8) | 810             |
| gemma-2-9b-it                                    | 67.0  | (-3.7, 3.3) | 459             |
| suzume-llama-3-8B-multilingual-orpo-borda-half   | 62.4  | (-3.5, 3.7) | 682             |
| glm-4-9b-chat                                    | 61.5  | (-3.7, 3.0) | 568             |
| phi-3-medium-4k-instruct                         | 60.4  | (-3.5, 3.7) | 566             |
| sfr-iterative-dpo-llama-3-8b-r                   | 57.2  | (-3.9, 2.2) | 516             |
| c4ai-command-r-v01                               | 55.0  | (-3.9, 3.1) | 529             |
| suzume-llama-3-8b-multilingual                   | 51.9  | (-2.8, 3.7) | 641             |
| mistral-nemo-instruct-2407                       | 51.9  | (-3.8, 3.7) | 403             |
| yandex_gpt_pro                                   | 50.3  | (-3.4, 3.1) | 345             |
| gpt-3.5-turbo-0125                               | 50.0  | (0.0, 0.0)  | 220             |
| hermes-2-theta-llama-3-8b                        | 49.3  | (-3.4, 3.9) | 485             |
| starling-lm-7b-beta                              | 48.3  | (-3.8, 4.0) | 629             |
| llama-3-8b-saiga-suzume-ties                     | 47.9  | (-3.9, 5.0) | 763             |
| llama-3-smaug-8b                                 | 47.6  | (-3.6, 3.1) | 524             |
| vikhr-it-5.4-fp16-orpo-v2                        | 46.8  | (-2.5, 2.7) | 379             |
| aya-23-8b                                        | 46.1  | (-3.9, 3.9) | 554             |
| saiga_llama3_8b_v6                               | 44.8  | (-3.4, 3.3) | 471             |
| qwen2-7b-instruct                                | 43.6  | (-3.0, 2.7) | 340             |
| vikhr-it-5.2-fp16-cp                             | 43.6  | (-4.1, 3.3) | 543             |
| openchat-3.5-0106                                | 42.8  | (-3.9, 3.3) | 492             |
| kolibri-mistral-0427-upd                         | 42.3  | (-4.2, 3.2) | 551             |
| paralex-llama-3-8b-sft                           | 41.8  | (-3.2, 3.7) | 688             |
| llama-3-instruct-8b-sppo-iter3                   | 41.7  | (-3.4, 3.3) | 502             |
| gpt-3.5-turbo-1106                               | 41.5  | (-2.9, 2.1) | 191             |
| mistral-7b-instruct-v0.3                         | 41.1  | (-4.3, 3.5) | 469             |
| gigachat_pro                                     | 40.9  | (-3.4, 3.6) | 294             |
| openchat-3.6-8b-20240522                         | 39.1  | (-3.2, 4.1) | 428             |
| vikhr-it-5.3-fp16-32k                            | 38.8  | (-3.5, 3.3) | 519             |
| hermes-2-pro-llama-3-8b                          | 38.4  | (-3.2, 3.1) | 463             |
| kolibri-vikhr-mistral-0427                       | 34.5  | (-2.9, 3.5) | 489             |
| vikhr-it-5.3-fp16                                | 33.5  | (-3.5, 3.8) | 523             |
| llama-3-instruct-8b-simpo                        | 32.7  | (-3.9, 3.6) | 417             |
| meta-llama-3-8b-instruct                         | 32.1  | (-3.4, 3.3) | 450             |
| neural-chat-7b-v3-3                              | 25.9  | (-2.7, 3.6) | 927             |
| gigachat_lite                                    | 25.4  | (-2.8, 2.5) | 276             |
| snorkel-mistral-pairrm-dpo                       | 10.3  | (-2.0, 2.3) | 773             |
| storm-7b                                         | 3.7   | (-1.3, 1.6) | 419             |
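Note that gpt-3.5-turbo-0125 sits at exactly 50.0 with a zero-width interval, which indicates it is the baseline the other models are compared against. The arena pipeline aggregates pairwise judgements into a score with bootstrapped confidence intervals (Arena-Hard style); the sketch below is a simplified win-rate variant for intuition only, not the actual scoring code.

```python
import numpy as np

def score_with_ci(verdicts, n_boot=1000, seed=0):
    """verdicts: 0 = model wins, 1 = tie, 2 = baseline wins.
    Returns a win-rate score in percent plus a bootstrapped 95% CI,
    expressed as offsets from the score (as in the table above)."""
    rng = np.random.default_rng(seed)
    v = np.asarray(verdicts)

    def win_rate(x):
        # A tie counts as half a win.
        return 100.0 * (np.mean(x == 0) + 0.5 * np.mean(x == 1))

    point = win_rate(v)
    boots = np.array([
        win_rate(rng.choice(v, size=v.size, replace=True))
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (round(lo - point, 1), round(hi - point, 1))
```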