---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---
## **JinaJudge: Proxy Judgement for Russian LLM Arena**
### **Description**
This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. Although its focus is Russian LLM evaluation, it can also be used for English-centric models.
---
### **Model Details**
This is a small upgrade to the [kaleinaNyan/jina-v3-rullmarena-judge](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge) model:
- Number of decoder blocks increased from 4 to 5.
- Hidden activation dimensionality reduced from 1024 to 512 (via a projection layer after the Jina encoder).
- Model size decreased from 614M to 589M parameters.
- I also tweaked some training hyperparameters, but the training data composition is the same.
Surprisingly, these changes gave a tangible performance improvement, so I decided to upload the model. As it turned out (after evaluation on the train set), the previous model was not expressive enough.
---
### **Evaluation**
The validation process used **existing judgements** from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training.
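As an illustration of the simplification step, here is a minimal sketch that collapses five-level verdicts into three classes. The verdict strings (`A>>B`, `A>B`, `A=B`, `B>A`, `B>>A`) and the names `VERDICT_TO_CLASS`, `to_three_class` and `simplify` are assumptions modelled on Arena-Hard-style judgements, not the actual preprocessing used for this model. The class ids follow the mapping in the usage example (0: A is better, 1: tie, 2: B is better).

```python
from typing import Optional

# Hypothetical five-level verdicts mapped onto the three training classes
VERDICT_TO_CLASS = {
    "A>>B": 0, "A>B": 0,   # A is better than B
    "A=B": 1,              # tie
    "B>A": 2, "B>>A": 2,   # B is better than A
}

def to_three_class(verdict: Optional[str]) -> Optional[int]:
    """Return the class id, or None when the verdict is missing or unrecognised."""
    if verdict is None:
        return None
    return VERDICT_TO_CLASS.get(verdict.strip())

def simplify(verdicts):
    """Filter out unusable verdicts and collapse the rest to three classes."""
    classes = (to_three_class(v) for v in verdicts)
    return [c for c in classes if c is not None]

raw = ["A>>B", "A=B", None, "B>A", "garbled", " B>>A "]
print(simplify(raw))  # [0, 1, 2, 2]
```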
NOTE: values in parentheses show the improvement relative to the previous model.
**Models evaluated**:
- **gemma-2-9b-it-sppo-iter3**
- **glm-4-9b-chat**
- **gpt-3.5-turbo-1106**
- **mistral-7b-instruct-v0.3**
- **storm-7b**
**Validation Performance**:
- **Accuracy**: 80.76% (+2.67)
- **Precision**: 78.56% (+2.74)
- **Recall**: 79.48% (+2.71)
- **F1-score**: 79.00% (+2.73)
For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model.
**Test Performance**:
- **Accuracy**: 82.72% (+2.64)
- **Precision**: 80.11% (+3.43)
- **Recall**: 82.42% (+4.69)
- **F1-score**: 81.18% (+4.10)
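
Metrics of this kind can be computed from reference (GPT-4) and predicted (judge) class ids. The sketch below assumes macro-averaging over the three classes, which this card does not state, so treat it as one plausible way to obtain such numbers rather than the exact evaluation script. The function name and the sample labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, classes=(0, 1, 2)):
    """Accuracy plus macro-averaged precision, recall and F1 (macro mode is an assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Made-up labels: reference judgements vs. proxy judge predictions
reference = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 0]
for name, value in classification_metrics(reference, predicted).items():
    print(f"{name}: {100 * value:.2f}%")
```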
---
### **Usage Example**
```python
from transformers import AutoModel
jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-300924", trust_remote_code=True)
prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()
user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"
example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)
# The model scores three classes; cast the argmax to a plain int for the dict lookup
judgement = int(jina([example])[0].argmax())
judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A",
}
print(judgement_map[judgement])
```
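
For scoring many pairs at once, the template and label map from the example above can be wrapped in small helpers. This is a sketch only: `build_examples` and `decode_judgements` are hypothetical names, and the decoding assumes the model returns one row of three class scores per input, in the order used by `judgement_map`, as the `[0].argmax()` call suggests. Plain lists stand in for the model output so the snippet runs without downloading weights.

```python
PROMPT_TEMPLATE = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

JUDGEMENT_MAP = {0: "A is better than B", 1: "A == B", 2: "B is better than A"}

def build_examples(triples):
    """Format (user_prompt, assistant_a, assistant_b) triples with the judge template."""
    return [
        PROMPT_TEMPLATE.format(user_prompt=u, assistant_a=a, assistant_b=b)
        for u, a, b in triples
    ]

def decode_judgements(scores):
    """Map each row of three class scores to its label via argmax."""
    labels = []
    for row in scores:
        row = list(row)
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(JUDGEMENT_MAP[best])
    return labels

examples = build_examples([
    ("2+2?", "4", "5"),
    ("Capital of France?", "Paris", "Paris"),
])
# With the real model this would be: scores = jina(examples)
scores = [[2.1, 0.3, -1.0], [0.1, 1.7, 0.2]]  # made-up stand-in values
print(decode_judgements(scores))  # ['A is better than B', 'A == B']
```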
---
### **Generated ranking**
The ranking was obtained using a modified [Russian LLM Arena code](https://github.com/oKatanaaa/ru_llm_arena).
All judgements were regenerated using the jina-judge model.
| Model | Score | 95% CI | Average #Tokens |
|--------------------------------------|-------|----------------------|-----------------|
| gpt-4-1106-preview | 81.6 | (-2.3, 3.0) | 541 |
| gpt-4o-mini | 76.0 | (-2.7, 2.4) | 448 |
| qwen-2.5-72b-it | 72.5 | (-3.6, 3.6) | 557 |
| gemma-2-9b-it-sppo-iter3 | 72.1 | (-3.7, 3.6) | 569 |
| gemma-2-27b-it | 71.1 | (-3.3, 3.2) | 482 |
| gemma-2-9b-it | 70.8 | (-3.4, 3.5) | 569 |
| t-lite-instruct-0.1 | 68.3 | (-3.8, 4.5) | 810 |
| suzume-llama-3-8b-multilingual-orpo | 62.9 | (-3.9, 4.0) | 682 |
| glm-4-9b-chat | 60.5 | (-3.9, 4.0) | 516 |
| sfr-iterative-dpo-llama-3-8b-r | 59.9 | (-4.0, 4.3) | 682 |
| c4ai-command-r-v01 | 56.9 | (-4.2, 3.8) | 516 |
| phi-3-medium-4k-instruct | 56.4 | (-2.8, 3.3) | 566 |
| mistral-nemo-instruct-2407 | 56.1 | (-2.9, 3.4) | 682 |
| yandex_gpt_pro | 51.7 | (-3.4, 3.4) | 345 |
| suzume-llama-3-8b-multilingual | 51.3 | (-3.4, 4.0) | 489 |
| hermes-2-theta-llama-3-8b | 50.9 | (-3.2, 3.4) | 485 |
| starling-lm-7b-beta | 50.2 | (-3.3, 3.4) | 495 |
| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
| llama-3-instruct-8b-sppo-iter3 | 49.8 | (-3.4, 4.0) | 763 |
| llama-3-8b-saiga-suzume-ties | 48.2 | (-4.1, 3.9) | 569 |
| llama-3-smaug-8b | 46.6 | (-3.9, 3.8) | 763 |
| vikhr-it-5.4-fp16-orpo-v2 | 46.6 | (-3.7, 4.0) | 379 |
| aya-23-8b | 46.3 | (-3.8, 3.9) | 571 |
| saiga-llama3-8b_v6 | 45.5 | (-3.8, 3.9) | 471 |
| vikhr-it-5.2-fp16-cp | 43.8 | (-3.9, 4.0) | 543 |
| qwen2-7b-instruct | 43.7 | (-2.5, 2.7) | 492 |
| openchat-3.5-0106 | 43.4 | (-3.3, 3.7) | 485 |
| gpt-3.5-turbo-1106 | 41.7 | (-2.9, 3.5) | 220 |
| kolibri-mistral-0427-upd | 41.5 | (-3.2, 3.5) | 551 |
| paralex-llama-3-8b-sft | 40.6 | (-3.8, 3.3) | 688 |
| mistral-7b-instruct-v0.3 | 40.3 | (-3.3, 3.4) | 469 |
| llama-3-instruct-8b-simpo | 40.2 | (-2.9, 3.7) | 551 |
| gigachat_pro | 40.2 | (-3.2, 3.5) | 294 |
| hermes-2-pro-llama-3-8b | 39.5 | (-2.9, 3.4) | 689 |
| vikhr-it-5.3-fp16-32k | 39.5 | (-2.8, 3.2) | 519 |
| openchat-3.6-8b-20240522 | 37.7 | (-3.3, 3.7) | 409 |
| meta-llama-3-8b-instruct | 37.5 | (-3.1, 3.5) | 450 |
| kolibri-vikhr-mistral-0427 | 37.1 | (-3.1, 3.8) | 488 |
| neural-chat-v3.3 | 36.5 | (-2.7, 3.6) | 523 |
| vikhr-it-5.1-fp16 | 36.4 | (-3.5, 3.5) | 448 |
| gigachat-lite | 36.0 | (-2.8, 3.0) | 523 |
| saiga-7b | 25.9 | (-3.1, 3.7) | 927 |
| storm-7b | 25.1 | (-3.6, 4.1) | 419 |
| snorkel-mistral-pairrm-dpo | 16.5 | (-3.8, 3.2) | 773 |
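
To give a feel for how three-class judgements can become a score with a confidence interval, here is a simplified sketch. It is not the arena's actual procedure (see the linked repository for that). It assumes each model is compared against a fixed baseline, counts a tie as half a win, and uses a percentile bootstrap for the interval, reported as offsets from the score like the table above. The 50.0 score with a zero-width interval for gpt-3.5-turbo-0125 is consistent with it serving as the baseline, though that is an inference. Function names and sample data are hypothetical.

```python
import random

def win_rate(judgements):
    """Score in percent: class 0 = model wins, 1 = tie (half a win), 2 = baseline wins."""
    points = {0: 1.0, 1: 0.5, 2: 0.0}
    return 100.0 * sum(points[j] for j in judgements) / len(judgements)

def bootstrap_ci(judgements, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, returned as offsets from the point estimate."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(judgements)
    stats = sorted(
        win_rate([judgements[rng.randrange(n)] for _ in range(n)])
        for _ in range(rounds)
    )
    lower = stats[int((alpha / 2) * rounds)]
    upper = stats[min(rounds - 1, int((1 - alpha / 2) * rounds))]
    score = win_rate(judgements)
    return score, (lower - score, upper - score)

# Made-up judgements: the evaluated model is assistant A, the baseline is assistant B
sample = [0] * 60 + [1] * 10 + [2] * 30
score, (lo, hi) = bootstrap_ci(sample)
print(f"{score:.1f} ({lo:+.1f}, {hi:+.1f})")
```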