---
license: apache-2.0
---
# lmarena-ai/p2l-135m-grk-01112025

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance.
To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt.
The core idea is to train an LLM that takes natural language prompts as input and outputs a vector of coefficients, which are then used to predict the human preference vote.
The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. 
Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard.

**Paper**: [Prompt-to-Leaderboard](https://arxiv.org/abs/2502.14855)

**Code**: [lmarena/p2l](https://github.com/lmarena/p2l)

This particular P2L model has a *Grounded Rao-Kupper* regression head, which we define below:

Let
$$
Y\in \{\mathsf{A}, \mathsf{B}, \mathsf{tie}, \mathsf{bad}\}
$$
and let
$$
\theta^*(z) = \big(\beta^*(z), \eta^*(z)\big), \quad \beta^*(z) \in \mathbb{R}^M, \ \eta^*(z) \in \mathbb{R}_{\geq 1}.
$$

For notational convenience, we define: 
$$
\varphi^*(z)_i := \exp(\beta^*(z)_i)
$$

Then the grounded Rao-Kupper model is defined as:
$$
    g_{\theta^*(z)}(y ; x) =
    \begin{cases}
        \frac{\varphi^*(z)_A}{\varphi^*(z)_A + \eta^*(z)\varphi^*(z)_B + 1} &  y = \mathsf{A} \\
        \frac{\varphi^*(z)_B}{\varphi^*(z)_B + \eta^*(z)\varphi^*(z)_A + 1} &  y = \mathsf{B}\\
        \frac{1}{1 + \varphi^*(z)_A + \varphi^*(z)_B} & y = \mathsf{bad}\\
        1 - \frac{\varphi^*(z)_A}{\varphi^*(z)_A + \eta^*(z)\varphi^*(z)_B + 1}  - \frac{\varphi^*(z)_B}{\varphi^*(z)_B + \eta^*(z)\varphi^*(z)_A + 1} - \frac{1}{1 + \varphi^*(z)_A + \varphi^*(z)_B} & y = \mathsf{tie}.
    \end{cases}
$$
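As a quick sanity check, the four cases above can be computed directly from a pair of coefficients. The sketch below is illustrative (the function name and example values are ours, not from the released code); note that the tie probability is defined as the remainder, so the four outcomes sum to one by construction:

```python
import math

def grounded_rk_probs(beta_a: float, beta_b: float, eta: float) -> dict:
    """Grounded Rao-Kupper outcome probabilities for one model pair.

    beta_a, beta_b: prompt-dependent coefficients for models A and B.
    eta: tie coefficient, assumed >= 1.
    """
    phi_a, phi_b = math.exp(beta_a), math.exp(beta_b)
    p_a = phi_a / (phi_a + eta * phi_b + 1)
    p_b = phi_b / (phi_b + eta * phi_a + 1)
    p_bad = 1 / (1 + phi_a + phi_b)
    p_tie = 1 - p_a - p_b - p_bad  # remainder, per the definition above
    return {"A": p_a, "B": p_b, "tie": p_tie, "bad": p_bad}

probs = grounded_rk_probs(beta_a=0.5, beta_b=-0.2, eta=1.3)
```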

See section 2.2 in our paper for more details on various regression heads.

## Serving
To serve a P2L model, please see our documentation on GitHub: [Serving P2L](https://github.com/lmarena/p2l?tab=readme-ov-file#serving-p2l).

Note: the P2L model returns outputs with the following structure:


```python
class P2LOutputs(ModelOutput):
    coefs: torch.FloatTensor = None # "betas" as described above
    eta: Optional[torch.FloatTensor] = None # tie coefficient (eta above)
    last_hidden_state: torch.FloatTensor = None # last hidden state from the transformer
```

To see which coefficient index corresponds to which model, consult the [`model_list.json`](./model_list.json) found in the repo of each P2L model: the coefficient at index `i` corresponds to the `i`-th entry of that list, and the models are always listed in sorted order.

The easiest way to get this list from inside code is with the following:

```python
import json
from huggingface_hub import hf_hub_download

fname = hf_hub_download(
    repo_id="lmarena-ai/p2l-135m-grk-01112025", filename="model_list.json", repo_type="model"
)

with open(fname) as fin:
    model_list = json.load(fin)
```
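Because the coefficients line up index-for-index with the model list, a prompt-specific leaderboard is just a sort of that pairing. A minimal sketch with made-up values (in practice `names` would be the loaded `model_list` and `coefs` would come from `P2LOutputs.coefs` for a given prompt):

```python
# Stand-in values for illustration only.
names = ["model-a", "model-b", "model-c"]  # would be model_list
coefs = [0.1, 0.9, -0.3]                   # would be P2LOutputs.coefs for one prompt

# Higher coefficient -> stronger model on this prompt.
leaderboard = sorted(zip(names, coefs), key=lambda kv: kv[1], reverse=True)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name} ({score:+.2f})")
```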



### Loading from Pretrained

To define and load the model:

```python
import json
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import (
    AutoTokenizer,
    LlamaModel,
    LlamaPreTrainedModel,
)
from transformers.utils import ModelOutput


@dataclass
class HeadOutputs(ModelOutput):
    coefs: torch.FloatTensor = None
    eta: Optional[torch.FloatTensor] = None
    gamma: Optional[torch.FloatTensor] = None


@dataclass
class P2LOutputs(ModelOutput):
    coefs: torch.FloatTensor = None
    eta: Optional[torch.FloatTensor] = None
    gamma: Optional[torch.FloatTensor] = None
    loss: Optional[torch.FloatTensor] = None
    last_hidden_state: torch.FloatTensor = None

class RKHead(nn.Module):
    def __init__(
        self,
        input_dim,
        output_dim,
        **kwargs,
    ) -> None:
        super().__init__()
        self.head = nn.Linear(
            in_features=input_dim, out_features=output_dim, bias=True
        )
        self.eta_head = nn.Linear(
            in_features=input_dim, out_features=1, bias=True
        )

    def forward(self, last_hidden_dim: torch.Tensor):
        coefs = self.head(last_hidden_dim)
        eta = self.eta_head(last_hidden_dim)

        return HeadOutputs(coefs=coefs, eta=eta)

class P2LModel(LlamaPreTrainedModel):
    def __init__(
        self,
        config,
        CLS_id,
        num_models,
        head_kwargs={},
        **kwargs,
    ):
        super().__init__(config)

        self.num_models = num_models
        self.cls_token_id = CLS_id

        self.model = LlamaModel(config)

        self.head = RKHead(
            input_dim=config.hidden_size,
            output_dim=self.num_models,
            **head_kwargs,
        )

        self.post_init()

    def freeze_transformer(self):
        for param in self.model.parameters():
            param.requires_grad = False

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def forward(self, input_ids, attention_mask, labels=None, weights=None):
        batch_size = input_ids.shape[0]

        hidden_outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=False,
        ).last_hidden_state  # (bs, num_token, embed_dim)

        cls_mask = input_ids == self.cls_token_id

        # select the hidden state at each sequence's CLS token
        cls_hidden_dim = hidden_outputs[cls_mask]

        assert (
            cls_hidden_dim.shape[0] == batch_size
        ), f"input ids {input_ids.shape}, cls_mask {cls_mask.shape}, cls_logit {cls_hidden_dim.shape}"

        head_output = self.head(cls_hidden_dim)

    
        outputs = P2LOutputs(
            coefs=head_output.coefs,
            last_hidden_state=cls_hidden_dim,
            eta=head_output.eta,
            gamma=head_output.gamma,
        )

        return outputs


fname = hf_hub_download(
    repo_id="lmarena-ai/p2l-135m-grk-01112025", filename="model_list.json", repo_type="model"
)

with open(fname) as fin:
    model_list = json.load(fin)

tokenizer = AutoTokenizer.from_pretrained("lmarena-ai/p2l-135m-grk-01112025")
model = P2LModel.from_pretrained(
    "lmarena-ai/p2l-135m-grk-01112025",
    CLS_id=tokenizer.cls_token_id,
    num_models=len(model_list),
    torch_dtype=torch.bfloat16,
)

```

## Citation

```
@misc{frick2025prompttoleaderboard,
      title={Prompt-to-Leaderboard}, 
      author={Evan Frick and Connor Chen and Joseph Tennyson and Tianle Li and Wei-Lin Chiang and Anastasios N. Angelopoulos and Ion Stoica},
      year={2025},
      eprint={2502.14855},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.14855}, 
}
```