Commit 8d5a97a
Parent(s): eb40ea1
Initial commit

Files changed:
- DISCLAIMER.md +23 -0
- README.md +33 -8
- app.py +227 -0
- requirements.txt +5 -0
DISCLAIMER.md
ADDED
@@ -0,0 +1,23 @@
+# Disclaimer
+
+This Space is primarily intended for exploration. Until otherwise
+stated, its results should be treated as points of reference rather
+than absolute fact. Viewers are encouraged to study the pipeline and
+understand the model before broadcasting strong opinions about model
+rankings based on what is seen here. Suggestions for improving this
+Space from those familiar with Alpaca or Bayesian data analysis are
+welcome!
+
+## Resources
+
+* [Source code](https://github.com/jerome-white/alpaca-bda) for
+  producing results
+
+## TODO
+
+- [ ] Extend the Stan model to incorporate ties and response
+      presentation ordering
+
+- [ ] Add details of the MCMC chains
+
+- [ ] Automate data processing
README.md
CHANGED
@@ -1,13 +1,38 @@
 ---
-title:
-
-colorFrom: green
-colorTo: blue
+title: alpaca-bt-eval
+app_file: app.py
 sdk: gradio
 sdk_version: 4.19.1
-app_file: app.py
-pinned: false
-license: apache-2.0
 ---
+[Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
+evaluation framework. It maintains a set of prompts, along with
+responses to those prompts from a collection of LLMs. It then presents
+pairs of responses to a judge that determines which response better
+addresses the prompt. Rather than compare all response pairs, the
+framework identifies a baseline model and compares all models to
+that. The standard method of ranking models is to sort by baseline
+model win percentage.
+
+This Space presents an alternative method of ranking based on the
+[Bradley–Terry
+model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
+(BT). Given a collection of items, Bradley–Terry estimates the
+_ability_ of each item based on pairwise comparisons between them. In
+sports, for example, that might be the ability of a given team based
+on games that team has played within a league. Once calculated,
+ability can be used to estimate the probability that one item will be
+better than another, even if those items have yet to be formally
+compared.
+
+The Alpaca project presents a good opportunity to apply BT in
+practice, especially since BT fits nicely into a Bayesian analysis
+framework. As LLMs become more pervasive, quantifying the uncertainty
+in their evaluation is increasingly important. Bayesian frameworks are
+good at that.
 
-
+This Space is divided into two primary sections: the first presents a
+ranking of models based on estimated ability. The figure on the right
+presents this ranking for the top 10 models, while the table below
+presents the full set. The second section estimates the probability
+that one model will be preferred to another. A final section at the
+bottom is a disclaimer that presents details about the workflow.
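The ranking described in this README rests on one identity: under Bradley–Terry, the probability that one item beats another is the inverse logit of the difference in their abilities. A minimal sketch of that identity (the ability values below are made up for illustration):

```python
import math

def win_probability(ability_1, ability_2):
    # Bradley-Terry: Pr(item 1 beats item 2) is the inverse logit
    # (logistic function) of the gap in their abilities
    return 1 / (1 + math.exp(-(ability_1 - ability_2)))

# Hypothetical abilities: a larger gap means a more lopsided matchup,
# and equal abilities give a coin flip
print(win_probability(1.2, 0.4))   # stronger model favoured (> 0.5)
print(win_probability(0.7, 0.7))   # -> 0.5
```

Note the complementarity: `win_probability(a, b) + win_probability(b, a)` is always 1, so the pairwise table is fully determined by one direction of each comparison.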
app.py
ADDED
@@ -0,0 +1,227 @@
+import math
+import operator as op
+import itertools as it
+import functools as ft
+import collections as cl
+from pathlib import Path
+
+import pandas as pd
+import gradio as gr
+import seaborn as sns
+from datasets import load_dataset
+from scipy.special import expit
+
+HDI = cl.namedtuple('HDI', 'lower, upper')
+
+#
+# See https://cran.r-project.org/package=HDInterval
+#
+def hdi(values, ci=0.95):
+    values = sorted(filter(math.isfinite, values))
+    if not values:
+        raise ValueError('Empty data set')
+
+    n = len(values)
+    exclude = n - math.floor(n * ci)
+
+    left = it.islice(values, exclude)
+    right = it.islice(values, n - exclude, None)
+
+    diffs = ((x, y, y - x) for (x, y) in zip(left, right))
+    (*args, _) = min(diffs, key=op.itemgetter(-1))
+
+    return HDI(*args)
+
+#
+#
+#
+def load(repo):
+    parameter = 'parameter'
+    items = [
+        'chain',
+        'sample',
+        parameter,
+        'model',
+        'value',
+    ]
+    dataset = load_dataset(repo)
+
+    return (dataset
+            .get('train')
+            .to_pandas()
+            .filter(items=items)
+            .query(f'{parameter} == "alpha"')
+            .drop(columns=parameter))
+
+def summarize(df, ci=0.95):
+    def _aggregate(i, g):
+        values = g['value']
+        interval = hdi(values, ci)
+
+        agg = {
+            'model': i,
+            'ability': values.median(),
+            'uncertainty': interval.upper - interval.lower,
+        }
+        agg.update(interval._asdict())
+
+        return agg
+
+    groups = df.groupby('model', sort=False)
+    records = it.starmap(_aggregate, groups)
+
+    return pd.DataFrame.from_records(records)
+
+def rank(df, ascending, name='rank'):
+    df = (df
+          .sort_values(by=['ability', 'uncertainty'],
+                       ascending=[ascending, not ascending])
+          .drop(columns='uncertainty')
+          .reset_index(drop=True))
+    df.index += 1
+
+    return df.reset_index(names=name)
+
+def compare(df, model_1, model_2):
+    mcol = 'model'
+    models = [
+        model_1,
+        model_2,
+    ]
+    view = (df
+            .query(f'{mcol} in @models')
+            .pivot(index=['chain', 'sample'],
+                   columns=mcol,
+                   values='value'))
+
+    return expit(view[model_1] - view[model_2])
+
+#
+#
+#
+class DataPlotter:
+    def __init__(self, df):
+        self.df = df
+
+    def plot(self):
+        ax = self.draw()
+        ax.grid(visible=True,
+                axis='both',
+                alpha=0.25,
+                linestyle='dotted')
+
+        fig = ax.get_figure()
+        fig.tight_layout()
+
+        return fig
+
+    def draw(self):
+        raise NotImplementedError()
+
+class RankPlotter(DataPlotter):
+    _y = 'y'
+
+    @ft.cached_property
+    def y(self):
+        return self.df[self._y]
+
+    def __init__(self, df, top=10):
+        view = rank(summarize(df), True, self._y)
+        view = (view
+                .tail(top)
+                .sort_values(by=self._y, ascending=False))
+        super().__init__(view)
+
+    def draw(self):
+        ax = self.df.plot.scatter('ability', self._y)
+        ax.hlines(self.y,
+                  xmin=self.df['lower'],
+                  xmax=self.df['upper'],
+                  alpha=0.5)
+        ax.set_ylabel('')
+        ax.set_yticks(self.y, self.df['model'])
+
+        return ax
+
+class ComparisonPlotter(DataPlotter):
+    def __init__(self, df, model_1, model_2, ci=0.95):
+        super().__init__(compare(df, model_1, model_2))
+        self.interval = hdi(self.df, ci)
+
+    def draw(self):
+        ax = sns.ecdfplot(self.df)
+
+        (_, color, *_) = sns.color_palette()
+        ax.axvline(x=self.df.median(),
+                   color=color,
+                   linestyle='dashed')
+        ax.axvspan(xmin=self.interval.lower,
+                   xmax=self.interval.upper,
+                   alpha=0.15,
+                   color=color)
+        ax.set_xlabel('Pr(M$_{1}$ > M$_{2}$)')
+
+        return ax
+
+def cplot(df, ci=0.95):
+    def _plot(model_1, model_2):
+        cp = ComparisonPlotter(df, model_1, model_2, ci)
+        return cp.plot()
+
+    return _plot
+
+#
+#
+#
+with gr.Blocks() as demo:
+    df = load('jerome-white/alpaca-bt-stan')
+
+    gr.Markdown('# Alpaca Bradley–Terry')
+    with gr.Row():
+        with gr.Column():
+            gr.Markdown(Path('README.md').read_text())
+
+        with gr.Column():
+            plotter = RankPlotter(df)
+            gr.Plot(plotter.plot())
+
+    with gr.Row():
+        view = rank(summarize(df), False)
+        columns = { x: f'HDI {x}' for x in HDI._fields }
+        for i in view.columns:
+            columns.setdefault(i, i.title())
+        view = (view
+                .rename(columns=columns)
+                .style.format(precision=4))
+
+        gr.Dataframe(view)
+
+    with gr.Row():
+        with gr.Column(scale=3):
+            display = gr.Plot()
+
+    with gr.Row():
+        with gr.Column():
+            gr.Markdown('''
+
+            Probability that Model 1 is preferred to Model 2. The
+            solid blue curve is a CDF of that distribution;
+            formally the inverse logit of the difference in model
+            abilities. The dashed orange vertical line is the
+            median, while the band surrounding it is its 95%
+            [highest density
+            interval](https://cran.r-project.org/package=HDInterval).
+
+            ''')
+        with gr.Column():
+            models = sorted(df['model'].unique(), key=lambda x: x.lower())
+            drops = ft.partial(gr.Dropdown, choices=models)
+            inputs = [ drops(label=f'Model {x}') for x in range(1, 3) ]
+
+            button = gr.Button(value='Compare!')
+            button.click(cplot(df), inputs=inputs, outputs=[display])
+
+    with gr.Accordion('Disclaimer', open=False):
+        gr.Markdown(Path('DISCLAIMER.md').read_text())
+
+demo.launch()
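The `hdi` helper in app.py computes a highest density interval by sliding a window of fixed probability mass across the sorted draws and keeping the narrowest one. A standalone copy of that logic (reproduced here so it can be exercised without the Space's dependencies) makes the behaviour easy to check on small inputs:

```python
import math
import operator as op
import itertools as it

# Same sliding-window HDI algorithm as app.py's hdi(), kept
# standalone for a quick sanity check
def hdi(values, ci=0.95):
    values = sorted(filter(math.isfinite, values))
    if not values:
        raise ValueError('Empty data set')

    n = len(values)
    exclude = n - math.floor(n * ci)

    # Candidate intervals pair the i-th smallest value with the value
    # (n - exclude) positions later; the narrowest pair wins
    left = it.islice(values, exclude)
    right = it.islice(values, n - exclude, None)

    diffs = ((x, y, y - x) for (x, y) in zip(left, right))
    (lower, upper, _) = min(diffs, key=op.itemgetter(-1))

    return (lower, upper)

# A large outlier is excluded because the window hugs the dense region
print(hdi([1.0, 1.1, 1.2, 100.0], ci=0.5))  # -> (1.0, 1.2)
```

On uniform data every candidate window has the same width, so the left-most one is returned; the interesting behaviour appears with skewed draws, where the interval shifts toward the dense region, as in the outlier example above.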
requirements.txt
ADDED
@@ -0,0 +1,5 @@
+datasets
+gradio
+pandas
+scipy
+seaborn
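The `compare` helper in app.py turns posterior draws into a preference distribution: it pivots the long chain/sample/model/value table so each draw aligns across the two models, then applies the inverse logit (`scipy.special.expit`) to the per-draw ability gap. A toy illustration with made-up draws (the ability values here are fabricated; only the layout matches what app.py loads):

```python
import pandas as pd
from scipy.special import expit

# Hypothetical posterior draws of per-model ability, in the long
# chain/sample/model/value layout that app.py's load() produces
df = pd.DataFrame({
    'chain': [1, 1, 1, 1, 1, 1, 1, 1],
    'sample': [1, 2, 3, 4, 1, 2, 3, 4],
    'model': ['a'] * 4 + ['b'] * 4,
    'value': [1.0, 1.2, 0.9, 1.1, 0.4, 0.5, 0.3, 0.6],
})

# Align draws across models, then map each ability gap to a probability
view = df.pivot(index=['chain', 'sample'], columns='model', values='value')
pr = expit(view['a'] - view['b'])  # per-draw Pr(a preferred to b)

print(pr.median())
```

Because every draw of model `a` exceeds the paired draw of model `b`, every entry of `pr` lies above 0.5; the median of this Series is what the Space's comparison plot marks with the dashed line.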