Commit 6ee7b0c
Parent(s): 8d5a97a
Use overview instead of readme
Hugging Face adds information to README.md
- OVERVIEW.md +38 -0
- app.py +1 -1
OVERVIEW.md
ADDED
@@ -0,0 +1,38 @@
+---
+title: alpaca-bt-eval
+app_file: app.py
+sdk: gradio
+sdk_version: 4.19.1
+---
+[Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
+evaluation framework. It maintains a set of prompts, along with
+responses to those prompts from a collection of LLMs. It then presents
+pairs of responses to a judge that determines which response better
+addresses the prompt. Rather than compare all response pairs, the
+framework identifies a baseline model and compares all models to
+that. The standard method of ranking models is to sort by baseline
+model win percentage.
+
+This Space presents an alternative method of ranking based on the
+[Bradley–Terry
+model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
+(BT). Given a collection of items, Bradley–Terry estimates the
+_ability_ of each item based on pairwise comparisons between them. In
+sports, for example, that might be the ability of a given team based
+on games that team has played within a league. Once calculated,
+ability can be used to estimate the probability that one item will be
+better than another, even if those items have yet to be formally
+compared.
+
+The Alpaca project presents a good opportunity to apply BT in
+practice, especially since BT fits nicely into a Bayesian analysis
+framework. As LLMs become more pervasive, quantifying the uncertainty
+in their evaluation is increasingly important. Bayesian frameworks are
+good at that.
+
+This Space is divided into two primary sections: the first presents a
+ranking of models based on estimated ability. The figure on the right
+presents this ranking for the top 10 models, while the table below
+presents the full set. The second section estimates the probability
+that one model will be preferred to another. A final section at the
+bottom is a disclaimer that presents details about the workflow.
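To make the Bradley–Terry mechanics described above concrete, here is a minimal, illustrative sketch, not the Space's actual code: it fits abilities to a handful of made-up pairwise win counts with the standard minorization–maximization update, then turns abilities into head-to-head win probabilities. The model names and counts are invented for the example, and the Space's Bayesian treatment would add priors and posterior uncertainty on top of this plain maximum-likelihood fit.

```python
from itertools import combinations

# Toy pairwise results: wins[(a, b)] = number of times model a beat model b.
# Model names and counts are invented for illustration.
wins = {
    ('gpt-4', 'llama-2'): 8, ('llama-2', 'gpt-4'): 2,
    ('gpt-4', 'davinci'): 9, ('davinci', 'gpt-4'): 1,
    ('llama-2', 'davinci'): 6, ('davinci', 'llama-2'): 4,
}
models = sorted({m for pair in wins for m in pair})

# Bradley–Terry assumes P(i beats j) = p_i / (p_i + p_j) for abilities p > 0.
# Fit the abilities with the classic minorization–maximization update.
p = {m: 1.0 for m in models}
for _ in range(200):
    updated = {}
    for i in models:
        total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
            for j in models if j != i
        )
        updated[i] = total_wins / denom
    scale = sum(updated.values())
    p = {m: v / scale for m, v in updated.items()}  # normalize; only ratios matter

def win_prob(a, b):
    """Estimated probability that model a is preferred to model b."""
    return p[a] / (p[a] + p[b])

for a, b in combinations(models, 2):
    print(f'P({a} > {b}) = {win_prob(a, b):.2f}')
```

Sorting by the fitted abilities corresponds to the ranking in the first section of the Space, and win_prob mirrors the head-to-head probabilities in the second, though the Space estimates both within a Bayesian framework.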
app.py
CHANGED
@@ -179,7 +179,7 @@ with gr.Blocks() as demo:
     gr.Markdown('# Alpaca Bradley–Terry')
     with gr.Row():
         with gr.Column():
-            gr.Markdown(Path('
+            gr.Markdown(Path('OVERVIEW.md').read_text())
 
         with gr.Column():
             plotter = RankPlotter(df)
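For context on where the changed line sits, here is a minimal, self-contained sketch of the same layout pattern, assuming Gradio 4.x. The placeholder in the right column is hypothetical; the real app builds the figure with RankPlotter(df).

```python
from pathlib import Path

import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown('# Alpaca Bradley–Terry')
    with gr.Row():
        with gr.Column():
            # Overview text is read once when the Blocks layout is built,
            # so changes to OVERVIEW.md appear after the Space restarts.
            gr.Markdown(Path('OVERVIEW.md').read_text())
        with gr.Column():
            # Hypothetical stand-in for the ranking figure; the real app
            # constructs it with RankPlotter(df).
            gr.Markdown('(top-10 ranking figure)')

if __name__ == '__main__':
    demo.launch()
```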