jerome-white committed
Commit 6ee7b0c · 1 Parent(s): 8d5a97a

Use overview instead of readme

Hugging Face adds information to README.md

Files changed (2)
  1. OVERVIEW.md +38 -0
  2. app.py +1 -1
OVERVIEW.md ADDED
@@ -0,0 +1,38 @@
+ ---
+ title: alpaca-bt-eval
+ app_file: app.py
+ sdk: gradio
+ sdk_version: 4.19.1
+ ---
+ [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
+ evaluation framework. It maintains a set of prompts, along with
+ responses to those prompts from a collection of LLMs. It then
+ presents pairs of responses to a judge that determines which response
+ better addresses the prompt. Rather than compare all response pairs,
+ the framework identifies a baseline model and compares every other
+ model to it. The standard method of ranking models is to sort them by
+ win percentage against the baseline.
+
+ This Space presents an alternative ranking method based on the
+ [Bradley–Terry
+ model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
+ (BT). Given a collection of items, Bradley–Terry estimates the
+ _ability_ of each item from pairwise comparisons between them. In
+ sports, for example, that might be the ability of a given team based
+ on the games that team has played within a league. Once calculated,
+ abilities can be used to estimate the probability that one item will
+ be better than another, even if those two items have never been
+ formally compared (a minimal sketch of this calculation appears after
+ the diffs below).
+
+ The Alpaca project presents a good opportunity to apply BT in
+ practice, especially since BT fits nicely into a Bayesian analysis
+ framework. As LLMs become more pervasive, quantifying the uncertainty
+ in their evaluation is increasingly important, and Bayesian
+ frameworks are well suited to that.
+
+ This Space is divided into two primary sections. The first presents a
+ ranking of models based on estimated ability: the figure on the right
+ shows this ranking for the top 10 models, while the table below lists
+ the full set. The second section estimates the probability that one
+ model will be preferred to another. A final section at the bottom is
+ a disclaimer with details about the workflow.
app.py CHANGED
@@ -179,7 +179,7 @@ with gr.Blocks() as demo:
      gr.Markdown('# Alpaca Bradley–Terry')
      with gr.Row():
          with gr.Column():
 -            gr.Markdown(Path('README.md').read_text())
 +            gr.Markdown(Path('OVERVIEW.md').read_text())
 
          with gr.Column():
              plotter = RankPlotter(df)
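
To make the overview's Bradley–Terry description concrete, here is a minimal, self-contained sketch of the two steps it mentions: estimating abilities from pairwise comparisons, then turning those abilities into preference probabilities. Everything here is hypothetical: the win counts are invented, `fit_bradley_terry` and `win_probability` are illustrative helpers rather than code from this Space, and the fit uses plain maximum likelihood (the classic minorization-maximization updates) rather than the Bayesian workflow the overview describes.

```python
# A minimal Bradley-Terry sketch. Assumptions: made-up win counts and
# a maximum-likelihood fit, not this Space's Bayesian workflow.

# wins[i][j] = how many times item i was preferred to item j
# (invented counts for three hypothetical models).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]

def fit_bradley_terry(wins, iterations=100):
    """Estimate BT strengths p_i with the classic minorization-
    maximization updates; an item's ability is log(p_i)."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opposing pair contributes n_ij / (p_i + p_j),
            # where n_ij counts all comparisons between i and j.
            denominator = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denominator)
        # BT is scale-invariant, so normalize to keep values stable.
        total = sum(updated)
        p = [x / total for x in updated]
    return p

def win_probability(p_i, p_j):
    """P(item i is preferred to item j) = p_i / (p_i + p_j)."""
    return p_i / (p_i + p_j)

p = fit_bradley_terry(wins)
# Even a pair compared only a few times gets a preference estimate.
print(win_probability(p[0], p[2]))
```

Sorting models by the fitted abilities corresponds to the ranking in the Space's first section, and the pairwise `win_probability` values correspond to its second; a Bayesian treatment would additionally put posterior distributions on the abilities, which is where the uncertainty quantification mentioned in the overview comes from.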