Commit 568b91b · Parent(s): 77c903f
Clarify text

Files changed:
- _DISCLAIMER.md +15 -11
- _README.md +18 -19
_DISCLAIMER.md CHANGED

@@ -1,12 +1,14 @@
 # Disclaimer

-This Space is primarily intended for exploration.
-
-
-
-
-Space from those familiar with Alpaca
-welcome!
+This Space is primarily intended for exploration. For now its results
+should be treated as points of reference rather than absolute
+facts. Viewers are encouraged to study the pipeline and understand the
+model to help put the results into context.
+
+Suggestions for improving this Space from those familiar with Alpaca
+or Bayesian data analysis are welcome! Please use the
+[community tab](https://huggingface.co/spaces/jerome-white/alpaca-eval/discussions)
+to do so.

 ## Resources

@@ -15,9 +17,11 @@ welcome!

 ## TODO

-
-
+* Extend the Stan model to incorporate ties and response presentation
+  ordering
+
+* Add details of the MCMC chains

-
+* Automate data processing

-
+* Explicit documentation of the process
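The TODO above mentions extending the Stan model to handle ties. The Space's actual Stan code is not part of this commit; purely as background, the sketch below shows one standard way ties can enter a Bradley–Terry-style likelihood (Davidson's extension). The abilities and the tie parameter `nu` are made-up illustrative values, not parameters of this Space's model.

```python
import math

def davidson_probabilities(ability_i, ability_j, nu=0.5):
    """One common Bradley-Terry extension for ties (Davidson, 1970).

    ability_* are log-scale abilities; nu >= 0 controls how often ties
    occur (nu = 0 recovers plain Bradley-Terry). All numbers here are
    illustrative, not estimates produced by this Space.
    """
    p_i, p_j = math.exp(ability_i), math.exp(ability_j)
    tie = nu * math.sqrt(p_i * p_j)
    total = p_i + p_j + tie
    return {"i wins": p_i / total, "j wins": p_j / total, "tie": tie / total}

print(davidson_probabilities(0.8, 0.2))
```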
_README.md CHANGED

@@ -1,32 +1,31 @@
 [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
 evaluation framework. It maintains a set of prompts, along with
-responses to those prompts from a collection of LLMs. It
-pairs of responses to a judge
-addresses the prompt. Rather than compare all response
-framework
-that.
-
+responses to those prompts from a collection of LLMs. It presents
+pairs of responses to a judge who determines which response better
+addresses the request of the prompt. Rather than compare all response
+pairs, the framework sets one model as a baseline, then individually
+compares all responses to that. Its primary method of ranking models
+is via win percentages over the baseline.

 This Space presents an alternative method of ranking based on the
 [Bradley–Terry
 model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
 (BT). Given a collection of items, Bradley–Terry estimates the
-_ability_ of each item based on pairwise comparisons between
-
-
-
-
-
+_ability_ of each item based on pairwise comparisons between
+them. Once calculated, ability can be used to estimate the probability
+that one item will be better than another, even if those items have
+not been formally compared. In sports, for example, ability might
+correspond to a team's strength within their league. Ability could then
+be used to predict outcomes between teams that have yet to play.

 The Alpaca project presents a good opportunity to apply BT in
 practice; especially since BT fits nicely into a Bayesian analysis
-framework. As LLMs become more pervasive, quantifying
-
-
+framework. As LLMs become more pervasive, quantifying uncertainty in
+their evaluation is increasingly important, something that Bayesian
+frameworks do well.

 This Space is divided into two primary sections: the first presents a
 ranking of models based on estimated ability. The figure on the right
-
-presents the full set. The second section estimates the probability
-that one model will be preferred to another.
-bottom is a disclaimer that presents details about the workflow.
+visualizes this ranking for the top 10 models, while the table below
+it presents the full set. The second section estimates the probability
+that one model will be preferred to another.
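To make the README's contrast concrete: Alpaca's default ranking reduces to a win percentage against a fixed baseline. A minimal sketch with hypothetical judgement records (the model names and counts below are invented, not real Alpaca output):

```python
from collections import defaultdict

# Hypothetical judgements: (model, baseline_model, judge_preferred_model).
judgements = [
    ("model-x", "baseline", "model-x"),
    ("model-x", "baseline", "baseline"),
    ("model-x", "baseline", "model-x"),
    ("model-y", "baseline", "baseline"),
    ("model-y", "baseline", "model-y"),
]

wins, totals = defaultdict(int), defaultdict(int)
for model, _, winner in judgements:
    totals[model] += 1
    wins[model] += int(winner == model)

# Rank models by their win percentage over the shared baseline.
win_pct = {m: wins[m] / totals[m] for m in totals}
for model in sorted(win_pct, key=win_pct.get, reverse=True):
    print(model, f"{win_pct[model]:.0%}")
```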
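The Bradley–Terry idea described in the README fits in a few lines. Below is a rough sketch using the logistic form of the model; the ability numbers and model names are invented for illustration and are not estimates produced by this Space.

```python
import math

def bt_win_probability(ability_a, ability_b):
    """Bradley-Terry preference probability from log-scale abilities."""
    return 1.0 / (1.0 + math.exp(ability_b - ability_a))

# Invented abilities for three stand-in models; not values estimated here.
abilities = {"model-x": 0.9, "model-y": 0.4, "model-z": -0.2}

# The Space's first section is, in essence, a sort by estimated ability.
ranking = sorted(abilities, key=abilities.get, reverse=True)

# The second section's pairwise view: any two models can be compared,
# even if no judge ever saw that particular pair.
p = bt_win_probability(abilities["model-x"], abilities["model-z"])
print(ranking, f"P(model-x preferred to model-z) = {p:.2f}")
```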
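On the Bayesian angle: the Space fits its model with Stan (per the disclaimer's TODO), and that code is not shown here. As a stand-in, the sketch below hand-rolls a tiny Metropolis sampler for the ability difference between two models, using made-up win counts, to show how a posterior turns a single preference estimate into a distribution with quantified uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: out of 100 judged pairs, model A was preferred 62 times.
wins_a, n = 62, 100

def log_posterior(delta):
    """Log posterior of delta = ability_A - ability_B: a Normal(0, 1) prior
    plus a Bradley-Terry (logistic) likelihood on the win counts."""
    p = 1.0 / (1.0 + np.exp(-delta))
    return wins_a * np.log(p) + (n - wins_a) * np.log(1.0 - p) - 0.5 * delta**2

# Tiny random-walk Metropolis sampler; a stand-in for the Space's Stan/MCMC fit.
samples, delta = [], 0.0
for _ in range(20_000):
    proposal = delta + rng.normal(scale=0.3)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(delta):
        delta = proposal
    samples.append(delta)

draws = np.asarray(samples[5_000:])      # discard burn-in
pref = 1.0 / (1.0 + np.exp(-draws))      # posterior of P(A preferred to B)
lo, hi = np.quantile(pref, [0.025, 0.975])
print(f"P(A preferred to B): mean={pref.mean():.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```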