Commit 6ee7b0c
Parent(s): 8d5a97a
Use overview instead of readme
Hugging Face adds information to README.md
- OVERVIEW.md +38 -0
- app.py +1 -1
OVERVIEW.md
ADDED
@@ -0,0 +1,38 @@
+---
+title: alpaca-bt-eval
+app_file: app.py
+sdk: gradio
+sdk_version: 4.19.1
+---
+[Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
+evaluation framework. It maintains a set of prompts, along with
+responses to those prompts from a collection of LLMs. It then presents
+pairs of responses to a judge that determines which response better
+addresses the prompt. Rather than compare all response pairs, the
+framework identifies a baseline model and compares all models to
+that. The standard method of ranking models is to sort by baseline
+model win percentage.
+
+This Space presents an alternative method of ranking based on the
+[Bradley–Terry
+model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
+(BT). Given a collection of items, Bradley–Terry estimates the
+_ability_ of each item based on pairwise comparisons between them. In
+sports, for example, that might be the ability of a given team based
+on games that team has played within a league. Once calculated,
+ability can be used to estimate the probability that one item will be
+better than another, even if those items have yet to be formally
+compared.
+
+The Alpaca project presents a good opportunity to apply BT in
+practice, especially since BT fits nicely into a Bayesian analysis
+framework. As LLMs become more pervasive, quantifying the uncertainty
+in their evaluation is increasingly important. Bayesian frameworks are
+good at that.
+
+This Space is divided into two primary sections: the first presents a
+ranking of models based on estimated ability. The figure on the right
+presents this ranking for the top 10 models, while the table below
+presents the full set. The second section estimates the probability
+that one model will be preferred to another. A final section at the
+bottom is a disclaimer that presents details about the workflow.
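To make the Bradley–Terry mechanics described above concrete, here is a minimal, illustrative sketch, not the Space's actual code: it fits abilities to a handful of made-up pairwise win counts with the standard minorization–maximization update, then turns abilities into head-to-head win probabilities. The model names and counts are invented for the example, and the Space's Bayesian treatment would add priors and posterior uncertainty on top of this plain maximum-likelihood fit.

```python
from itertools import combinations

# Toy pairwise results: wins[(a, b)] = number of times model a beat model b.
# Model names and counts are invented for illustration.
wins = {
    ('gpt-4', 'llama-2'): 8, ('llama-2', 'gpt-4'): 2,
    ('gpt-4', 'davinci'): 9, ('davinci', 'gpt-4'): 1,
    ('llama-2', 'davinci'): 6, ('davinci', 'llama-2'): 4,
}
models = sorted({m for pair in wins for m in pair})

# Bradley–Terry assumes P(i beats j) = p_i / (p_i + p_j) for abilities p > 0.
# Fit the abilities with the classic minorization–maximization update.
p = {m: 1.0 for m in models}
for _ in range(200):
    updated = {}
    for i in models:
        total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
            for j in models if j != i
        )
        updated[i] = total_wins / denom
    scale = sum(updated.values())
    p = {m: v / scale for m, v in updated.items()}  # normalize; only ratios matter

def win_prob(a, b):
    """Estimated probability that model a is preferred to model b."""
    return p[a] / (p[a] + p[b])

for a, b in combinations(models, 2):
    print(f'P({a} > {b}) = {win_prob(a, b):.2f}')
```

Sorting by the fitted abilities corresponds to the ranking in the first section of the Space, and win_prob mirrors the head-to-head probabilities in the second, though the Space estimates both within a Bayesian framework.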
app.py
CHANGED
@@ -179,7 +179,7 @@ with gr.Blocks() as demo:
     gr.Markdown('# Alpaca Bradley–Terry')
     with gr.Row():
         with gr.Column():
-            gr.Markdown(Path('
+            gr.Markdown(Path('OVERVIEW.md').read_text())
 
         with gr.Column():
             plotter = RankPlotter(df)
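For context on where the changed line sits, here is a minimal, self-contained sketch of the same layout pattern, assuming Gradio 4.x. The placeholder in the right column is hypothetical; the real app builds the figure with RankPlotter(df).

```python
from pathlib import Path

import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown('# Alpaca Bradley–Terry')
    with gr.Row():
        with gr.Column():
            # Overview text is read once when the Blocks layout is built,
            # so changes to OVERVIEW.md appear after the Space restarts.
            gr.Markdown(Path('OVERVIEW.md').read_text())
        with gr.Column():
            # Hypothetical stand-in for the ranking figure; the real app
            # constructs it with RankPlotter(df).
            gr.Markdown('(top-10 ranking figure)')

if __name__ == '__main__':
    demo.launch()
```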