Commit 908b597 · 1 Parent(s): 1df71ac
Ludwig Stumpp committed: Switch back to markdown as easier diffable

Files changed:
- .vscode/extensions.json  +1 -2
- README.md                +41 -7
- data/benchmarks.csv      +0 -4
- data/leaderboard.csv     +0 -20
- data/sources.csv         +0 -3
- streamlit_app.py         +74 -35
.vscode/extensions.json
CHANGED

@@ -1,6 +1,5 @@
 {
   "recommendations": [
-    "
-    "mechatroner.rainbow-csv"
+    "takumii.markdowntable"
   ]
 }
README.md
CHANGED

@@ -1,21 +1,55 @@
 # 🏆 llm-leaderboard
-A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome!
 
-## Leaderboard
-
+A joint community effort to create one central leaderboard for LLMs. Contributions and corrections welcome!
 Visit the interactive leaderboard at https://llm-leaderboard.streamlit.app/.
 
-![Screenshot of streamlit application](media/streamlit_screenshot.jpg)
-
 ## How to Contribute
 
 We are always happy for contributions! You can contribute by the following:
 
 - table work:
   - filling missing entries
-  - adding a new model as a new row
-  - adding a new benchmark as a new column in
+  - adding a new model as a new row to the leaderboard and add the source of the evaluation to the sources table
+  - adding a new benchmark as a new column in the leaderboard and add the benchmark to the benchmarks table
 - code work:
   - improving the existing code
   - requesting and implementing new features
 
+## Leaderboard
+
+| Model Name              | Chatbot Arena Elo | LAMBADA (zero-shot) | TriviaQA (zero-shot) |
+| ----------------------- | ----------------- | ------------------- | -------------------- |
+| alpaca-13b              | 1008              |                     |                      |
+| cerebras-7b             |                   | 0.636               | 0.141                |
+| cerebras-13b            |                   | 0.635               | 0.146                |
+| chatglm-6b              | 985               |                     |                      |
+| dolly-v2-12b            | 944               |                     |                      |
+| fastchat-t5-3b          | 951               |                     |                      |
+| gpt-neox-20b            |                   | 0.719               | 0.347                |
+| gptj-6b                 |                   | 0.683               | 0.234                |
+| koala-13b               | 1082              |                     |                      |
+| llama-7b                |                   | 0.738               | 0.443                |
+| llama-13b               | 932               |                     |                      |
+| mpt-7b                  |                   | 0.702               | 0.343                |
+| opt-7b                  |                   | 0.677               | 0.227                |
+| opt-13b                 |                   | 0.692               | 0.282                |
+| stablelm-base-alpha-7b  |                   | 0.533               | 0.049                |
+| stablelm-tuned-alpha-7b | 858               |                     |                      |
+| vicuna-13b              | 1169              |                     |                      |
+| oasst-pythia-7b         |                   | 0.667               | 0.198                |
+| oasst-pythia-12b        | 1065              | 0.704               | 0.233                |
+
+## Benchmarks
+
+| Benchmark Name    | Author         | Link                                     | Description |
+| ----------------- | -------------- | ---------------------------------------- | ----------- |
+| Chatbot Arena Elo | LMSYS          | https://lmsys.org/blog/2023-05-03-arena/ | "In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games." (Source: https://lmsys.org/blog/2023-05-03-arena/) |
+| LAMBADA           | Paperno et al. | https://arxiv.org/abs/1606.06031         | "The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse." (Source: https://huggingface.co/datasets/lambada) |
+| TriviaQA          | Joshi et al.   | https://arxiv.org/abs/1705.03551v2       | "We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions." (Source: https://arxiv.org/abs/1705.03551v2) |
+
+## Sources
+
+| Author   | Link                                     |
+| -------- | ---------------------------------------- |
+| LMSYS    | https://lmsys.org/blog/2023-05-03-arena/ |
+| MOSAICML | https://www.mosaicml.com/blog/mpt-7b     |
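As a concrete illustration of the contribution steps listed in the new README above (not part of the commit itself): adding a new model means appending one row to the Leaderboard table and one row with the evaluation's source to the Sources table. The model name, scores, author, and link below are placeholders only, not real entries:

| example-model-7b | | 0.700 | 0.300 |

| Example Lab | https://example.com/evaluation-writeup |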
data/benchmarks.csv
DELETED

@@ -1,4 +0,0 @@
-"Benchmark Name" ,"Author" ,"Link" ,"Description "
-"Chatbot Arena Elo" ,"LMSYS" ,"https://lmsys.org/blog/2023-05-03-arena/" ,"In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games. (Source: https://lmsys.org/blog/2023-05-03-arena/)"
-"LAMBADA" ,"Paperno et al." ,"https://arxiv.org/abs/1606.06031" ,"The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse. (Source: https://huggingface.co/datasets/lambada)"
-"TriviaQA" ,"Joshi et al." ,"https://arxiv.org/abs/1705.03551v2" ,"We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. (Source: https://arxiv.org/abs/1705.03551v2)"
data/leaderboard.csv
DELETED

@@ -1,20 +0,0 @@
-Model Name ,Chatbot Arena Elo ,LAMBADA (zero-shot) ,TriviaQA (zero-shot)
-alpaca-13b , 1008 , ,
-cerebras-7b , , 0.636 , 0.141
-cerebras-13b , , 0.635 , 0.146
-chatglm-6b , 985 , ,
-dolly-v2-12b , 944 , ,
-fastchat-t5-3b , 951 , ,
-gpt-neox-20b , , 0.719 , 0.347
-gptj-6b , , 0.683 , 0.234
-koala-13b , 1082 , ,
-llama-7b , , 0.738 , 0.443
-llama-13b , 932 , ,
-mpt-7b , , 0.702 , 0.343
-opt-7b , , 0.677 , 0.227
-opt-13b , , 0.692 , 0.282
-stablelm-base-alpha-7b , , 0.533 , 0.049
-stablelm-tuned-alpha-7b , 858 , ,
-vicuna-13b , 1169 , ,
-oasst-pythia-7b , , 0.667 , 0.198
-oasst-pythia-12b , 1065 , 0.704 , 0.233
data/sources.csv
DELETED

@@ -1,3 +0,0 @@
-Author ,Link
-LMSYS ,https://lmsys.org/blog/2023-05-03-arena/
-MOSAICML ,https://www.mosaicml.com/blog/mpt-7b
streamlit_app.py
CHANGED

@@ -1,25 +1,62 @@
 import pandas as pd
-import requests
 import streamlit as st
+import io
 
-REPO_URL = "https://github.com/LudwigStumpp/llm-leaderboard"
-LEADERBOARD_PATH = "data/leaderboard.csv"
-BENCHMARKS_PATH = "data/benchmarks.csv"
-SOURCES_PATH = "data/sources.csv"
 
+def extract_table_and_format_from_markdown_text(markdown_table: str) -> pd.DataFrame:
+    """Extracts a table from a markdown text and formats it as a pandas DataFrame.
 
+    Args:
+        text (str): Markdown text containing a table.
+
+    Returns:
+        pd.DataFrame: Table as pandas DataFrame.
+    """
+    df = (
+        pd.read_table(io.StringIO(markdown_table), sep="|", header=0, index_col=1)
+        .dropna(axis=1, how="all")  # drop empty columns
+        .iloc[1:]  # drop first row which is the "----" separator of the original markdown table
+        .sort_index(ascending=True)
+        .replace(r"^\s*$", float("nan"), regex=True)
+        .astype(float, errors="ignore")
+    )
+
+    # remove whitespace from column names and index
+    df.columns = df.columns.str.strip()
+    df.index = df.index.str.strip()
+
+    return df
+
+
+def extract_markdown_table_from_multiline(multiline: str, table_headline: str, next_headline_start: str = "#") -> str:
+    """Extracts the markdown table from a multiline string.
-
-
 
     Args:
-
-
+        multiline (str): content of README.md file.
+        table_headline (str): Headline of the table in the README.md file.
+        next_headline_start (str, optional): Start of the next headline. Defaults to "#".
 
     Returns:
-        str:
+        str: Markdown table.
+
+    Raises:
+        ValueError: If the table could not be found.
     """
-
-
+    # extract everything between the table headline and the next headline
+    table = []
+    start = False
+    for line in multiline.split("\n"):
+        if line.startswith(table_headline):
+            start = True
+        elif line.startswith(next_headline_start):
+            start = False
+        elif start:
+            table.append(line + "\n")
+
+    if len(table) == 0:
+        raise ValueError(f"Could not find table with headline '{table_headline}'")
+
+    return "".join(table)
 
 
 def filter_dataframe(df: pd.DataFrame) -> pd.DataFrame:

@@ -56,7 +93,7 @@ def filter_dataframe(df: pd.DataFrame) -> pd.DataFrame:
 
 
 def setup_basic():
-    title = "LLM-Leaderboard"
+    title = "🏆 LLM-Leaderboard"
 
     st.set_page_config(
         page_title=title,

@@ -73,24 +110,22 @@ def setup_basic():
     )
 
 
-def 
-
-
-    df = df.replace(r"^\s*$", float("nan"), regex=True)
-    df = df.astype(float, errors="ignore")
+def setup_leaderboard(readme: str):
+    leaderboard_table = extract_markdown_table_from_multiline(readme, table_headline="## Leaderboard")
+    df_leaderboard = extract_table_and_format_from_markdown_text(leaderboard_table)
 
-    st.markdown("
-    st.dataframe(filter_dataframe(
+    st.markdown("## Leaderboard")
+    st.dataframe(filter_dataframe(df_leaderboard))
 
 
-def setup_benchmarks():
-
-
+def setup_benchmarks(readme: str):
+    benchmarks_table = extract_markdown_table_from_multiline(readme, table_headline="## Benchmarks")
+    df_benchmarks = extract_table_and_format_from_markdown_text(benchmarks_table)
 
-    st.markdown("
+    st.markdown("## Covered Benchmarks")
 
-    selected_benchmark = st.selectbox("Select a benchmark to learn more:", 
-    df_selected = 
+    selected_benchmark = st.selectbox("Select a benchmark to learn more:", df_benchmarks.index.unique())
+    df_selected = df_benchmarks.loc[selected_benchmark]
     text = [
         f"Name: {selected_benchmark}  ",
     ]

@@ -99,14 +134,14 @@ def setup_benchmarks():
     st.markdown("\n".join(text))
 
 
-def setup_sources():
-
-
+def setup_sources(readme: str):
+    sources_table = extract_markdown_table_from_multiline(readme, table_headline="## Sources")
+    df_sources = extract_table_and_format_from_markdown_text(sources_table)
 
-    st.markdown("
+    st.markdown("## Sources of Above Figures")
 
-    selected_source = st.selectbox("Select a source to learn more:", 
-    df_selected = 
+    selected_source = st.selectbox("Select a source to learn more:", df_sources.index.unique())
+    df_selected = df_sources.loc[selected_source]
     text = [
         f"Author: {selected_source}  ",
     ]

@@ -126,9 +161,13 @@ def setup_footer():
 
 def main():
     setup_basic()
-
-
-
+
+    with open("README.md", "r") as f:
+        readme = f.read()
+
+    setup_leaderboard(readme)
+    setup_benchmarks(readme)
+    setup_sources(readme)
     setup_footer()
 
 
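Taken together, the new helpers make the README itself the data source: main() reads README.md, extract_markdown_table_from_multiline() cuts out the table under a given headline, and extract_table_and_format_from_markdown_text() turns it into a DataFrame. Below is a minimal, self-contained sketch of those two steps, using a made-up README snippet with placeholder scores rather than the real file; the slicing loop and the pandas calls mirror the functions added in the diff above, they are not an additional API.

import io

import pandas as pd

# A made-up README snippet shaped like the real one; model names and scores
# here are placeholders for illustration only.
readme = """
# 🏆 llm-leaderboard

## Leaderboard

| Model Name  | Chatbot Arena Elo | LAMBADA (zero-shot) |
| ----------- | ----------------- | ------------------- |
| vicuna-13b  | 1169              |                     |
| llama-7b    |                   | 0.738               |

## Benchmarks
"""

# Step 1 - mirror extract_markdown_table_from_multiline(): keep only the lines
# between the "## Leaderboard" headline and the next headline.
table_lines = []
inside = False
for line in readme.split("\n"):
    if line.startswith("## Leaderboard"):
        inside = True
    elif line.startswith("#"):
        inside = False
    elif inside:
        table_lines.append(line + "\n")
leaderboard_table = "".join(table_lines)

# Step 2 - mirror extract_table_and_format_from_markdown_text(): parse the
# pipe-separated text, drop the empty edge columns and the "----" separator
# row, and coerce numeric cells to floats where possible.
df = (
    pd.read_table(io.StringIO(leaderboard_table), sep="|", header=0, index_col=1)
    .dropna(axis=1, how="all")
    .iloc[1:]
    .sort_index(ascending=True)
    .replace(r"^\s*$", float("nan"), regex=True)
    .astype(float, errors="ignore")
)
df.columns = df.columns.str.strip()
df.index = df.index.str.strip()

print(df)

The printed frame has one row per model and float-typed benchmark columns, which is the shape that filter_dataframe() and st.dataframe() consume in setup_leaderboard().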