PeterKruger committed on
Commit df35d1a · 1 Parent(s): 1a0aefb

Add multi-leaderboard support with navigation, enhanced metrics, and correlations

README.md CHANGED
@@ -8,7 +8,156 @@ sdk_version: 5.27.0
8
  app_file: app.py
9
  pinned: false
10
  license: mit
11
- short_description: Summary of Results for AutoBench as of 24 April 2025
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
8
  app_file: app.py
9
  pinned: false
10
  license: mit
11
+ short_description: Interactive multi-run leaderboard for AutoBench LLM evaluations with historical navigation
12
  ---
13
 
14
+ # AutoBench LLM Leaderboard
15
+
16
+ Interactive leaderboard for AutoBench, where Large Language Models (LLMs) evaluate and rank responses from other LLMs. This application supports multiple benchmark runs with seamless navigation between different time periods.
17
+
18
+ ## 🌟 Features
19
+
20
+ ### Multi-Run Navigation
21
+ - **📊 Run Selector**: Switch between different AutoBench runs using the dropdown menu
22
+ - **🕐 Historical Data**: View and compare results across different time periods
23
+ - **🔄 Reactive Interface**: All tabs and visualizations update automatically when switching runs (see the sketch after this list)
24
+ - **📈 Enhanced Metrics**: Support for evaluation iterations and fail rates in newer runs
25
+
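+ A minimal sketch of how this reactive behavior can be wired in Gradio. The loader below is only a stand-in for the real `load_run_data()` in `app.py`; the run IDs match the directories added in this commit:
+ ```python
+ import gradio as gr
+ import pandas as pd
+
+ def load_run_data(run_id: str) -> dict:
+     # Stand-in for the real loader in app.py, which reads the run's CSV files.
+     return {"summary_display": pd.DataFrame({"Model": [], "AutoBench": []})}
+
+ def on_run_change(run_id: str) -> pd.DataFrame:
+     return load_run_data(run_id).get("summary_display", pd.DataFrame())
+
+ with gr.Blocks() as demo:
+     run_selector = gr.Dropdown(
+         choices=["run_2025-08-14", "run_2025-04-25"],
+         value="run_2025-08-14",
+         label="Select AutoBench Run",
+     )
+     ranking_table = gr.DataFrame(value=on_run_change("run_2025-08-14"))
+     # Every component listed in `outputs` is refreshed when the dropdown changes.
+     run_selector.change(fn=on_run_change, inputs=[run_selector], outputs=[ranking_table])
+
+ if __name__ == "__main__":
+     demo.launch()
+ ```
+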
26
+ ### Comprehensive Analysis
27
+ - **Overall Ranking**: Model performance with AutoBench scores, costs, latency, and reliability metrics
28
+ - **Benchmark Comparison**: Correlations with Chatbot Arena, AAI Index, and MMLU benchmarks
29
+ - **Performance Plots**: Interactive scatter plots showing cost vs. performance trade-offs
30
+ - **Cost & Latency Analysis**: Detailed breakdown by domain and response time percentiles
31
+ - **Domain Performance**: Model rankings across specific knowledge areas
32
+
33
+ ### Dynamic Features
34
+ - **📊 Benchmark Correlations**: Displays correlation percentages with other popular benchmarks
35
+ - **💰 Cost Conversion**: Automatic conversion of per-response costs to US cents for better readability (see the conversion sketch after this list)
36
+ - **⚡ Performance Metrics**: Average and P99 latency measurements
37
+ - **🎯 Fail Rate Tracking**: Model reliability metrics (for supported runs)
38
+ - **🔢 Iteration Counts**: Number of evaluations per model (for supported runs)
39
+
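+ For reference, a minimal sketch of the cost conversion, using the same column names and rounding as `app.py` (the model name and cost value are made up):
+ ```python
+ import pandas as pd
+
+ df = pd.DataFrame({"Model": ["example-model"], "Costs (USD)": [0.0123]})  # hypothetical row
+ # 0.0123 USD per response becomes 1.23 cents, rounded to three decimals.
+ df["Avg Cost ($ Cents)"] = (pd.to_numeric(df["Costs (USD)"], errors="coerce") * 100).round(3)
+ print(df[["Model", "Avg Cost ($ Cents)"]])
+ ```
+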
40
+ ## 🚀 How to Use
41
+
42
+ ### Navigation
43
+ 1. **Select a Run**: Use the dropdown menu at the top to choose between available benchmark runs
44
+ 2. **Explore Tabs**: Navigate through different analysis views using the tab interface
45
+ 3. **Interactive Tables**: Sort and filter data by clicking on column headers
46
+ 4. **Hover for Details**: Get additional information by hovering over chart elements
47
+
48
+ ### Understanding the Data
49
+ - **AutoBench Score**: Higher scores indicate better performance
50
+ - **Cost**: Lower values are better (displayed in cents per response)
51
+ - **Latency**: Lower response times are better (average and P99 percentiles)
52
+ - **Fail Rate**: Lower percentages indicate more reliable models
53
+ - **Iterations**: Number of evaluation attempts per model
54
+
55
+ ## 🔧 Adding New Runs
56
+
57
+ ### Directory Structure
58
+ ```
59
+ runs/
60
+ ├── run_YYYY-MM-DD/
61
+ │ ├── metadata.json # Run information and metadata
62
+ │ ├── correlations.json # Benchmark correlation data
63
+ │ ├── summary_data.csv # Main leaderboard data
64
+ │ ├── domain_ranks.csv # Domain-specific rankings
65
+ │ ├── cost_data.csv # Cost breakdown by domain
66
+ │ ├── avg_latency.csv # Average latency by domain
67
+ │ └── p99_latency.csv # P99 latency by domain
68
+ ```
69
+
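+ The app discovers runs by scanning this directory for `metadata.json` files. A minimal sketch of that discovery logic, simplified from `discover_available_runs()` in `app.py`:
+ ```python
+ import json
+ import os
+ from typing import Dict, List
+
+ RUNS_DIR = "runs"
+
+ def discover_runs(runs_dir: str = RUNS_DIR) -> List[Dict]:
+     runs = []
+     for entry in sorted(os.listdir(runs_dir)):
+         metadata_path = os.path.join(runs_dir, entry, "metadata.json")
+         if os.path.isfile(metadata_path):
+             with open(metadata_path) as f:
+                 metadata = json.load(f)
+             metadata["path"] = os.path.join(runs_dir, entry)
+             runs.append(metadata)
+     # Newest run first, matching the app's default selection.
+     runs.sort(key=lambda run: run.get("date", ""), reverse=True)
+     return runs
+ ```
+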
70
+ ### Required Files
71
+
72
+ #### 1. metadata.json
73
+ ```json
74
+ {
75
+ "run_id": "run_2025-08-14",
76
+ "title": "AutoBench Run 3 - August 2025",
77
+ "date": "2025-08-14",
78
+ "description": "Latest AutoBench run with enhanced metrics",
79
+ "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
80
+ "model_count": 34,
81
+ "is_latest": true
82
+ }
83
+ ```
84
+
85
+ #### 2. correlations.json
86
+ ```json
87
+ {
88
+ "correlations": {
89
+ "Chatbot Arena": 82.51,
90
+ "Artificial Analysis Intelligence Index": 83.74,
91
+ "MMLU": 71.51
92
+ },
93
+ "description": "Correlation percentages between AutoBench scores and other benchmark scores"
94
+ }
95
+ ```
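+ The app renders this file as the correlations note shown on the Overall Ranking tab. A minimal sketch of that formatting, simplified from `format_correlations_text()` in `app.py` (the path below is just an example):
+ ```python
+ import json
+
+ def format_correlations(path: str) -> str:
+     with open(path) as f:
+         data = json.load(f)
+     parts = [f"{pct}% with {name}" for name, pct in data.get("correlations", {}).items()]
+     if not parts:
+         return ""
+     return "**Benchmark Correlations:** AutoBench features " + ", ".join(parts) + "."
+
+ print(format_correlations("runs/run_2025-08-14/correlations.json"))
+ ```
+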
96
+
97
+ #### 3. summary_data.csv
98
+ Required columns:
99
+ - `Model`: Model name
100
+ - `AutoBench`: AutoBench score
101
+ - `Costs (USD)`: Cost per response in USD
102
+ - `Avg Answer Duration (sec)`: Average response time
103
+ - `P99 Answer Duration (sec)`: 99th percentile response time
104
+
105
+ Optional columns (for enhanced metrics):
106
+ - `Iterations`: Number of evaluation iterations
107
+ - `Fail Rate %`: Percentage of failed responses
108
+ - `LMArena` or `Chatbot Ar.`: Chatbot Arena scores
109
+ - `MMLU-Pro` or `MMLU Index`: MMLU benchmark scores
110
+ - `AAI Index`: Artificial Analysis Intelligence Index scores
111
+
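+ A minimal sketch that checks a run's summary file for the required columns listed above (the file path is only an example):
+ ```python
+ import pandas as pd
+
+ REQUIRED_COLUMNS = [
+     "Model", "AutoBench", "Costs (USD)",
+     "Avg Answer Duration (sec)", "P99 Answer Duration (sec)",
+ ]
+
+ df = pd.read_csv("runs/run_2025-08-14/summary_data.csv")  # example path
+ missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
+ if missing:
+     raise ValueError(f"summary_data.csv is missing required columns: {missing}")
+ ```
+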
112
+ ### Adding a New Run
113
+
114
+ 1. **Create Directory**: `mkdir runs/run_YYYY-MM-DD`
115
+ 2. **Add Data Files**: Copy your CSV files to the new directory
116
+ 3. **Create Metadata**: Add `metadata.json` with run information
117
+ 4. **Add Correlations**: Create `correlations.json` with benchmark correlations
118
+ 5. **Update Previous Run**: Set `"is_latest": false` in the previous latest run's metadata
119
+ 6. **Restart App**: The new run will be automatically discovered
120
+
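+ A minimal sketch of steps 1-4, assuming it is run from the repository root (the run ID, file locations, and metadata values are placeholders to replace with real ones):
+ ```python
+ import json
+ import os
+ import shutil
+
+ run_id = "run_YYYY-MM-DD"  # placeholder: use the actual run date
+ run_dir = os.path.join("runs", run_id)
+ os.makedirs(run_dir, exist_ok=True)
+
+ # Step 2: copy the CSV files produced by the benchmark run (source paths are examples).
+ for filename in ["summary_data.csv", "domain_ranks.csv", "cost_data.csv",
+                  "avg_latency.csv", "p99_latency.csv"]:
+     shutil.copy(filename, run_dir)
+
+ # Steps 3-4: write metadata.json and correlations.json with placeholder values.
+ with open(os.path.join(run_dir, "metadata.json"), "w") as f:
+     json.dump({"run_id": run_id, "title": "AutoBench Run N", "date": "YYYY-MM-DD",
+                "description": "", "blog_url": "", "model_count": 0,
+                "is_latest": True}, f, indent=2)
+ with open(os.path.join(run_dir, "correlations.json"), "w") as f:
+     json.dump({"correlations": {}, "description": ""}, f, indent=2)
+ ```
+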
121
+ ### Column Compatibility
122
+
123
+ The application automatically adapts to different column structures:
124
+ - **Legacy Runs**: Support basic columns (Model, AutoBench, Cost, Latency)
125
+ - **Enhanced Runs**: Include additional metrics (Iterations, Fail Rate %)
126
+ - **Flexible Naming**: Handles variations in benchmark column names
127
+
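+ A minimal sketch of how these naming variations can be resolved (the same column-name pairs handled by `process_run_data()` in `app.py`):
+ ```python
+ def resolve_benchmark_columns(columns):
+     """Pick whichever benchmark column names are present in a run's summary data."""
+     chatbot_col = next((c for c in columns if c in ("Chatbot Ar.", "LMArena")), None)
+     mmlu_col = next((c for c in columns if c in ("MMLU Index", "MMLU-Pro")), None)
+     return chatbot_col, mmlu_col
+
+ # Example: an enhanced run vs. a legacy run.
+ print(resolve_benchmark_columns(["Model", "AutoBench", "LMArena", "MMLU-Pro"]))
+ print(resolve_benchmark_columns(["Model", "AutoBench", "Chatbot Ar.", "MMLU Index"]))
+ ```
+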
128
+ ## 🛠️ Development
129
+
130
+ ### Requirements
131
+ - Python 3.8+
132
+ - Gradio 5.27.0+
133
+ - Pandas
134
+ - Plotly
135
+
136
+ ### Installation
137
+ ```bash
138
+ pip install -r requirements.txt
139
+ ```
140
+
141
+ ### Running Locally
142
+ ```bash
143
+ python app.py
144
+ ```
145
+
146
+ ### Killing All Python Processes
147
+ ```bash
148
+ taskkill /F /IM python.exe 2>/dev/null || echo "No Python processes to kill"
149
+ ```
150
+
151
+ The app will automatically discover available runs and launch on a local port.
152
+
153
+ ## 📊 Data Sources
154
+
155
+ AutoBench evaluations are conducted using LLM-generated questions across diverse domains, with responses ranked by evaluation LLMs. For more information about the methodology, visit the [AutoBench blog posts](https://huggingface.co/blog/PeterKruger/autobench).
156
+
157
+ ## 📄 License
158
+
159
+ MIT License - see LICENSE file for details.
160
+
161
+ ---
162
+
163
+ Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment options.
app.py CHANGED
@@ -1,15 +1,206 @@
1
  import gradio as gr
2
  import pandas as pd
3
  import plotly.express as px
4
- import os # To check if files exist
5
 
6
  # --- Configuration ---
7
- DATA_DIR = "." # Assume CSV files are in the same directory as app.py
8
- SUMMARY_FILE = os.path.join(DATA_DIR, "data/summary_data.csv")
9
- DOMAIN_RANKS_FILE = os.path.join(DATA_DIR, "data/domain_ranks.csv")
10
- COST_FILE = os.path.join(DATA_DIR, "data/cost_data.csv")
11
- AVG_LATENCY_FILE = os.path.join(DATA_DIR, "data/avg_latency.csv")
12
- P99_LATENCY_FILE = os.path.join(DATA_DIR, "data/p99_latency.csv")
13
 
14
  # --- Helper Function to Load Data ---
15
  def load_data(filepath, separator=','):
@@ -33,281 +224,343 @@ def load_data(filepath, separator=','):
33
  print(f"Error loading {filepath}: {e}")
34
  return pd.DataFrame()
35
 
36
- # --- Load All Data ---
37
- print("Loading data...")
38
- df_summary = load_data(SUMMARY_FILE)
39
- df_domain = load_data(DOMAIN_RANKS_FILE)
40
- df_cost = load_data(COST_FILE)
41
- df_avg_latency = load_data(AVG_LATENCY_FILE)
42
- df_p99_latency = load_data(P99_LATENCY_FILE)
43
- print("Data loading complete.")
44
 
45
- # --- *** NEW: Convert Costs to USD Cents *** ---
46
- COST_COLUMN_SUMMARY = 'Costs (USD)' # IMPORTANT: Check this matches your summary_data.csv header EXACTLY
47
- NEW_COST_COLUMN_SUMMARY = 'Avg Cost ($ Cents)' # This is the new name we'll use
48
-
49
- # Convert summary cost
50
- if not df_summary.empty and COST_COLUMN_SUMMARY in df_summary.columns:
51
- df_summary[COST_COLUMN_SUMMARY] = (pd.to_numeric(df_summary[COST_COLUMN_SUMMARY], errors='coerce') * 100).round(3) # <-- ADDED .round(3)
52
- df_summary.rename(columns={COST_COLUMN_SUMMARY: NEW_COST_COLUMN_SUMMARY}, inplace=True)
53
- print(f"Converted '{COST_COLUMN_SUMMARY}' to $ Cents and renamed to '{NEW_COST_COLUMN_SUMMARY}' in df_summary.")
54
- else:
55
- print(f"Warning: Column '{COST_COLUMN_SUMMARY}' not found in df_summary for conversion.")
56
-
57
- # Convert cost breakdown data
58
- if not df_cost.empty:
59
- # IMPORTANT: Check if your model name column in cost_data.csv is 'model_name' or 'Model Name' etc.
60
- model_col_name = 'model_name' # Adjust if needed
61
- cost_cols = [col for col in df_cost.columns if col != model_col_name]
62
- for col in cost_cols:
63
- # Handle potential non-numeric data gracefully before multiplying
64
- df_cost[col] = (pd.to_numeric(df_cost[col], errors='coerce') * 100).round(3) # <-- ADDED .round(3)
65
- print("Converted cost breakdown columns to $ Cents in df_cost.")
66
- # --- *** End of Cost Conversion *** ---
67
-
68
- # Rename columns for clarity if needed (example for summary)
69
- # Make sure the original names match your CSV headers EXACTLY
70
- try:
71
- df_summary = df_summary.rename(columns={
72
- 'Model Name': 'Model', # If your CSV uses 'Model Name'
73
- # Add other renames here if your CSV headers differ from the target names below
74
- # 'Costs (USD)': 'Avg Cost (USD/response)',
75
- # 'Avg Answer Duration (sec)': 'Avg Latency (s)',
76
- # 'P99 Answer Duration (sec)': 'P99 Latency (s)'
77
- })
78
- # Select and reorder columns for the main table - REMOVED BENCHMARK COLUMNS
79
- summary_cols_display = ['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)', 'P99 Answer Duration (sec)']
80
- # Filter to only columns that actually exist after loading and renaming
81
- summary_cols_display = [col for col in summary_cols_display if col in df_summary.columns]
82
- df_summary_display = df_summary[summary_cols_display].copy() # Use .copy() to avoid SettingWithCopyWarning
83
-
84
- # Select columns for the new benchmark comparison table
85
- benchmark_cols = ['Model', 'AutoBench', 'Chatbot Ar.', 'AAI Index', 'MMLU Index']
86
- benchmark_cols = [col for col in benchmark_cols if col in df_summary.columns] # Filter existing
87
- df_benchmark_display = df_summary[benchmark_cols].copy() # Use .copy()
88
-
89
- # Ensure AutoBench score is numeric for sorting BOTH display tables
90
- if 'AutoBench' in df_summary_display.columns:
91
- df_summary_display['AutoBench'] = pd.to_numeric(df_summary_display['AutoBench'], errors='coerce')
92
- df_summary_display.sort_values(by='AutoBench', ascending=False, inplace=True) # Use inplace=True
93
- else:
94
- print("Warning: 'AutoBench' column not found for sorting summary table.")
95
-
96
- if 'AutoBench' in df_benchmark_display.columns:
97
- df_benchmark_display['AutoBench'] = pd.to_numeric(df_benchmark_display['AutoBench'], errors='coerce')
98
- df_benchmark_display.sort_values(by='AutoBench', ascending=False, inplace=True) # Use inplace=True
99
- else:
100
- print("Warning: 'AutoBench' column not found for sorting benchmark table.")
101
-
102
- except KeyError as e:
103
- print(f"Error preparing display columns: Missing key {e}. Check CSV headers and rename mapping.")
104
- df_summary_display = df_summary.copy() # Fallback
105
- df_benchmark_display = pd.DataFrame() # Fallback to empty for benchmark table
106
 
 
 
107
 
108
  # --- Build Gradio App ---
109
  with gr.Blocks(theme=gr.themes.Soft()) as app:
110
  gr.Markdown("# AutoBench LLM Leaderboard")
111
  gr.Markdown(
112
  "Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
113
- "Includes performance, cost, and latency metrics."
114
- "Data updated on April 25, 2025."
115
- "\n\nMore info for this benchmark run: [AutoBench Run 2 Results](https://huggingface.co/blog/PeterKruger/autobench-2nd-run). "
116
- "If you want to know more about AutoBench: [AutoBench Release](https://huggingface.co/blog/PeterKruger/autobench)."
117
- )
 
 
118
 
119
  # --- Tab 1: Overall Ranking ---
120
  with gr.Tab("Overall Ranking"):
121
  gr.Markdown("## Overall Model Performance")
122
- # REMOVED benchmark correlations from Markdown
123
- gr.Markdown("Models ranked by AutoBench score. Lower cost ($ Cents) and latency (s) are better.")
124
- # Check if df_summary_display has data before rendering
125
- if not df_summary_display.empty:
126
- # Create a copy specifically for this tab's display and rename the column
127
- df_overall_rank_display = df_summary_display.copy()
128
- if 'AutoBench' in df_overall_rank_display.columns:
129
- df_overall_rank_display.rename(columns={'AutoBench': 'Rank'}, inplace=True)
130
-
131
- gr.DataFrame(
132
- df_overall_rank_display, # Pass the renamed DF
133
- # Adjust datatype length based on potentially fewer columns
134
- datatype=['str'] + ['number'] * (len(df_overall_rank_display.columns) - 1),
135
- interactive=True, # Allows sorting
136
- # height=600 # Adjust height as needed
137
- )
138
- else:
139
- gr.Markdown("_(Summary data failed to load or is empty. Please check `summary_data.csv`)_")
140
 
141
- # --- NEW Tab 1.5: Benchmark Comparison ---
142
  with gr.Tab("Benchmark Comparison"):
143
  gr.Markdown("## Benchmark Comparison")
144
  gr.Markdown("Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.")
145
- if not df_benchmark_display.empty:
146
- gr.DataFrame(
147
- df_benchmark_display,
148
- datatype=['str'] + ['number'] * (len(df_benchmark_display.columns) - 1),
149
- interactive=True # Allow sorting
150
- )
151
- else:
152
- gr.Markdown("_(Benchmark comparison data could not be prepared. Check `summary_data.csv` for 'Chatbot Ar.', 'AAI Index', 'MMLU Index' columns.)_")
153
 
154
- # --- Tab 2: Performance Plots ---
155
  with gr.Tab("Performance Plots"):
156
  gr.Markdown("## Performance Visualizations")
157
  gr.Markdown("Exploring relationships between AutoBench Rank, Latency, and Cost.")
158
 
159
- # Scatter Plot 1 (using summary data)
160
  gr.Markdown("### Rank vs. Average Cost")
161
- if not df_summary.empty and 'AutoBench' in df_summary.columns and NEW_COST_COLUMN_SUMMARY in df_summary.columns:
162
- # Filter out rows where essential plot data might be missing
163
- plot_df = df_summary.dropna(subset=['AutoBench', NEW_COST_COLUMN_SUMMARY, 'Model']).copy()
164
- plot_df[NEW_COST_COLUMN_SUMMARY] = pd.to_numeric(plot_df[NEW_COST_COLUMN_SUMMARY], errors='coerce')
165
- plot_df = plot_df.dropna(subset=[NEW_COST_COLUMN_SUMMARY]) # Drop if cost conversion failed
166
-
167
- if not plot_df.empty:
168
- fig_cost = px.scatter(
169
- plot_df,
170
- x=NEW_COST_COLUMN_SUMMARY,
171
- y="AutoBench",
172
- text="Model", # Show model name near point
173
- log_x=True, # Use log scale for cost
174
- title="AutoBench Rank vs. Average Cost per Response ($ Cents - Log Scale)",
175
- labels={'AutoBench': 'AutoBench Rank', NEW_COST_COLUMN_SUMMARY: 'Avg Cost ($ Cents) - Log Scale'},
176
- hover_data=['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)'] # Show details on hover
177
- )
178
- fig_cost.update_traces(textposition='top center')
179
- fig_cost.update_layout(
180
- xaxis_title="Avg Cost ($ Cents) - Log Scale", # Keep bottom axis title
181
- yaxis_title="AutoBench Rank",
182
- width=1000, # Your existing width
183
- height=800, # Your existing height (if you added it)
184
- # --- ADD THE FOLLOWING ---
185
- xaxis2=dict(
186
- overlaying='x', # Link to primary x-axis
187
- matches='x', # Explicitly match primary x-axis properties (like type='log')
188
- side='top', # Position on top
189
- showticklabels=True,# Show the labels (numbers)
190
- showline=True, # Explicitly show the axis line itself
191
- title=None # No title for the top axis
192
- )
193
- # --- END OF ADDITION ---
194
- )
195
- gr.Plot(fig_cost)
196
- else:
197
- gr.Markdown("_(Insufficient valid data for Rank vs Cost plot. Check 'AutoBench' and NEW_COST_COLUMN_SUMMARY columns in `summary_data.csv`)_")
198
- else:
199
- gr.Markdown("_(Summary data failed to load or essential columns missing for Rank vs Cost plot)_")
200
 
201
  # Plot 2: Rank vs Average Latency
202
  gr.Markdown("### Rank vs. Average Latency")
203
- if not df_summary.empty and 'AutoBench' in df_summary.columns and 'Avg Answer Duration (sec)' in df_summary.columns:
204
- # Filter out rows where essential plot data might be missing
205
- plot_df_avg_latency = df_summary.dropna(subset=['AutoBench', 'Avg Answer Duration (sec)', 'Model']).copy()
206
- plot_df_avg_latency['Avg Answer Duration (sec)'] = pd.to_numeric(plot_df_avg_latency['Avg Answer Duration (sec)'], errors='coerce')
207
- plot_df_avg_latency = plot_df_avg_latency.dropna(subset=['Avg Answer Duration (sec)']) # Drop if conversion failed
208
-
209
- if not plot_df_avg_latency.empty:
210
- fig_avg_latency = px.scatter(
211
- plot_df_avg_latency,
212
- x="Avg Answer Duration (sec)",
213
- y="AutoBench",
214
- text="Model",
215
- log_x=True, # Use log scale for latency - adjust if not desired
216
- title="AutoBench Rank vs. Average Latency (Log Scale)",
217
- labels={'AutoBench': 'AutoBench Rank', 'Avg Answer Duration (sec)': 'Avg Latency (s) - Log Scale'},
218
- hover_data=['Model', 'AutoBench', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY]
219
- )
220
- fig_avg_latency.update_traces(textposition='top center')
221
- fig_avg_latency.update_layout(xaxis_title="Avg Latency (s) - Log Scale", yaxis_title="AutoBench Rank", width=1000, height=800)
222
- gr.Plot(fig_avg_latency)
223
- else:
224
- gr.Markdown("_(Insufficient valid data for Rank vs Avg Latency plot. Check 'AutoBench' and 'Avg Answer Duration (sec)' columns in `summary_data.csv`)_")
225
- else:
226
- gr.Markdown("_(Summary data failed to load or essential columns missing for Rank vs Avg Latency plot)_")
227
-
228
 
229
  # Plot 3: Rank vs P99 Latency
230
  gr.Markdown("### Rank vs. P99 Latency")
231
- if not df_summary.empty and 'AutoBench' in df_summary.columns and 'P99 Answer Duration (sec)' in df_summary.columns:
232
- # Filter out rows where essential plot data might be missing
233
- plot_df_p99_latency = df_summary.dropna(subset=['AutoBench', 'P99 Answer Duration (sec)', 'Model']).copy()
234
- plot_df_p99_latency['P99 Answer Duration (sec)'] = pd.to_numeric(plot_df_p99_latency['P99 Answer Duration (sec)'], errors='coerce')
235
- plot_df_p99_latency = plot_df_p99_latency.dropna(subset=['P99 Answer Duration (sec)']) # Drop if conversion failed
236
-
237
- if not plot_df_p99_latency.empty:
238
- fig_p99_latency = px.scatter(
239
- plot_df_p99_latency,
240
- x="P99 Answer Duration (sec)",
241
- y="AutoBench",
242
- text="Model",
243
- log_x=True, # Use log scale for latency - adjust if not desired
244
- title="AutoBench Rank vs. P99 Latency (Log Scale)",
245
- labels={'AutoBench': 'AutoBench Rank', 'P99 Answer Duration (sec)': 'P99 Latency (s) - Log Scale'},
246
- hover_data=['Model', 'AutoBench', 'P99 Answer Duration (sec)', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY]
247
- )
248
- fig_p99_latency.update_traces(textposition='top center')
249
- fig_p99_latency.update_layout(xaxis_title="P99 Latency (s) - Log Scale", yaxis_title="AutoBench Rank", width=1000, height=800)
250
- gr.Plot(fig_p99_latency)
251
- else:
252
- gr.Markdown("_(Insufficient valid data for Rank vs P99 Latency plot. Check 'AutoBench' and 'P99 Answer Duration (sec)' columns in `summary_data.csv`)_")
253
- else:
254
- gr.Markdown("_(Summary data failed to load or essential columns missing for Rank vs P99 Latency plot)_")
255
 
256
- # --- Tab 3: Cost & Latency Analysis ---
257
  with gr.Tab("Cost & Latency Analysis"):
258
  gr.Markdown("## Performance vs. Cost/Latency Trade-offs")
259
 
260
  # Cost Breakdown Table
261
- gr.Markdown("### Cost Breakdown per Domain ($ Cents/Response)") # <-- MODIFIED
262
- if not df_cost.empty:
263
- # Make model name the first column if it exists
264
- if 'model_name' in df_cost.columns:
265
- cols = ['model_name'] + [col for col in df_cost.columns if col != 'model_name']
266
- df_cost_display = df_cost[cols]
267
- else:
268
- df_cost_display = df_cost # Use as is if 'model_name' isn't found
269
- gr.DataFrame(df_cost_display, interactive=True)
270
  else:
271
- gr.Markdown("_(Cost breakdown data failed to load or is empty. Please check `cost_data.csv`)_")
 
272
 
273
  # Latency Breakdown Tables
274
  gr.Markdown("### Average Latency Breakdown per Domain (Seconds)")
275
- if not df_avg_latency.empty:
276
- if 'model_name' in df_avg_latency.columns:
277
- cols = ['model_name'] + [col for col in df_avg_latency.columns if col != 'model_name']
278
- df_avg_latency_display = df_avg_latency[cols]
279
- else:
280
- df_avg_latency_display = df_avg_latency
281
- gr.DataFrame(df_avg_latency_display, interactive=True)
282
  else:
283
- gr.Markdown("_(Average latency data failed to load or is empty. Please check `avg_latency.csv`)_")
 
284
 
285
  gr.Markdown("### P99 Latency Breakdown per Domain (Seconds)")
286
- if not df_p99_latency.empty:
287
- if 'model_name' in df_p99_latency.columns:
288
- cols = ['model_name'] + [col for col in df_p99_latency.columns if col != 'model_name']
289
- df_p99_latency_display = df_p99_latency[cols]
290
- else:
291
- df_p99_latency_display = df_p99_latency
292
- gr.DataFrame(df_p99_latency_display, interactive=True)
293
  else:
294
- gr.Markdown("_(P99 latency data failed to load or is empty. Please check `p99_latency.csv`)_")
 
295
 
296
 
297
- # --- Tab 4: Domain Performance ---
298
  with gr.Tab("Domain Performance"):
299
  gr.Markdown("## Performance Across Different Domains")
300
  gr.Markdown("Model ranks within specific knowledge or task areas. Higher is better.")
301
- if not df_domain.empty:
302
- if 'Model Name' in df_domain.columns:
303
- # Attempt to make Model Name first col
304
- cols = ['Model Name'] + [col for col in df_domain.columns if col != 'Model Name']
305
- df_domain_display = df_domain[cols]
306
- else:
307
- df_domain_display = df_domain # Use as is
308
- gr.DataFrame(df_domain_display, interactive=True)
309
  else:
310
- gr.Markdown("_(Domain ranks data failed to load or is empty. Please check `domain_ranks.csv`)_")
 
311
 
312
  # --- Tab 5: About ---
313
  with gr.Tab("About AutoBench"):
@@ -339,8 +592,36 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
339
 
340
  **Disclaimer:** Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.
341
  """)
 
 
342
 
343
  # --- Launch the App ---
344
  print("Launching Gradio app...")
345
- app.launch()
 
346
  print("Gradio app launched.")
 
1
  import gradio as gr
2
  import pandas as pd
3
  import plotly.express as px
4
+ import os
5
+ import json
6
+ from typing import Dict, List
7
 
8
  # --- Configuration ---
9
+ RUNS_DIR = "runs"
10
+ DATA_DIR = "." # For backward compatibility
11
+ COST_COLUMN_SUMMARY = 'Costs (USD)'
12
+ NEW_COST_COLUMN_SUMMARY = 'Avg Cost ($ Cents)'
13
+
14
+ # --- Multi-Run Support Functions ---
15
+ def discover_available_runs() -> List[Dict]:
16
+ """Scan runs directory and return sorted list of available runs with metadata."""
17
+ runs = []
18
+
19
+ if not os.path.exists(RUNS_DIR):
20
+ # Fallback to old structure
21
+ if os.path.exists("data"):
22
+ return [{
23
+ "run_id": "legacy",
24
+ "title": "AutoBench Run 2 - April 2025",
25
+ "date": "2025-04-25",
26
+ "description": "Current run data",
27
+ "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-2nd-run",
28
+ "model_count": 27,
29
+ "is_latest": True,
30
+ "path": "data"
31
+ }]
32
+ return []
33
+
34
+ for run_dir in os.listdir(RUNS_DIR):
35
+ run_path = os.path.join(RUNS_DIR, run_dir)
36
+ if os.path.isdir(run_path):
37
+ metadata_path = os.path.join(run_path, "metadata.json")
38
+ if os.path.exists(metadata_path):
39
+ try:
40
+ with open(metadata_path, 'r') as f:
41
+ metadata = json.load(f)
42
+ metadata["path"] = run_path
43
+ runs.append(metadata)
44
+ except Exception as e:
45
+ print(f"Error loading metadata for {run_dir}: {e}")
46
+
47
+ # Sort by date, newest first
48
+ runs.sort(key=lambda x: x.get("date", ""), reverse=True)
49
+ return runs
50
+
51
+ def load_run_metadata(run_id: str) -> Dict:
52
+ """Load metadata for a specific run."""
53
+ runs = discover_available_runs()
54
+ for run in runs:
55
+ if run["run_id"] == run_id:
56
+ return run
57
+ return {}
58
+
59
+ def get_run_file_path(run_path: str, filename: str) -> str:
60
+ """Get the full path to a data file for a specific run."""
61
+ return os.path.join(run_path, filename)
62
+
63
+
64
+ def load_correlations(run_path: str) -> Dict:
65
+ """Load correlation data for a specific run."""
66
+ correlations_file = get_run_file_path(run_path, "correlations.json")
67
+ if os.path.exists(correlations_file):
68
+ try:
69
+ with open(correlations_file, 'r') as f:
70
+ return json.load(f)
71
+ except Exception as e:
72
+ print(f"Error loading correlations from {correlations_file}: {e}")
73
+ return {}
74
+
75
+
76
+ def format_correlations_text(correlations_data: Dict) -> str:
77
+ """Format correlation data into a readable text string."""
78
+ if not correlations_data or 'correlations' not in correlations_data:
79
+ return ""
80
+
81
+ correlations = correlations_data['correlations']
82
+ if not correlations:
83
+ return ""
84
+
85
+ # Format the correlation text
86
+ correlation_parts = []
87
+ for benchmark, percentage in correlations.items():
88
+ correlation_parts.append(f"{percentage}% with {benchmark}")
89
+
90
+ if correlation_parts:
91
+ return f"**Benchmark Correlations:** AutoBench features " + ", ".join(correlation_parts) + "."
92
+ return ""
93
+
94
+ def load_run_data(run_id: str) -> Dict[str, pd.DataFrame]:
95
+ """Load all CSV data for a specific run."""
96
+ runs = discover_available_runs()
97
+ run_metadata = None
98
+
99
+ for run in runs:
100
+ if run["run_id"] == run_id:
101
+ run_metadata = run
102
+ break
103
+
104
+ if not run_metadata:
105
+ print(f"Run {run_id} not found")
106
+ return {}
107
+
108
+ run_path = run_metadata["path"]
109
+
110
+ # Load all data files
111
+ data = {}
112
+ file_mapping = {
113
+ "summary": "summary_data.csv",
114
+ "domain": "domain_ranks.csv",
115
+ "cost": "cost_data.csv",
116
+ "avg_latency": "avg_latency.csv",
117
+ "p99_latency": "p99_latency.csv"
118
+ }
119
+
120
+ for key, filename in file_mapping.items():
121
+ filepath = get_run_file_path(run_path, filename)
122
+ data[key] = load_data(filepath)
123
+
124
+ # Process the data (cost conversion, etc.)
125
+ data = process_run_data(data)
126
+
127
+ # Load correlations
128
+ correlations = load_correlations(run_path)
129
+ data["correlations"] = correlations
130
+
131
+ return data
132
+
133
+ def process_run_data(data: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
134
+ """Process and clean the loaded data (cost conversion, sorting, etc.)."""
135
+ df_summary = data.get("summary", pd.DataFrame())
136
+ df_cost = data.get("cost", pd.DataFrame())
137
+
138
+ # Convert costs to USD cents (existing logic)
139
+ if not df_summary.empty and COST_COLUMN_SUMMARY in df_summary.columns:
140
+ df_summary[COST_COLUMN_SUMMARY] = (pd.to_numeric(df_summary[COST_COLUMN_SUMMARY], errors='coerce') * 100).round(3)
141
+ df_summary.rename(columns={COST_COLUMN_SUMMARY: NEW_COST_COLUMN_SUMMARY}, inplace=True)
142
+
143
+ # Convert cost breakdown data
144
+ if not df_cost.empty:
145
+ model_col_name = 'model_name'
146
+ cost_cols = [col for col in df_cost.columns if col != model_col_name]
147
+ for col in cost_cols:
148
+ df_cost[col] = (pd.to_numeric(df_cost[col], errors='coerce') * 100).round(3)
149
+
150
+ # Rename columns and create display dataframes
151
+ try:
152
+ df_summary = df_summary.rename(columns={'Model Name': 'Model'})
153
+
154
+ # Summary display table - include new columns if they exist
155
+ base_cols = ['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)', 'P99 Answer Duration (sec)']
156
+
157
+ # Add new columns at the end: Fail Rate % before Iterations
158
+ if 'Fail Rate %' in df_summary.columns:
159
+ base_cols.append('Fail Rate %')
160
+ if 'Iterations' in df_summary.columns:
161
+ base_cols.append('Iterations')
162
+
163
+ summary_cols_display = [col for col in base_cols if col in df_summary.columns]
164
+ df_summary_display = df_summary[summary_cols_display].copy()
165
+
166
+ # Benchmark display table - handle both old and new column names
167
+ benchmark_cols = ['Model', 'AutoBench']
168
+
169
+ # Handle different column name variations
170
+ chatbot_col = None
171
+ mmlu_col = None
172
+
173
+ for col in df_summary.columns:
174
+ if col in ['Chatbot Ar.', 'LMArena']:
175
+ chatbot_col = col
176
+ elif col in ['MMLU Index', 'MMLU-Pro']:
177
+ mmlu_col = col
178
+
179
+ if chatbot_col:
180
+ benchmark_cols.append(chatbot_col)
181
+ if 'AAI Index' in df_summary.columns:
182
+ benchmark_cols.append('AAI Index')
183
+ if mmlu_col:
184
+ benchmark_cols.append(mmlu_col)
185
+
186
+ benchmark_cols = [col for col in benchmark_cols if col in df_summary.columns]
187
+ df_benchmark_display = df_summary[benchmark_cols].copy()
188
+
189
+ # Sort by AutoBench score
190
+ for df in [df_summary_display, df_benchmark_display]:
191
+ if 'AutoBench' in df.columns:
192
+ df['AutoBench'] = pd.to_numeric(df['AutoBench'], errors='coerce')
193
+ df.sort_values(by='AutoBench', ascending=False, inplace=True)
194
+
195
+ data["summary_display"] = df_summary_display
196
+ data["benchmark_display"] = df_benchmark_display
197
+
198
+ except Exception as e:
199
+ print(f"Error processing display data: {e}")
200
+ data["summary_display"] = df_summary.copy()
201
+ data["benchmark_display"] = pd.DataFrame()
202
+
203
+ return data
204
 
205
  # --- Helper Function to Load Data ---
206
  def load_data(filepath, separator=','):
 
224
  print(f"Error loading {filepath}: {e}")
225
  return pd.DataFrame()
226
 
227
+ # --- Initialize Multi-Run System ---
228
+ print("Discovering available runs...")
229
+ available_runs = discover_available_runs()
230
+ if not available_runs:
231
+ print("No runs found! Please check the runs/ directory structure.")
232
+ exit(1)
233
 
234
+ # Get the latest run as default
235
+ latest_run = available_runs[0]
236
+ print(f"Found {len(available_runs)} run(s). Latest: {latest_run['title']}")
 
 
237
 
238
+ # Initialize with latest run data
239
+ print("Loading latest run data...")
240
+ current_data = load_run_data(latest_run["run_id"])
241
+ print("Data loading complete.")
242
+
243
+ # --- Plotting Functions ---
244
+ def create_cost_scatter_plot(data: Dict[str, pd.DataFrame]) -> tuple:
245
+ """Create the cost vs rank scatter plot."""
246
+ df_summary = data.get("summary", pd.DataFrame())
247
+
248
+ if df_summary.empty or 'AutoBench' not in df_summary.columns or NEW_COST_COLUMN_SUMMARY not in df_summary.columns:
249
+ return None, "_(Insufficient data for Rank vs Cost plot)_"
250
+
251
+ plot_df = df_summary.dropna(subset=['AutoBench', NEW_COST_COLUMN_SUMMARY, 'Model']).copy()
252
+ plot_df[NEW_COST_COLUMN_SUMMARY] = pd.to_numeric(plot_df[NEW_COST_COLUMN_SUMMARY], errors='coerce')
253
+ plot_df = plot_df.dropna(subset=[NEW_COST_COLUMN_SUMMARY])
254
+
255
+ if plot_df.empty:
256
+ return None, "_(No valid data for Rank vs Cost plot)_"
257
+
258
+ fig_cost = px.scatter(
259
+ plot_df,
260
+ x=NEW_COST_COLUMN_SUMMARY,
261
+ y="AutoBench",
262
+ text="Model",
263
+ log_x=True,
264
+ title="AutoBench Rank vs. Average Cost per Response ($ Cents - Log Scale)",
265
+ labels={'AutoBench': 'AutoBench Rank', NEW_COST_COLUMN_SUMMARY: 'Avg Cost ($ Cents) - Log Scale'},
266
+ hover_data=['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)']
267
+ )
268
+ fig_cost.update_traces(textposition='top center')
269
+ fig_cost.update_layout(
270
+ xaxis_title="Avg Cost ($ Cents) - Log Scale",
271
+ yaxis_title="AutoBench Rank",
272
+ width=1000,
273
+ height=800,
274
+ xaxis2=dict(
275
+ overlaying='x',
276
+ matches='x',
277
+ side='top',
278
+ showticklabels=True,
279
+ showline=True,
280
+ title=None
281
+ )
282
+ )
283
+ return fig_cost, ""
284
+
285
+ def create_avg_latency_plot(data: Dict[str, pd.DataFrame]) -> tuple:
286
+ """Create the average latency vs rank scatter plot."""
287
+ df_summary = data.get("summary", pd.DataFrame())
288
+
289
+ if df_summary.empty or 'AutoBench' not in df_summary.columns or 'Avg Answer Duration (sec)' not in df_summary.columns:
290
+ return None, "_(Insufficient data for Rank vs Avg Latency plot)_"
291
+
292
+ plot_df = df_summary.dropna(subset=['AutoBench', 'Avg Answer Duration (sec)', 'Model']).copy()
293
+ plot_df['Avg Answer Duration (sec)'] = pd.to_numeric(plot_df['Avg Answer Duration (sec)'], errors='coerce')
294
+ plot_df = plot_df.dropna(subset=['Avg Answer Duration (sec)'])
295
+
296
+ if plot_df.empty:
297
+ return None, "_(No valid data for Rank vs Avg Latency plot)_"
298
+
299
+ fig_latency = px.scatter(
300
+ plot_df,
301
+ x="Avg Answer Duration (sec)",
302
+ y="AutoBench",
303
+ text="Model",
304
+ log_x=True,
305
+ title="AutoBench Rank vs. Average Latency (Log Scale)",
306
+ labels={'AutoBench': 'AutoBench Rank', 'Avg Answer Duration (sec)': 'Avg Latency (s) - Log Scale'},
307
+ hover_data=['Model', 'AutoBench', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY]
308
+ )
309
+ fig_latency.update_traces(textposition='top center')
310
+ fig_latency.update_layout(
311
+ xaxis_title="Avg Latency (s) - Log Scale",
312
+ yaxis_title="AutoBench Rank",
313
+ width=1000,
314
+ height=800
315
+ )
316
+ return fig_latency, ""
317
+
318
+ def create_p99_latency_plot(data: Dict[str, pd.DataFrame]) -> tuple:
319
+ """Create the P99 latency vs rank scatter plot."""
320
+ df_summary = data.get("summary", pd.DataFrame())
321
+
322
+ if df_summary.empty or 'AutoBench' not in df_summary.columns or 'P99 Answer Duration (sec)' not in df_summary.columns:
323
+ return None, "_(Insufficient data for Rank vs P99 Latency plot)_"
324
+
325
+ plot_df = df_summary.dropna(subset=['AutoBench', 'P99 Answer Duration (sec)', 'Model']).copy()
326
+ plot_df['P99 Answer Duration (sec)'] = pd.to_numeric(plot_df['P99 Answer Duration (sec)'], errors='coerce')
327
+ plot_df = plot_df.dropna(subset=['P99 Answer Duration (sec)'])
328
+
329
+ if plot_df.empty:
330
+ return None, "_(No valid data for Rank vs P99 Latency plot)_"
331
+
332
+ fig_p99 = px.scatter(
333
+ plot_df,
334
+ x="P99 Answer Duration (sec)",
335
+ y="AutoBench",
336
+ text="Model",
337
+ log_x=True,
338
+ title="AutoBench Rank vs. P99 Latency (Log Scale)",
339
+ labels={'AutoBench': 'AutoBench Rank', 'P99 Answer Duration (sec)': 'P99 Latency (s) - Log Scale'},
340
+ hover_data=['Model', 'AutoBench', 'P99 Answer Duration (sec)', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY]
341
+ )
342
+ fig_p99.update_traces(textposition='top center')
343
+ fig_p99.update_layout(
344
+ xaxis_title="P99 Latency (s) - Log Scale",
345
+ yaxis_title="AutoBench Rank",
346
+ width=1000,
347
+ height=800
348
+ )
349
+ return fig_p99, ""
350
+
351
+ def update_leaderboard_data(selected_run_id: str) -> tuple:
352
+ """Update all leaderboard components when run selection changes."""
353
+ if not selected_run_id:
354
+ # Return empty/default values for all outputs
355
+ empty_df = pd.DataFrame()
356
+ return (
357
+ empty_df, empty_df, empty_df, empty_df, empty_df, empty_df, # DataFrames
358
+ None, "", None, "", None, "", # Plots and messages
359
+ "No run selected", "" # Info message, correlations text
360
+ )
361
+
362
+ # Load data for selected run
363
+ data = load_run_data(selected_run_id)
364
+ run_metadata = load_run_metadata(selected_run_id)
365
+
366
+ if not data:
367
+ empty_df = pd.DataFrame()
368
+ return (
369
+ empty_df, empty_df, empty_df, empty_df, empty_df, empty_df,
370
+ None, "Error loading data", None, "Error loading data", None, "Error loading data",
371
+ f"Error loading run: {selected_run_id}", ""
372
+ )
373
+
374
+ # Get DataFrames
375
+ summary_display = data.get("summary_display", pd.DataFrame())
376
+ benchmark_display = data.get("benchmark_display", pd.DataFrame())
377
+ cost_df = data.get("cost", pd.DataFrame())
378
+ avg_latency_df = data.get("avg_latency", pd.DataFrame())
379
+ p99_latency_df = data.get("p99_latency", pd.DataFrame())
380
+ domain_df = data.get("domain", pd.DataFrame())
381
+
382
+ # Create rank display (rename AutoBench to Rank for overall ranking tab)
383
+ overall_rank_display = summary_display.copy()
384
+ if 'AutoBench' in overall_rank_display.columns:
385
+ overall_rank_display.rename(columns={'AutoBench': 'Rank'}, inplace=True)
386
+
387
+ # Prepare cost and latency displays with model_name first
388
+ def prepare_table_display(df, model_col='model_name'):
389
+ if df.empty:
390
+ return df
391
+ if model_col in df.columns:
392
+ cols = [model_col] + [col for col in df.columns if col != model_col]
393
+ return df[cols]
394
+ return df
395
+
396
+ cost_display = prepare_table_display(cost_df)
397
+ avg_latency_display = prepare_table_display(avg_latency_df)
398
+ p99_latency_display = prepare_table_display(p99_latency_df)
399
+
400
+ # Prepare domain display
401
+ domain_display = domain_df.copy()
402
+ if 'Model Name' in domain_display.columns:
403
+ cols = ['Model Name'] + [col for col in domain_display.columns if col != 'Model Name']
404
+ domain_display = domain_display[cols]
405
+
406
+ # Create plots
407
+ cost_plot, cost_msg = create_cost_scatter_plot(data)
408
+ avg_latency_plot, avg_latency_msg = create_avg_latency_plot(data)
409
+ p99_latency_plot, p99_latency_msg = create_p99_latency_plot(data)
410
+
411
+ # Create info message
412
+ info_msg = f"**Current Run:** {run_metadata.get('title', 'Unknown')} ({run_metadata.get('date', 'Unknown date')})"
413
+ if 'model_count' in run_metadata:
414
+ info_msg += f" - {run_metadata['model_count']} models"
415
+
416
+ # Get correlation text
417
+ correlations_text = format_correlations_text(data.get("correlations", {}))
418
+
419
+ return (
420
+ overall_rank_display, benchmark_display, cost_display, avg_latency_display, p99_latency_display, domain_display,
421
+ cost_plot, cost_msg, avg_latency_plot, avg_latency_msg, p99_latency_plot, p99_latency_msg,
422
+ info_msg, correlations_text
423
+ )
424
 
425
  # --- Build Gradio App ---
426
  with gr.Blocks(theme=gr.themes.Soft()) as app:
427
  gr.Markdown("# AutoBench LLM Leaderboard")
428
  gr.Markdown(
429
  "Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
430
+ "Includes performance, cost, and latency metrics. "
431
+ "Use the dropdown below to navigate between different benchmark runs."
432
+ )
433
+
434
+ # --- Navigation Section ---
435
+ with gr.Row():
436
+ with gr.Column(scale=3):
437
+ # Create dropdown choices
438
+ run_choices = [(f"{run['date']} - {run['title']}", run['run_id']) for run in available_runs]
439
+ run_selector = gr.Dropdown(
440
+ choices=run_choices,
441
+ value=latest_run["run_id"],
442
+ label="📊 Select AutoBench Run",
443
+ info="Choose a benchmark run to view its results"
444
+ )
445
+ with gr.Column(scale=2):
446
+ current_run_info = gr.Markdown(
447
+ f"**Current Run:** {latest_run['title']} ({latest_run['date']})" +
448
+ (f" - {latest_run['model_count']} models" if 'model_count' in latest_run else "")
449
+ )
450
+
451
+ gr.Markdown("---")
452
 
453
  # --- Tab 1: Overall Ranking ---
454
  with gr.Tab("Overall Ranking"):
455
  gr.Markdown("## Overall Model Performance")
456
+ gr.Markdown("Models ranked by AutoBench score. Lower cost ($ Cents), latency (s), and fail rate (%) are better. Iterations shows the number of evaluations per model.")
457
+
458
+ # Add correlations display
459
+ initial_correlations = format_correlations_text(current_data.get("correlations", {}))
460
+ correlations_display = gr.Markdown(value=initial_correlations)
461
+
462
+ overall_ranking_table = gr.DataFrame(
463
+ current_data.get("summary_display", pd.DataFrame()).copy().rename(columns={'AutoBench': 'Rank'}) if 'AutoBench' in current_data.get("summary_display", pd.DataFrame()).columns else current_data.get("summary_display", pd.DataFrame()),
464
+ interactive=True,
465
+ label="Overall Rankings"
466
+ )
 
467
 
468
+ # --- Tab 2: Benchmark Comparison ---
469
  with gr.Tab("Benchmark Comparison"):
470
  gr.Markdown("## Benchmark Comparison")
471
  gr.Markdown("Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.")
472
+
473
+ benchmark_comparison_table = gr.DataFrame(
474
+ current_data.get("benchmark_display", pd.DataFrame()),
475
+ interactive=True,
476
+ label="Benchmark Comparison"
477
+ )
478
 
479
+ # --- Tab 3: Performance Plots ---
480
  with gr.Tab("Performance Plots"):
481
  gr.Markdown("## Performance Visualizations")
482
  gr.Markdown("Exploring relationships between AutoBench Rank, Latency, and Cost.")
483
 
484
+ # Scatter Plot 1: Cost vs Rank
485
  gr.Markdown("### Rank vs. Average Cost")
486
+ initial_cost_plot, initial_cost_msg = create_cost_scatter_plot(current_data)
487
+ cost_plot = gr.Plot(value=initial_cost_plot)
488
+ cost_plot_msg = gr.Markdown(value=initial_cost_msg)
 
 
489
 
490
  # Plot 2: Rank vs Average Latency
491
  gr.Markdown("### Rank vs. Average Latency")
492
+ initial_avg_latency_plot, initial_avg_latency_msg = create_avg_latency_plot(current_data)
493
+ avg_latency_plot = gr.Plot(value=initial_avg_latency_plot)
494
+ avg_latency_plot_msg = gr.Markdown(value=initial_avg_latency_msg)
 
 
495
 
496
  # Plot 3: Rank vs P99 Latency
497
  gr.Markdown("### Rank vs. P99 Latency")
498
+ initial_p99_latency_plot, initial_p99_latency_msg = create_p99_latency_plot(current_data)
499
+ p99_latency_plot = gr.Plot(value=initial_p99_latency_plot)
500
+ p99_latency_plot_msg = gr.Markdown(value=initial_p99_latency_msg)
 
 
501
 
502
+ # --- Tab 4: Cost & Latency Analysis ---
503
  with gr.Tab("Cost & Latency Analysis"):
504
  gr.Markdown("## Performance vs. Cost/Latency Trade-offs")
505
 
506
  # Cost Breakdown Table
507
+ gr.Markdown("### Cost Breakdown per Domain ($ Cents/Response)")
508
+ cost_df = current_data.get("cost", pd.DataFrame())
509
+ if not cost_df.empty and 'model_name' in cost_df.columns:
510
+ cols = ['model_name'] + [col for col in cost_df.columns if col != 'model_name']
511
+ initial_cost_display = cost_df[cols]
 
512
  else:
513
+ initial_cost_display = cost_df
514
+ cost_breakdown_table = gr.DataFrame(
515
+ value=initial_cost_display,
516
+ interactive=True,
517
+ label="Cost Breakdown"
518
+ )
519
 
520
  # Latency Breakdown Tables
521
  gr.Markdown("### Average Latency Breakdown per Domain (Seconds)")
522
+ avg_latency_df = current_data.get("avg_latency", pd.DataFrame())
523
+ if not avg_latency_df.empty and 'model_name' in avg_latency_df.columns:
524
+ cols = ['model_name'] + [col for col in avg_latency_df.columns if col != 'model_name']
525
+ initial_avg_latency_display = avg_latency_df[cols]
 
526
  else:
527
+ initial_avg_latency_display = avg_latency_df
528
+ avg_latency_breakdown_table = gr.DataFrame(
529
+ value=initial_avg_latency_display,
530
+ interactive=True,
531
+ label="Average Latency Breakdown"
532
+ )
533
 
534
  gr.Markdown("### P99 Latency Breakdown per Domain (Seconds)")
535
+ p99_latency_df = current_data.get("p99_latency", pd.DataFrame())
536
+ if not p99_latency_df.empty and 'model_name' in p99_latency_df.columns:
537
+ cols = ['model_name'] + [col for col in p99_latency_df.columns if col != 'model_name']
538
+ initial_p99_latency_display = p99_latency_df[cols]
 
539
  else:
540
+ initial_p99_latency_display = p99_latency_df
541
+ p99_latency_breakdown_table = gr.DataFrame(
542
+ value=initial_p99_latency_display,
543
+ interactive=True,
544
+ label="P99 Latency Breakdown"
545
+ )
546
 
547
 
548
+ # --- Tab 5: Domain Performance ---
549
  with gr.Tab("Domain Performance"):
550
  gr.Markdown("## Performance Across Different Domains")
551
  gr.Markdown("Model ranks within specific knowledge or task areas. Higher is better.")
552
+
553
+ domain_df = current_data.get("domain", pd.DataFrame())
554
+ if not domain_df.empty and 'Model Name' in domain_df.columns:
555
+ cols = ['Model Name'] + [col for col in domain_df.columns if col != 'Model Name']
556
+ initial_domain_display = domain_df[cols]
 
557
  else:
558
+ initial_domain_display = domain_df
559
+ domain_performance_table = gr.DataFrame(
560
+ value=initial_domain_display,
561
+ interactive=True,
562
+ label="Domain Performance"
563
+ )
564
 
565
  # --- Tab 5: About ---
566
  with gr.Tab("About AutoBench"):
 
592
 
593
  **Disclaimer:** Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.
594
  """)
595
+
596
+ # --- Event Handlers ---
597
+ # Set up reactive data loading when run selection changes
598
+ run_selector.change(
599
+ fn=update_leaderboard_data,
600
+ inputs=[run_selector],
601
+ outputs=[
602
+ overall_ranking_table,
603
+ benchmark_comparison_table,
604
+ cost_breakdown_table,
605
+ avg_latency_breakdown_table,
606
+ p99_latency_breakdown_table,
607
+ domain_performance_table,
608
+ cost_plot,
609
+ cost_plot_msg,
610
+ avg_latency_plot,
611
+ avg_latency_plot_msg,
612
+ p99_latency_plot,
613
+ p99_latency_plot_msg,
614
+ current_run_info,
615
+ correlations_display
616
+ ]
617
+ )
618
+
619
+ # Note: Initial data is already loaded via value parameters above
620
 
621
  # --- Launch the App ---
622
  print("Launching Gradio app...")
623
+ app.launch(
624
+ favicon_path="static/manifest.json" if os.path.exists("static/manifest.json") else None,
625
+ show_error=True
626
+ )
627
  print("Gradio app launched.")
{data → runs/run_2025-04-25}/avg_latency.csv RENAMED
File without changes
runs/run_2025-04-25/correlations.json ADDED
@@ -0,0 +1,8 @@
 
1
+ {
2
+ "correlations": {
3
+ "Chatbot Arena": 82.51,
4
+ "Artificial Analysis Intelligence Index": 83.74,
5
+ "MMLU-Plus": 71.51
6
+ },
7
+ "description": "Correlation percentages between AutoBench scores and other benchmark scores"
8
+ }
{data → runs/run_2025-04-25}/cost_data.csv RENAMED
File without changes
{data → runs/run_2025-04-25}/domain_ranks.csv RENAMED
File without changes
runs/run_2025-04-25/metadata.json ADDED
@@ -0,0 +1,9 @@
 
1
+ {
2
+ "run_id": "run_2025-04-25",
3
+ "title": "AutoBench Run 2 - April 2025",
4
+ "date": "2025-04-25",
5
+ "description": "Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.",
6
+ "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-2nd-run",
7
+ "model_count": 27,
8
+ "is_latest": false
9
+ }
{data → runs/run_2025-04-25}/p99_latency.csv RENAMED
File without changes
{data → runs/run_2025-04-25}/summary_data.csv RENAMED
File without changes
runs/run_2025-08-14/avg_latency.csv ADDED
@@ -0,0 +1,34 @@
 
 
1
+ model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
2
+ claude-3.5-haiku,17.2096,9.5021,13.1592,11.2295,9.2722,13.7099,7.9069,9.5013,10.8395,11.2754,11.51902452
3
+ claude-opus-4-1,97.988,31.7501,44.9092,36.3513,30.166,53.8647,38.4643,32.1546,48.1318,52.2831,48.62490598
4
+ claude-sonnet-4,75.4533,18.119,34.977,23.5653,18.9687,35.1883,19.9602,24.1877,36.5889,34.5893,33.66639032
5
+ deepSeek-R1-0528,238.2024,38.1537,62.9681,48.2227,51.7076,61.9942,302.2976,271.5234,62.5406,66.1298,119.174235
6
+ deepSeek-V3-0324,55.0977,17.8165,32.0701,23.041,24.2812,31.001,88.9007,51.2641,29.3776,43.4225,40.30336432
7
+ gemini-2.5-flash,95.8841,16.7289,32.1003,18.3471,24.8943,29.1208,133.8567,52.6201,31.5185,36.1925,48.7078753
8
+ gemini-2.5-flash-lite,26.1777,2.8563,5.6249,3.6215,3.9845,6.8374,86.5258,41.5112,6.4054,9.5453,19.15509939
9
+ gemini-2.5-pro,83.393,29.2572,49.1594,36.6978,36.9932,47.2089,166.9207,93.4631,49.4694,52.5929,65.03115036
10
+ gemma-3-27b-it,62.2708,12.4686,27.5139,18.3155,19.2946,25.0176,24.3704,40.4719,35.3548,23.2773,29.7215
11
+ GLM-4.5,147.0394,20.9164,43.4854,31.8055,38.4103,42.5121,224.8967,165.5218,43.1608,49.7589,80.74437254
12
+ GLM-4.5-Air,105.9142,13.7172,31.2173,19.5371,45.6936,30.9208,206.3031,140.9465,38.1786,44.156,68.34050587
13
+ gpt-4.1,58.0164,11.4923,24.717,14.4706,15.9619,23.1579,80.4132,46.8127,21.9879,23.3983,32.86274006
14
+ gpt-5,120.9672,50.0373,78.8355,55.0956,56.6508,73.3966,156.5955,151.2006,83.2029,76.5455,89.99818067
15
+ gpt-5-mini,102.5153,25.7197,48.4212,31.5236,35.4217,44.406,159.7168,84.2748,52.1692,56.7156,65.89701176
16
+ gpt-5-nano,98.4218,38.2174,52.23,38.2242,48.2844,54.4789,136.6573,86.4832,48.3263,53.912,66.4959839
17
+ gpt-oss-120b,49.9796,11.3648,28.9885,18.2911,14.4024,22.6241,25.4515,36.6005,27.8547,28.8135,27.00733404
18
+ grok-3-mini,45.7707,12.9635,17.7188,13.6096,20.4385,20.4502,40.5232,32.54,25.8136,23.9,26.12147499
19
+ grok-4,92.6961,28.4581,49.7663,33.7181,41.4523,48.4394,138.4335,124.3858,48.7183,50.103,60.95525411
20
+ Kimi-K2-Instruct,69.4739,29.2635,65.4032,45.9564,46.6415,75.2439,114.3645,58.6696,72.5513,57.1711,65.0222057
21
+ llama-3_1-Nemotron-Ultra-253B-v1,88.5095,28.8905,29.2174,22.7454,32.2121,36.1473,174.681,134.8929,38.2931,32.3295,61.53657957
22
+ Llama-3_3-Nemotron-Super-49B-v1,46.3574,15.1811,28.7779,22.9057,21.5446,27.2551,67.4261,33.0439,31.8722,25.0215,32.63831081
23
+ llama-4-maverick,21.5707,4.76,7.8189,5.8389,6.1681,8.3187,21.0604,11.3275,8.7158,6.8338,10.65014104
24
+ llama-4-Scout-17B-16E-Instruct,20.2177,6.4339,9.1924,7.529,7.9851,9.8133,11.1171,13.0721,12.2545,8.2341,10.86684261
25
+ magistral-small-2506,11.4551,7.1952,7.5178,5.7532,6.1051,8.6988,79.2722,37.9617,7.0901,7.2786,17.53939687
26
+ mistral-large-2411,51.7739,14.2815,23.6025,17.3517,13.3736,25.8432,18.1355,24.6234,25.1484,21.9005,24.36368715
27
+ nova-lite-v1,7.1014,4.7882,5.846,4.7061,4.3402,5.5093,4.8806,4.861,4.9134,5.3275,5.288625128
28
+ nova-pro-v1,12.4833,7.5838,7.52,5.6658,6.5254,7.3418,5.8645,7.2712,6.6838,6.7792,7.528192069
29
+ o3,70.4202,25.9427,46.4619,29.613,26.5293,42.6644,194.8085,112.9362,41.2548,46.7826,63.89621339
30
+ o4-mini,56.98,16.3976,26.7274,19.8134,21.7084,23.2641,116.3436,41.5349,28.6513,26.4233,39.05469579
31
+ phi-4,10.8373,5.9498,6.7808,6.3085,5.9981,7.1457,7.7569,12.1669,7.0431,7.4096,7.744667446
32
+ Qwen3-14B,67.7342,19.9239,31.3204,32.178,32.2363,31.2024,197.4205,132.5492,40.1656,31.5221,61.11544056
33
+ Qwen3-235B-A22B-Thinking-2507,180.1429,33.7386,65.2237,45.3004,54.6603,53.7109,122.6611,138.2941,60.427,72.9058,78.79346155
34
+ Qwen3-30B-A3B,119.9895,27.907,34.7837,25.8461,38.8109,37.0577,204.2344,157.6709,38.1969,41.6341,72.64171253
runs/run_2025-08-14/correlations.json ADDED
@@ -0,0 +1,8 @@
 
1
+ {
2
+ "correlations": {
3
+ "LMArena": 86.85,
4
+ "Artificial Analysis Intelligence Index": 92.17,
5
+ "MMLU": 75.44
6
+ },
7
+ "description": "Correlation percentages between AutoBench scores and other benchmark scores"
8
+ }
runs/run_2025-08-14/cost_data.csv ADDED
@@ -0,0 +1,34 @@
 
 
1
+ model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
2
+ claude-3.5-haiku,0.01433364,0.00596023,0.00783403,0.00687656,0.00700846,0.00828981,0.00670457,0.0086506,0.00762436,0.00770914,0.008262832
3
+ claude-opus-4-1,0.1854432,0.058055,0.07953592,0.07093359,0.06338077,0.091515,0.07756427,0.08967857,0.08113929,0.08624571,0.091256434
4
+ claude-sonnet-4,0.03735534,0.00986831,0.01471113,0.01164281,0.01133654,0.01593981,0.01523071,0.0181211,0.01523195,0.01548664,0.017099466
5
+ deepSeek-R1-0528,0.01314631,0.00240864,0.00361531,0.0028925,0.00321982,0.00348657,0.01360484,0.01570098,0.00335462,0.00345558,0.006382309
6
+ deepSeek-V3-0324,0.00179803,0.00076659,0.0010843,0.00076717,0.0009645,0.00102303,0.00176716,0.00161885,0.00101519,0.00100663,0.00119639
7
+ gemini-2.5-flash,0.01009034,0.00149757,0.00386032,0.00235404,0.00327702,0.00373479,0.00399646,0.00651437,0.0042002,0.00425649,0.004512314
8
+ gemini-2.5-flash-lite,0.00162929,0.00022925,0.00043297,0.0002676,0.00036001,0.00074135,0.00284197,0.00300094,0.00046755,0.00076045,0.001052718
9
+ gemini-2.5-pro,0.0277487,0.0072651,0.01509773,0.01122996,0.01217529,0.01489145,0.01518476,0.02214276,0.0161067,0.01515628,0.015866994
10
+ gemma-3-27b-it,0.00044567,0.00017776,0.00028088,0.00018688,0.00021991,0.00027132,0.00025736,0.0004104,0.00026623,0.00026777,0.00028134
11
+ GLM-4.5,0.01215121,0.00224537,0.00373941,0.00300118,0.00374437,0.0037643,0.01209785,0.01441371,0.00379349,0.00407664,0.00629521
12
+ GLM-4.5-Air,0.0064693,0.00115415,0.00196991,0.0013692,0.00310372,0.00188536,0.00667506,0.00810732,0.00243774,0.00270236,0.003611243
13
+ gpt-4.1,0.0127812,0.00475815,0.00799795,0.00526612,0.00723564,0.00747097,0.01396449,0.01694627,0.007576,0.0074939,0.009144648
14
+ gpt-5,0.06007105,0.02747785,0.03786611,0.03067602,0.03328859,0.03684483,0.0619945,0.07589079,0.03760841,0.03823268,0.043676351
15
+ gpt-5-mini,0.00837399,0.00360921,0.00562431,0.00423233,0.00503336,0.00521421,0.00885311,0.01098604,0.00557968,0.00584673,0.006324841
16
+ gpt-5-nano,0.00322071,0.00188066,0.00179711,0.00158044,0.00258222,0.00217255,0.00350634,0.00381871,0.00171226,0.00193456,0.002414388
17
+ gpt-oss-120b,0.00204009,0.00075235,0.0013959,0.00103317,0.00088107,0.00124919,0.00158693,0.00206557,0.00120019,0.00136573,0.001361942
18
+ grok-3-mini,0.00135005,0.00053652,0.000744,0.00052427,0.00074544,0.00073632,0.00146137,0.00130681,0.00070784,0.00073552,0.000895661
19
+ grok-4,0.0510268,0.01488985,0.02296405,0.01615106,0.02757851,0.02201092,0.05122846,0.05969514,0.02105693,0.02263429,0.029202342
20
+ Kimi-K2-Instruct,0.0028794,0.00174574,0.00235794,0.00221802,0.00168557,0.00288834,0.00275612,0.00216027,0.00241402,0.00230324,0.002379296
21
+ llama-3_1-Nemotron-Ultra-253B-v1,0.00518958,0.00210383,0.00166994,0.00144186,0.00232614,0.00206072,0.00771408,0.0084011,0.00209152,0.00188066,0.003451924
22
+ Llama-3_3-Nemotron-Super-49B-v1,0.00061338,0.00029891,0.0003921,0.00035633,0.00037593,0.00042511,0.00067755,0.00059406,0.00041489,0.00037202,0.000455359
23
+ llama-4-maverick,0.00073583,0.0003091,0.00044116,0.00034576,0.0004487,0.00045704,0.00067601,0.00068581,0.00041993,0.00040759,0.000496574
24
+ llama-4-Scout-17B-16E-Instruct,0.00052871,0.00029487,0.00041163,0.00037054,0.00037521,0.00042884,0.00043854,0.00049328,0.00037896,0.00036888,0.000410484
25
+ magistral-small-2506,0.00209764,0.00099426,0.00100711,0.00073844,0.00105969,0.00101964,0.00613572,0.00519953,0.0010145,0.00099893,0.0019781
26
+ mistral-large-2411,0.00963771,0.004614,0.00608889,0.00459562,0.00456308,0.00608816,0.00568067,0.0072976,0.00577418,0.00586219,0.006101138
27
+ nova-lite-v1,0.00028518,0.00013034,0.0001725,0.00012972,0.00014658,0.00016856,0.00020718,0.0002494,0.00016274,0.00015855,0.00018322
28
+ nova-pro-v1,0.00286661,0.00146443,0.00169594,0.00126732,0.00182435,0.00161185,0.00167262,0.00247683,0.00151393,0.00145992,0.001800498
29
+ o3,0.01830136,0.00943369,0.01515805,0.01014862,0.01051015,0.0133993,0.03758871,0.05157964,0.01258745,0.01355057,0.018504711
30
+ o4-mini,0.01033712,0.00629079,0.00714936,0.00606561,0.00734834,0.00687444,0.01597855,0.01248218,0.0066754,0.00744339,0.008704364
31
+ phi-4,0.00034647,0.00019431,0.00021457,0.00017458,0.00019261,0.00022752,0.00024432,0.00038072,0.00021456,0.00020954,0.000240436
32
+ Qwen3-14B,0.00098626,0.00036762,0.00045077,0.00043898,0.00056254,0.00047318,0.0018592,0.00189101,0.0005008,0.00047842,0.000789184
33
+ Qwen3-235B-A22B-Thinking-2507,0.00447462,0.00278202,0.00439546,0.003571,0.0041131,0.00377739,0.00673892,0.00660976,0.00369867,0.00371842,0.00416518
34
+ Qwen3-30B-A3B,0.00120847,0.00039109,0.00046062,0.00038347,0.00054571,0.00046903,0.00150189,0.00185211,0.00046031,0.0004507,0.000763337
runs/run_2025-08-14/domain_ranks.csv ADDED
@@ -0,0 +1,34 @@
+ model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
+ claude-3.5-haiku,3.4733,3.8572,3.7443,3.995,3.7371,3.8701,2.8162,2.7809,3.736,3.8134,3.586292962
+ claude-opus-4-1,4.2931,4.5071,4.3035,4.4302,4.2258,4.441,3.5738,3.5758,4.4164,4.4833,4.239909895
+ claude-sonnet-4,4.1894,4.3647,4.3026,4.3258,4.2532,4.3497,3.5475,3.4817,4.3953,4.3884,4.171968576
+ deepSeek-R1-0528,3.9481,4.3147,4.3493,4.4062,4.3139,4.4007,3.5649,3.6287,4.3876,4.4032,4.18112906
+ deepSeek-V3-0324,3.8756,4.1946,4.0724,4.0561,4.0467,4.0888,3.3667,3.4401,4.1442,4.0976,3.945669087
+ gemini-2.5-flash,4.4225,4.1694,4.3729,4.3287,4.3781,4.4165,4.0091,4.2283,4.4283,4.3877,4.32099389
+ gemini-2.5-flash-lite,4.1092,4.1468,4.0836,4.1655,4.0513,4.1563,3.3399,3.546,4.2419,4.1944,4.017202952
+ gemini-2.5-pro,4.5248,4.3916,4.4224,4.4873,4.4508,4.477,4.0868,4.2425,4.5154,4.516,4.416904571
+ gemma-3-27b-it,3.5655,4.2891,4.112,4.1677,3.973,4.1788,3.0395,3.0841,4.1903,4.1951,3.881640548
+ GLM-4.5,3.8921,4.2566,4.3827,4.4219,4.3206,4.4692,3.4848,3.4666,4.4781,4.4931,4.176558031
+ GLM-4.5-Air,3.8049,3.993,4.1921,4.193,4.0296,4.2921,3.4191,3.3372,4.2467,4.2717,3.98464018
+ gpt-4.1,4.2419,4.3243,4.2551,4.186,4.2089,4.2291,3.7372,3.7882,4.2677,4.3183,4.165890881
+ gpt-5,4.5821,4.5178,4.5866,4.6213,4.359,4.6423,4.2072,4.1718,4.6466,4.6634,4.511567341
+ gpt-5-mini,4.545,4.5442,4.4995,4.5239,4.442,4.5635,4.1788,4.2467,4.6203,4.6302,4.486571107
+ gpt-5-nano,4.4143,4.3848,4.3524,4.403,4.3042,4.4475,3.88,4.134,4.3676,4.5159,4.325926956
+ gpt-oss-120b,4.612,4.4161,4.5248,4.4229,4.4453,4.5703,4.162,4.2461,4.6282,4.634,4.479287977
+ grok-3-mini,4.0184,4.1848,4.1622,4.2055,4.1589,4.2079,3.5142,3.4923,4.2585,4.263,4.055940505
+ grok-4,4.3075,4.3543,4.3302,4.3823,4.3368,4.3983,4.0111,3.8461,4.4058,4.4033,4.308828831
+ Kimi-K2-Instruct,4.119,4.5362,4.2929,4.3457,4.1929,4.5009,3.3958,3.5199,4.4592,4.4092,4.177138663
+ llama-3_1-Nemotron-Ultra-253B-v1,3.7715,4.268,4.1844,4.2204,4.0862,4.2596,3.4729,3.434,4.2285,4.2351,4.020264345
+ Llama-3_3-Nemotron-Super-49B-v1,3.8343,4.0472,4.0449,4.1217,3.9872,4.1624,3.0436,3.2692,4.1175,4.1322,3.883310532
+ llama-4-maverick,3.5884,3.7355,3.7029,3.7833,3.8314,3.8241,3.1303,3.0961,3.8317,3.7879,3.640194992
+ llama-4-Scout-17B-16E-Instruct,3.3725,3.8585,3.6597,3.8316,3.8462,3.8386,3.0535,3.0033,3.8224,3.8368,3.614481399
+ magistral-small-2506,3.7448,3.2301,3.9232,3.8931,3.8409,3.9707,3.2159,3.2791,4.0028,3.941,3.713933337
+ mistral-large-2411,3.4967,3.9724,3.8286,3.9329,3.7992,3.919,3.1123,3.1132,3.9871,3.9484,3.714675671
+ nova-lite-v1,3.322,3.7767,3.7078,3.7683,3.5057,3.7565,2.9917,2.9507,3.8237,3.7503,3.538201832
+ nova-pro-v1,3.3633,3.8403,3.5455,3.723,3.5315,3.633,2.9588,2.8514,3.7492,3.6051,3.490835422
+ o3,4.4254,4.2963,4.5626,4.4951,4.3871,4.5722,3.9576,4.1618,4.579,4.6123,4.409851586
+ o4-mini,4.3056,4.2389,4.3587,4.3787,4.2565,4.3495,3.8986,3.8362,4.4838,4.5318,4.27410734
+ phi-4,3.4825,3.9651,3.7302,3.849,3.6624,3.8171,3.1286,3.1995,3.8704,3.8465,3.657791802
+ Qwen3-14B,3.833,4.2818,4.1127,4.0911,4.0235,4.149,3.4441,3.3376,4.2349,4.1606,3.976245179
+ Qwen3-235B-A22B-Thinking-2507,4.3112,4.4415,4.4798,4.5551,4.4452,4.5117,3.8366,3.9413,4.4376,4.5403,4.394399183
+ Qwen3-30B-A3B,3.8019,4.1652,4.0949,4.1569,3.9359,4.145,3.4857,3.3944,4.1322,4.1468,3.952481327
runs/run_2025-08-14/metadata.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "run_id": "run_2025-08-14",
+ "title": "AutoBench Run 3 - August 2025",
+ "date": "2025-08-14",
+ "description": "Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates",
+ "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
+ "model_count": 34,
+ "is_latest": true
+ }
runs/run_2025-08-14/p99_latency.csv ADDED
@@ -0,0 +1,34 @@
+ model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
+ GLM-4.5,336.5838,46.1539,115.4247,61.5081,183.8738,94.3667,955.208,354.6674,147.6845,164.993,246.0464
+ GLM-4.5-Air,294.8103,49.5964,75.2865,47.3232,188.0317,122.7531,934.9063,326.1799,192.6793,173.4233,240.499
+ Kimi-K2-Instruct,328.1743,127.0559,498.215,155.9084,411.2174,374.8414,919.1317,342.6293,464.4898,283.0215,390.4685
+ Llama-3_3-Nemotron-Super-49B-v1,215.9947,28.1436,78.589,55.0476,69.6513,64.6465,672.344,82.6381,150.7472,96.6115,151.4413
+ Qwen3-14B,291.1851,58.0622,88.5603,117.7344,88.0704,115.429,952.7636,353.0349,203.2735,124.2363,239.235
+ Qwen3-235B-A22B-Thinking-2507,666.7428,62.6109,184.4234,141.1792,188.1752,130.8262,431.5208,488.4891,231.8124,312.6447,283.8425
+ Qwen3-30B-A3B,302.3182,64.1154,92.9827,77.8548,121.5989,100.3155,973.6302,352.3788,165.0523,180.4958,243.0743
+ claude-3.5-haiku,41.3124,14.9481,33.5752,18.5466,17.7297,36.6532,15.8231,40.0595,17.8959,16.9296,25.3473
+ claude-opus-4-1,411.8235,66.277,99.7769,67.8099,73.13,140.073,240.5076,85.4884,194.4168,172.176,155.1479
+ claude-sonnet-4,372.4893,48.2756,92.4428,50.91,54.3981,124.7965,53.7468,57.4402,181.6048,159.8677,119.5972
+ deepSeek-R1-0528,516.7523,74.2918,114.0859,72.9274,112.0078,127.0512,839.4571,432.4486,182.3988,184.1239,265.5545
+ deepSeek-V3-0324,186.8884,51.2986,88.1105,69.8525,65.7204,121.314,755.2438,202.0879,100.6333,355.9609,199.711
+ gemini-2.5-flash,712.3231,38.3328,103.7981,43.5623,74.993,117.5583,944.4005,122.8572,135.3642,147.9211,244.1111
+ gemini-2.5-flash-lite,190.8247,6.2927,19.3252,12.1569,13.2804,47.8915,602.4712,222.2597,49.1969,110.7355,127.4435
+ gemini-2.5-pro,240.7699,52.3828,97.3472,57.9532,71.3161,111.1321,714.4278,300.57,170.1939,177.3255,199.3419
+ gemma-3-27b-it,375.5933,26.3165,60.8123,46.1448,66.8804,99.0397,180.0912,228.1774,193.2272,68.8631,134.5146
+ gpt-4.1,373.0435,20.9727,103.6544,34.6418,45.1164,68.0267,580.148,268.4173,151.755,161.6432,180.7419
+ gpt-5,379.0229,104.1814,151.0171,109.2785,163.0493,141.3236,655.5064,536.5325,304.2283,232.5815,277.6722
+ gpt-5-mini,420.1856,55.5139,107.6471,65.403,94.1406,96.4831,710.367,304.143,221.4093,238.5054,231.3798
+ gpt-5-nano,452.6803,63.8721,123.3952,95.2713,98.3822,131.3145,649.7221,349.2375,145.5956,209.7373,231.9208
+ gpt-oss-120b,219.9099,40.9979,88.2543,59.7898,50.0398,66.35,154.4168,213.958,154.396,143.3909,119.1503
+ grok-3-mini,324.7266,28.8678,38.3573,27.4405,58.7259,56.7006,303.2076,79.7071,164.6423,78.6302,116.1006
+ grok-4,330.6722,73.9998,112.4368,75.3662,148.4834,118.3984,908.2874,484.3592,205.0356,168.1656,262.5205
+ llama-3_1-Nemotron-Ultra-253B-v1,299.2445,64.7177,80.6145,53.2406,111.7416,114.9641,677.2227,364.4893,179.9651,73.4696,201.967
+ llama-4-Scout-17B-16E-Instruct,119.4443,17.908,21.721,15.137,15.5893,21.4368,21.6394,35.6977,109.5036,18.1442,39.6221
+ llama-4-maverick,258.2904,12.3067,28.3245,14.1619,15.9541,23.4693,237.7464,50.0955,44.6007,26.4254,71.1375
+ magistral-small-2506,50.6671,23.6896,23.0028,17.8342,14.257,27.6318,461.2929,227.3066,22.4139,27.139,89.5235
+ mistral-large-2411,320.7227,28.5833,76.1094,50.5788,34.0307,104.2414,69.1314,52.9922,161.3657,71.1036,96.8859
+ nova-lite-v1,17.362,9.0387,11.6702,10.1896,8.0435,9.778,8.2672,7.7491,9.4719,11.1956,10.2766
+ nova-pro-v1,55.831,13.3815,14.2866,9.7714,24.3369,15.6141,14.2894,23.7601,15.2509,15.0352,20.1557
+ o3,370.6262,215.2039,126.7157,84.4048,96.7106,130.4733,970.1118,427.7559,179.8835,165.4601,276.7346
+ o4-mini,317.1998,49.1399,78.8689,45.2952,67.6997,52.9028,768.0834,246.0076,143.2397,86.9799,185.5417
+ phi-4,28.1176,10.3654,12.3853,13.4812,13.4604,12.3491,14.116,39.9468,13.6159,34.0316,19.1869
runs/run_2025-08-14/summary_data.csv ADDED
@@ -0,0 +1,34 @@
+ Model,Iterations,AutoBench,LMArena,AAI Index,MMLU-Pro,Costs (USD),Avg Answer Duration (sec),P99 Answer Duration (sec),Fail Rate %
+ claude-3.5-haiku,393,3.586292962,1317,23326,0.634,0.008262832,11.51902452,17.98,4.15%
+ claude-opus-4-1,387,4.239909895,1446,58830,,0.091256434,48.62490598,32.86,5.61%
+ claude-sonnet-4,393,4.171968576,1399,61000,0.842,0.017099466,33.66639032,82.6,4.15%
+ deepSeek-R1-0528,385,4.18112906,1418,58740,0.849,0.006382309,119.174235,223.47,6.10%
+ deepSeek-V3-0324,392,3.945669087,1390,43990,0.819,0.00119639,40.30336432,106.53,4.39%
+ gemini-2.5-flash,387,4.32099389,1409,58430,0.759,0.004512314,48.7078753,140.54,5.61%
+ gemini-2.5-flash-lite,389,4.017202952,1351,44348,0.832,0.001052718,19.15509939,8.82,5.12%
+ gemini-2.5-pro,388,4.416904571,1458,64630,0.862,0.015866994,65.03115036,64.18,5.37%
+ gemma-3-27b-it,393,3.881640548,1363,25220,0.669,0.00028134,29.7215,79.12,4.15%
+ GLM-4.5,389,4.176558031,1414,56080,0.835,0.00629521,80.74437254,29.19,5.12%
+ GLM-4.5-Air,392,3.98464018,1379,49475,0.815,0.003611243,68.34050587,21.75,4.39%
+ gpt-4.1,392,4.165890881,1406,46770,0.806,0.009144648,32.86274006,23.32,4.39%
+ gpt-5,385,4.511567341,1481,68950,0.871,0.043676351,89.99818067,69.79,6.10%
+ gpt-5-mini,392,4.486571107,,63700,0.828,0.006324841,65.89701176,48.74,4.39%
+ gpt-5-nano,390,4.325926956,,53780,0.772,0.002414388,66.4959839,73.7,4.88%
+ gpt-oss-120b,388,4.479287977,1356,61340,0.808,0.001361942,27.00733404,94.45,5.37%
+ grok-3-mini,391,4.055940505,1360,58010,0.828,0.000895661,26.12147499,23.11,4.63%
+ grok-4,360,4.308828831,1430,67520,0.866,0.029202342,60.95525411,13.82,12.20%
+ Kimi-K2-Instruct,325,4.177138663,1420,48560,0.824,0.002379296,65.0222057,96.77,20.73%
+ llama-3_1-Nemotron-Ultra-253B-v1,391,4.020264345,1345,46420,0.825,0.003451924,61.53657957,29.62,4.63%
+ llama-3_3-Nemotron-Super-49B-v1,392,3.883310532,1324,40473,0.698,0.000455359,32.63831081,12.47,4.39%
+ llama-4-maverick,388,3.640194992,1330,41730,0.809,0.000496574,10.65014104,9.93,5.37%
+ llama-4-Scout,393,3.614481399,1318,33060,0.752,0.000410484,10.86684261,23.67,4.15%
+ magistral-small-2506,390,3.713933337,1347,35950,0.746,0.0019781,17.53939687,52.3,4.88%
+ mistral-large-2411,392,3.714675671,1313,27013,0.697,0.006101138,24.36368715,66.7,4.39%
+ nova-lite-v1,393,3.538201832,1262,24540,0.59,0.00018322,5.288625128,21.75,4.15%
+ nova-pro-v1,389,3.490835422,1289,28830,0.691,0.001800498,7.528192069,23.32,5.12%
+ o3,391,4.409851586,1451,67070,0.853,0.018504711,63.89621339,69.79,4.63%
+ o4-mini,393,4.27410734,1398,65050,0.832,0.008704364,39.05469579,48.74,4.15%
+ phi-4,392,3.657791802,1258,27950,0.714,0.000240436,7.744667446,73.7,4.39%
+ Qwen3-14B,392,3.976245179,,45235,0.774,0.000789184,61.11544056,94.45,4.39%
+ Qwen3-235B-A22B-Thinking-2507,331,4.394399183,1401,63590,0.843,0.00416518,78.79346155,23.11,19.27%
+ Qwen3-30B-A3B,390,3.952481327,1380,42340,0.777,0.000763337,72.64171253,13.82,4.88%
static/manifest.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "name": "AutoBench Leaderboard",
+ "short_name": "AutoBench",
+ "description": "Interactive leaderboard for AutoBench LLM evaluations",
+ "start_url": "/",
+ "display": "standalone",
+ "background_color": "#ffffff",
+ "theme_color": "#000000",
+ "icons": []
+ }