Commit df35d1a
Parent(s): 1a0aefb
Add multi-leaderboard support with navigation, enhanced metrics, and correlations
- README.md +151 -2
- app.py +517 -236
- {data → runs/run_2025-04-25}/avg_latency.csv +0 -0
- runs/run_2025-04-25/correlations.json +8 -0
- {data → runs/run_2025-04-25}/cost_data.csv +0 -0
- {data → runs/run_2025-04-25}/domain_ranks.csv +0 -0
- runs/run_2025-04-25/metadata.json +9 -0
- {data → runs/run_2025-04-25}/p99_latency.csv +0 -0
- {data → runs/run_2025-04-25}/summary_data.csv +0 -0
- runs/run_2025-08-14/avg_latency.csv +34 -0
- runs/run_2025-08-14/correlations.json +8 -0
- runs/run_2025-08-14/cost_data.csv +34 -0
- runs/run_2025-08-14/domain_ranks.csv +34 -0
- runs/run_2025-08-14/metadata.json +9 -0
- runs/run_2025-08-14/p99_latency.csv +34 -0
- runs/run_2025-08-14/summary_data.csv +34 -0
- static/manifest.json +10 -0
README.md
CHANGED
@@ -8,7 +8,156 @@ sdk_version: 5.27.0
 app_file: app.py
 pinned: false
 license: mit
-short_description:
+short_description: Interactive multi-run leaderboard for AutoBench LLM evaluations with historical navigation
 ---

-
+# AutoBench LLM Leaderboard
+
+Interactive leaderboard for AutoBench, where Large Language Models (LLMs) evaluate and rank responses from other LLMs. This application supports multiple benchmark runs with seamless navigation between different time periods.
+
+## 🌟 Features
+
+### Multi-Run Navigation
+- **📊 Run Selector**: Switch between different AutoBench runs using the dropdown menu
+- **🕐 Historical Data**: View and compare results across different time periods
+- **🔄 Reactive Interface**: All tabs and visualizations update automatically when switching runs
+- **📈 Enhanced Metrics**: Support for evaluation iterations and fail rates in newer runs
+
+### Comprehensive Analysis
+- **Overall Ranking**: Model performance with AutoBench scores, costs, latency, and reliability metrics
+- **Benchmark Comparison**: Correlations with Chatbot Arena, AAI Index, and MMLU benchmarks
+- **Performance Plots**: Interactive scatter plots showing cost vs. performance trade-offs
+- **Cost & Latency Analysis**: Detailed breakdown by domain and response time percentiles
+- **Domain Performance**: Model rankings across specific knowledge areas
+
+### Dynamic Features
+- **📊 Benchmark Correlations**: Displays correlation percentages with other popular benchmarks
+- **💰 Cost Conversion**: Automatic conversion to cents for better readability
+- **⚡ Performance Metrics**: Average and P99 latency measurements
+- **🎯 Fail Rate Tracking**: Model reliability metrics (for supported runs)
+- **🔢 Iteration Counts**: Number of evaluations per model (for supported runs)
+
+## 🚀 How to Use
+
+### Navigation
+1. **Select a Run**: Use the dropdown menu at the top to choose between available benchmark runs
+2. **Explore Tabs**: Navigate through different analysis views using the tab interface
+3. **Interactive Tables**: Sort and filter data by clicking on column headers
+4. **Hover for Details**: Get additional information by hovering over chart elements
+
+### Understanding the Data
+- **AutoBench Score**: Higher scores indicate better performance
+- **Cost**: Lower values are better (displayed in cents per response)
+- **Latency**: Lower response times are better (average and P99 percentiles)
+- **Fail Rate**: Lower percentages indicate more reliable models
+- **Iterations**: Number of evaluation attempts per model
+
+## 🔧 Adding New Runs
+
+### Directory Structure
+```
+runs/
+├── run_YYYY-MM-DD/
+│   ├── metadata.json          # Run information and metadata
+│   ├── correlations.json      # Benchmark correlation data
+│   ├── summary_data.csv       # Main leaderboard data
+│   ├── domain_ranks.csv       # Domain-specific rankings
+│   ├── cost_data.csv          # Cost breakdown by domain
+│   ├── avg_latency.csv        # Average latency by domain
+│   └── p99_latency.csv        # P99 latency by domain
+```
+
+### Required Files
+
+#### 1. metadata.json
+```json
+{
+  "run_id": "run_2025-08-14",
+  "title": "AutoBench Run 3 - August 2025",
+  "date": "2025-08-14",
+  "description": "Latest AutoBench run with enhanced metrics",
+  "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
+  "model_count": 34,
+  "is_latest": true
+}
+```
+
+#### 2. correlations.json
+```json
+{
+  "correlations": {
+    "Chatbot Arena": 82.51,
+    "Artificial Analysis Intelligence Index": 83.74,
+    "MMLU": 71.51
+  },
+  "description": "Correlation percentages between AutoBench scores and other benchmark scores"
+}
+```
+
+#### 3. summary_data.csv
+Required columns:
+- `Model`: Model name
+- `AutoBench`: AutoBench score
+- `Costs (USD)`: Cost per response in USD
+- `Avg Answer Duration (sec)`: Average response time
+- `P99 Answer Duration (sec)`: 99th percentile response time
+
+Optional columns (for enhanced metrics):
+- `Iterations`: Number of evaluation iterations
+- `Fail Rate %`: Percentage of failed responses
+- `LMArena` or `Chatbot Ar.`: Chatbot Arena scores
+- `MMLU-Pro` or `MMLU Index`: MMLU benchmark scores
+- `AAI Index`: Artificial Analysis Intelligence Index scores
+
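The required `summary_data.csv` headers listed above can be checked before a run is published. The following is a minimal sketch using only the standard library; `REQUIRED_COLUMNS` mirrors the README list, while `missing_required_columns` is a hypothetical helper that does not exist in `app.py`, and the sample row holds made-up values.

```python
import csv
import io

# Required headers, copied from the README's "Required columns" list.
REQUIRED_COLUMNS = [
    "Model",
    "AutoBench",
    "Costs (USD)",
    "Avg Answer Duration (sec)",
    "P99 Answer Duration (sec)",
]


def missing_required_columns(csv_file) -> list:
    """Return the required headers absent from the CSV's first row."""
    header = next(csv.reader(csv_file), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Illustrative CSV with made-up values; the P99 column is deliberately missing.
sample = io.StringIO(
    "Model,AutoBench,Costs (USD),Avg Answer Duration (sec)\n"
    "model-a,4.2,0.0123,10.5\n"
)
missing = missing_required_columns(sample)
print(missing)
```

An empty result means every required header is present; the optional columns are not checked here because the app is described as adapting to their absence.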
+### Adding a New Run
+
+1. **Create Directory**: `mkdir runs/run_YYYY-MM-DD`
+2. **Add Data Files**: Copy your CSV files to the new directory
+3. **Create Metadata**: Add `metadata.json` with run information
+4. **Add Correlations**: Create `correlations.json` with benchmark correlations
+5. **Update Previous Run**: Set `"is_latest": false` in the previous latest run's metadata
+6. **Restart App**: The new run will be automatically discovered
+
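Steps 1, 3, and 5 of the procedure above lend themselves to a small script. This is a sketch under stated assumptions, not part of the commit: `scaffold_run` is a hypothetical helper, the titles and model counts are placeholders, and the `description` and `blog_url` fields are left empty for the maintainer to fill in. It runs inside a temporary directory so nothing in a real `runs/` tree is touched.

```python
import json
import tempfile
from pathlib import Path


def scaffold_run(runs_dir: Path, date: str, title: str, model_count: int) -> Path:
    """Create runs_dir/run_<date>/metadata.json and demote the previous latest run."""
    runs_dir.mkdir(parents=True, exist_ok=True)

    # Step 5: clear the is_latest flag on every existing run.
    for meta_path in runs_dir.glob("run_*/metadata.json"):
        meta = json.loads(meta_path.read_text())
        if meta.get("is_latest"):
            meta["is_latest"] = False
            meta_path.write_text(json.dumps(meta, indent=2))

    # Steps 1 and 3: create the directory and write its metadata.
    run_dir = runs_dir / f"run_{date}"
    run_dir.mkdir(exist_ok=True)
    metadata = {
        "run_id": f"run_{date}",
        "title": title,
        "date": date,
        "description": "",  # placeholder
        "blog_url": "",  # placeholder
        "model_count": model_count,
        "is_latest": True,
    }
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return run_dir


# Demonstration in a throwaway directory with placeholder titles and counts.
with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp) / "runs"
    scaffold_run(runs, "2025-04-25", "Example run A", 27)
    new_dir = scaffold_run(runs, "2025-08-14", "Example run B", 34)
    old_meta = json.loads((runs / "run_2025-04-25" / "metadata.json").read_text())
    new_meta = json.loads((new_dir / "metadata.json").read_text())
    print(old_meta["is_latest"], new_meta["is_latest"])
```

The CSV files and `correlations.json` (steps 2 and 4) still have to be copied in by hand.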
+### Column Compatibility
+
+The application automatically adapts to different column structures:
+- **Legacy Runs**: Support basic columns (Model, AutoBench, Cost, Latency)
+- **Enhanced Runs**: Include additional metrics (Iterations, Fail Rate %)
+- **Flexible Naming**: Handles variations in benchmark column names
+
+## 🛠️ Development
+
+### Requirements
+- Python 3.10+ (required by Gradio 5)
+- Gradio 5.27.0+
+- Pandas
+- Plotly
+
+### Installation
+```bash
+pip install -r requirements.txt
+```
+
+### Running Locally
+```bash
+python app.py
+```
+
+The app will automatically discover available runs and launch on a local port.
+
+### Killing All Python Processes (Windows)
+```bash
+taskkill /F /IM python.exe 2>/dev/null || echo "No Python processes to kill"
+```
+
+## 📊 Data Sources
+
+AutoBench evaluations are conducted using LLM-generated questions across diverse domains, with responses ranked by evaluation LLMs. For more information about the methodology, visit the [AutoBench blog posts](https://huggingface.co/blog/PeterKruger/autobench).
+
+## 📄 License
+
+MIT License - see LICENSE file for details.
+
+---
+
+Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment options.
app.py
CHANGED
@@ -1,15 +1,206 @@
|
|
1 |
import gradio as gr
|
2 |
import pandas as pd
|
3 |
import plotly.express as px
|
4 |
-
import os
|
5 |
|
6 |
# --- Configuration ---
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
13 |
|
14 |
# --- Helper Function to Load Data ---
|
15 |
def load_data(filepath, separator=','):
|
@@ -33,281 +224,343 @@ def load_data(filepath, separator=','):
|
|
33 |
print(f"Error loading {filepath}: {e}")
|
34 |
return pd.DataFrame()
|
35 |
|
36 |
-
# ---
|
37 |
-
print("
|
38 |
-
|
39 |
-
|
40 |
-
|
41 |
-
|
42 |
-
df_p99_latency = load_data(P99_LATENCY_FILE)
|
43 |
-
print("Data loading complete.")
|
44 |
|
45 |
-
#
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
# Convert summary cost
|
50 |
-
if not df_summary.empty and COST_COLUMN_SUMMARY in df_summary.columns:
|
51 |
-
df_summary[COST_COLUMN_SUMMARY] = (pd.to_numeric(df_summary[COST_COLUMN_SUMMARY], errors='coerce') * 100).round(3) # <-- ADDED .round(3)
|
52 |
-
df_summary.rename(columns={COST_COLUMN_SUMMARY: NEW_COST_COLUMN_SUMMARY}, inplace=True)
|
53 |
-
print(f"Converted '{COST_COLUMN_SUMMARY}' to $ Cents and renamed to '{NEW_COST_COLUMN_SUMMARY}' in df_summary.")
|
54 |
-
else:
|
55 |
-
print(f"Warning: Column '{COST_COLUMN_SUMMARY}' not found in df_summary for conversion.")
|
56 |
-
|
57 |
-
# Convert cost breakdown data
|
58 |
-
if not df_cost.empty:
|
59 |
-
# IMPORTANT: Check if your model name column in cost_data.csv is 'model_name' or 'Model Name' etc.
|
60 |
-
model_col_name = 'model_name' # Adjust if needed
|
61 |
-
cost_cols = [col for col in df_cost.columns if col != model_col_name]
|
62 |
-
for col in cost_cols:
|
63 |
-
# Handle potential non-numeric data gracefully before multiplying
|
64 |
-
df_cost[col] = (pd.to_numeric(df_cost[col], errors='coerce') * 100).round(3) # <-- ADDED .round(3)
|
65 |
-
print("Converted cost breakdown columns to $ Cents in df_cost.")
|
66 |
-
# --- *** End of Cost Conversion *** ---
|
67 |
-
|
68 |
-
# Rename columns for clarity if needed (example for summary)
|
69 |
-
# Make sure the original names match your CSV headers EXACTLY
|
70 |
-
try:
|
71 |
-
df_summary = df_summary.rename(columns={
|
72 |
-
'Model Name': 'Model', # If your CSV uses 'Model Name'
|
73 |
-
# Add other renames here if your CSV headers differ from the target names below
|
74 |
-
# 'Costs (USD)': 'Avg Cost (USD/response)',
|
75 |
-
# 'Avg Answer Duration (sec)': 'Avg Latency (s)',
|
76 |
-
# 'P99 Answer Duration (sec)': 'P99 Latency (s)'
|
77 |
-
})
|
78 |
-
# Select and reorder columns for the main table - REMOVED BENCHMARK COLUMNS
|
79 |
-
summary_cols_display = ['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)', 'P99 Answer Duration (sec)']
|
80 |
-
# Filter to only columns that actually exist after loading and renaming
|
81 |
-
summary_cols_display = [col for col in summary_cols_display if col in df_summary.columns]
|
82 |
-
df_summary_display = df_summary[summary_cols_display].copy() # Use .copy() to avoid SettingWithCopyWarning
|
83 |
-
|
84 |
-
# Select columns for the new benchmark comparison table
|
85 |
-
benchmark_cols = ['Model', 'AutoBench', 'Chatbot Ar.', 'AAI Index', 'MMLU Index']
|
86 |
-
benchmark_cols = [col for col in benchmark_cols if col in df_summary.columns] # Filter existing
|
87 |
-
df_benchmark_display = df_summary[benchmark_cols].copy() # Use .copy()
|
88 |
-
|
89 |
-
# Ensure AutoBench score is numeric for sorting BOTH display tables
|
90 |
-
if 'AutoBench' in df_summary_display.columns:
|
91 |
-
df_summary_display['AutoBench'] = pd.to_numeric(df_summary_display['AutoBench'], errors='coerce')
|
92 |
-
df_summary_display.sort_values(by='AutoBench', ascending=False, inplace=True) # Use inplace=True
|
93 |
-
else:
|
94 |
-
print("Warning: 'AutoBench' column not found for sorting summary table.")
|
95 |
-
|
96 |
-
if 'AutoBench' in df_benchmark_display.columns:
|
97 |
-
df_benchmark_display['AutoBench'] = pd.to_numeric(df_benchmark_display['AutoBench'], errors='coerce')
|
98 |
-
df_benchmark_display.sort_values(by='AutoBench', ascending=False, inplace=True) # Use inplace=True
|
99 |
-
else:
|
100 |
-
print("Warning: 'AutoBench' column not found for sorting benchmark table.")
|
101 |
-
|
102 |
-
except KeyError as e:
|
103 |
-
print(f"Error preparing display columns: Missing key {e}. Check CSV headers and rename mapping.")
|
104 |
-
df_summary_display = df_summary.copy() # Fallback
|
105 |
-
df_benchmark_display = pd.DataFrame() # Fallback to empty for benchmark table
|
106 |
|
107 |
|
108 |
# --- Build Gradio App ---
|
109 |
with gr.Blocks(theme=gr.themes.Soft()) as app:
|
110 |
gr.Markdown("# AutoBench LLM Leaderboard")
|
111 |
gr.Markdown(
|
112 |
"Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
|
113 |
-
"Includes performance, cost, and latency metrics."
|
114 |
-
"
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
|
119 |
# --- Tab 1: Overall Ranking ---
|
120 |
with gr.Tab("Overall Ranking"):
|
121 |
gr.Markdown("## Overall Model Performance")
|
122 |
-
|
123 |
-
|
124 |
-
#
|
125 |
-
|
126 |
-
|
127 |
-
|
128 |
-
|
129 |
-
|
130 |
-
|
131 |
-
|
132 |
-
|
133 |
-
# Adjust datatype length based on potentially fewer columns
|
134 |
-
datatype=['str'] + ['number'] * (len(df_overall_rank_display.columns) - 1),
|
135 |
-
interactive=True, # Allows sorting
|
136 |
-
# height=600 # Adjust height as needed
|
137 |
-
)
|
138 |
-
else:
|
139 |
-
gr.Markdown("_(Summary data failed to load or is empty. Please check `summary_data.csv`)_")
|
140 |
|
141 |
-
# ---
|
142 |
with gr.Tab("Benchmark Comparison"):
|
143 |
gr.Markdown("## Benchmark Comparison")
|
144 |
gr.Markdown("Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.")
|
145 |
-
|
146 |
-
|
147 |
-
|
148 |
-
|
149 |
-
|
150 |
-
|
151 |
-
else:
|
152 |
-
gr.Markdown("_(Benchmark comparison data could not be prepared. Check `summary_data.csv` for 'Chatbot Ar.', 'AAI Index', 'MMLU Index' columns.)_")
|
153 |
|
154 |
-
# --- Tab
|
155 |
with gr.Tab("Performance Plots"):
|
156 |
gr.Markdown("## Performance Visualizations")
|
157 |
gr.Markdown("Exploring relationships between AutoBench Rank, Latency, and Cost.")
|
158 |
|
159 |
-
# Scatter Plot 1
|
160 |
gr.Markdown("### Rank vs. Average Cost")
|
161 |
-
|
162 |
-
|
163 |
-
|
164 |
-
plot_df[NEW_COST_COLUMN_SUMMARY] = pd.to_numeric(plot_df[NEW_COST_COLUMN_SUMMARY], errors='coerce')
|
165 |
-
plot_df = plot_df.dropna(subset=[NEW_COST_COLUMN_SUMMARY]) # Drop if cost conversion failed
|
166 |
-
|
167 |
-
if not plot_df.empty:
|
168 |
-
fig_cost = px.scatter(
|
169 |
-
plot_df,
|
170 |
-
x=NEW_COST_COLUMN_SUMMARY,
|
171 |
-
y="AutoBench",
|
172 |
-
text="Model", # Show model name near point
|
173 |
-
log_x=True, # Use log scale for cost
|
174 |
-
title="AutoBench Rank vs. Average Cost per Response ($ Cents - Log Scale)",
|
175 |
-
labels={'AutoBench': 'AutoBench Rank', NEW_COST_COLUMN_SUMMARY: 'Avg Cost ($ Cents) - Log Scale'},
|
176 |
-
hover_data=['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)'] # Show details on hover
|
177 |
-
)
|
178 |
-
fig_cost.update_traces(textposition='top center')
|
179 |
-
fig_cost.update_layout(
|
180 |
-
xaxis_title="Avg Cost ($ Cents) - Log Scale", # Keep bottom axis title
|
181 |
-
yaxis_title="AutoBench Rank",
|
182 |
-
width=1000, # Your existing width
|
183 |
-
height=800, # Your existing height (if you added it)
|
184 |
-
# --- ADD THE FOLLOWING ---
|
185 |
-
xaxis2=dict(
|
186 |
-
overlaying='x', # Link to primary x-axis
|
187 |
-
matches='x', # Explicitly match primary x-axis properties (like type='log')
|
188 |
-
side='top', # Position on top
|
189 |
-
showticklabels=True,# Show the labels (numbers)
|
190 |
-
showline=True, # Explicitly show the axis line itself
|
191 |
-
title=None # No title for the top axis
|
192 |
-
)
|
193 |
-
# --- END OF ADDITION ---
|
194 |
-
)
|
195 |
-
gr.Plot(fig_cost)
|
196 |
-
else:
|
197 |
-
gr.Markdown("_(Insufficient valid data for Rank vs Cost plot. Check 'AutoBench' and NEW_COST_COLUMN_SUMMARY columns in `summary_data.csv`)_")
|
198 |
-
else:
|
199 |
-
gr.Markdown("_(Summary data failed to load or essential columns missing for Rank vs Cost plot)_")
|
200 |
|
201 |
# Plot 2: Rank vs Average Latency
|
202 |
gr.Markdown("### Rank vs. Average Latency")
|
203 |
-
|
204 |
-
|
205 |
-
|
206 |
-
plot_df_avg_latency['Avg Answer Duration (sec)'] = pd.to_numeric(plot_df_avg_latency['Avg Answer Duration (sec)'], errors='coerce')
|
207 |
-
plot_df_avg_latency = plot_df_avg_latency.dropna(subset=['Avg Answer Duration (sec)']) # Drop if conversion failed
|
208 |
-
|
209 |
-
if not plot_df_avg_latency.empty:
|
210 |
-
fig_avg_latency = px.scatter(
|
211 |
-
plot_df_avg_latency,
|
212 |
-
x="Avg Answer Duration (sec)",
|
213 |
-
y="AutoBench",
|
214 |
-
text="Model",
|
215 |
-
log_x=True, # Use log scale for latency - adjust if not desired
|
216 |
-
title="AutoBench Rank vs. Average Latency (Log Scale)",
|
217 |
-
labels={'AutoBench': 'AutoBench Rank', 'Avg Answer Duration (sec)': 'Avg Latency (s) - Log Scale'},
|
218 |
-
hover_data=['Model', 'AutoBench', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY]
|
219 |
-
)
|
220 |
-
fig_avg_latency.update_traces(textposition='top center')
|
221 |
-
fig_avg_latency.update_layout(xaxis_title="Avg Latency (s) - Log Scale", yaxis_title="AutoBench Rank", width=1000, height=800)
|
222 |
-
gr.Plot(fig_avg_latency)
|
223 |
-
else:
|
224 |
-
gr.Markdown("_(Insufficient valid data for Rank vs Avg Latency plot. Check 'AutoBench' and 'Avg Answer Duration (sec)' columns in `summary_data.csv`)_")
|
225 |
-
else:
|
226 |
-
gr.Markdown("_(Summary data failed to load or essential columns missing for Rank vs Avg Latency plot)_")
|
227 |
-
|
228 |
|
229 |
# Plot 3: Rank vs P99 Latency
|
230 |
gr.Markdown("### Rank vs. P99 Latency")
|
231 |
-
|
232 |
-
|
233 |
-
|
234 |
-
plot_df_p99_latency['P99 Answer Duration (sec)'] = pd.to_numeric(plot_df_p99_latency['P99 Answer Duration (sec)'], errors='coerce')
|
235 |
-
plot_df_p99_latency = plot_df_p99_latency.dropna(subset=['P99 Answer Duration (sec)']) # Drop if conversion failed
|
236 |
-
|
237 |
-
if not plot_df_p99_latency.empty:
|
238 |
-
fig_p99_latency = px.scatter(
|
239 |
-
plot_df_p99_latency,
|
240 |
-
x="P99 Answer Duration (sec)",
|
241 |
-
y="AutoBench",
|
242 |
-
text="Model",
|
243 |
-
log_x=True, # Use log scale for latency - adjust if not desired
|
244 |
-
title="AutoBench Rank vs. P99 Latency (Log Scale)",
|
245 |
-
labels={'AutoBench': 'AutoBench Rank', 'P99 Answer Duration (sec)': 'P99 Latency (s) - Log Scale'},
|
246 |
-
hover_data=['Model', 'AutoBench', 'P99 Answer Duration (sec)', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY]
|
247 |
-
)
|
248 |
-
fig_p99_latency.update_traces(textposition='top center')
|
249 |
-
fig_p99_latency.update_layout(xaxis_title="P99 Latency (s) - Log Scale", yaxis_title="AutoBench Rank", width=1000, height=800)
|
250 |
-
gr.Plot(fig_p99_latency)
|
251 |
-
else:
|
252 |
-
gr.Markdown("_(Insufficient valid data for Rank vs P99 Latency plot. Check 'AutoBench' and 'P99 Answer Duration (sec)' columns in `summary_data.csv`)_")
|
253 |
-
else:
|
254 |
-
gr.Markdown("_(Summary data failed to load or essential columns missing for Rank vs P99 Latency plot)_")
|
255 |
|
256 |
-
# --- Tab
|
257 |
with gr.Tab("Cost & Latency Analysis"):
|
258 |
gr.Markdown("## Performance vs. Cost/Latency Trade-offs")
|
259 |
|
260 |
# Cost Breakdown Table
|
261 |
-
gr.Markdown("### Cost Breakdown per Domain ($ Cents/Response)")
|
262 |
-
|
263 |
-
|
264 |
-
|
265 |
-
|
266 |
-
df_cost_display = df_cost[cols]
|
267 |
-
else:
|
268 |
-
df_cost_display = df_cost # Use as is if 'model_name' isn't found
|
269 |
-
gr.DataFrame(df_cost_display, interactive=True)
|
270 |
else:
|
271 |
-
|
272 |
|
273 |
# Latency Breakdown Tables
|
274 |
gr.Markdown("### Average Latency Breakdown per Domain (Seconds)")
|
275 |
-
|
276 |
-
|
277 |
-
|
278 |
-
|
279 |
-
else:
|
280 |
-
df_avg_latency_display = df_avg_latency
|
281 |
-
gr.DataFrame(df_avg_latency_display, interactive=True)
|
282 |
else:
|
283 |
-
|
284 |
|
285 |
gr.Markdown("### P99 Latency Breakdown per Domain (Seconds)")
|
286 |
-
|
287 |
-
|
288 |
-
|
289 |
-
|
290 |
-
else:
|
291 |
-
df_p99_latency_display = df_p99_latency
|
292 |
-
gr.DataFrame(df_p99_latency_display, interactive=True)
|
293 |
else:
|
294 |
-
|
295 |
|
296 |
|
297 |
-
# --- Tab
|
298 |
with gr.Tab("Domain Performance"):
|
299 |
gr.Markdown("## Performance Across Different Domains")
|
300 |
gr.Markdown("Model ranks within specific knowledge or task areas. Higher is better.")
|
301 |
-
|
302 |
-
|
303 |
-
|
304 |
-
|
305 |
-
|
306 |
-
else:
|
307 |
-
df_domain_display = df_domain # Use as is
|
308 |
-
gr.DataFrame(df_domain_display, interactive=True)
|
309 |
else:
|
310 |
-
|
311 |
|
312 |
# --- Tab 5: About ---
|
313 |
with gr.Tab("About AutoBench"):
|
@@ -339,8 +592,36 @@ with gr.Blocks(theme=gr.themes.Soft()) as app:
|
|
339 |
|
340 |
**Disclaimer:** Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.
|
341 |
""")
|
342 |
|
343 |
# --- Launch the App ---
|
344 |
print("Launching Gradio app...")
|
345 |
-
app.launch(
|
346 |
print("Gradio app launched.")
|
|
|
1 |
import gradio as gr
|
2 |
import pandas as pd
|
3 |
import plotly.express as px
|
4 |
+
import os
|
5 |
+
import json
|
6 |
+
from typing import Dict, List
|
7 |
|
8 |
# --- Configuration ---
|
9 |
+
RUNS_DIR = "runs"
|
10 |
+
DATA_DIR = "." # For backward compatibility
|
11 |
+
COST_COLUMN_SUMMARY = 'Costs (USD)'
|
12 |
+
NEW_COST_COLUMN_SUMMARY = 'Avg Cost ($ Cents)'
|
13 |
+
+# --- Multi-Run Support Functions ---
+def discover_available_runs() -> List[Dict]:
+    """Scan runs directory and return sorted list of available runs with metadata."""
+    runs = []
+
+    if not os.path.exists(RUNS_DIR):
+        # Fallback to old structure
+        if os.path.exists("data"):
+            return [{
+                "run_id": "legacy",
+                "title": "AutoBench Run 2 - April 2025",
+                "date": "2025-04-25",
+                "description": "Current run data",
+                "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-2nd-run",
+                "model_count": 27,
+                "is_latest": True,
+                "path": "data"
+            }]
+        return []
+
+    for run_dir in os.listdir(RUNS_DIR):
+        run_path = os.path.join(RUNS_DIR, run_dir)
+        if os.path.isdir(run_path):
+            metadata_path = os.path.join(run_path, "metadata.json")
+            if os.path.exists(metadata_path):
+                try:
+                    with open(metadata_path, 'r') as f:
+                        metadata = json.load(f)
+                    metadata["path"] = run_path
+                    runs.append(metadata)
+                except Exception as e:
+                    print(f"Error loading metadata for {run_dir}: {e}")
+
+    # Sort by date, newest first
+    runs.sort(key=lambda x: x.get("date", ""), reverse=True)
+    return runs
+
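The discovery function sorts on the `date` string rather than on a parsed date, which works because ISO 8601 dates (`YYYY-MM-DD`) order chronologically as plain text. The sketch below illustrates that behavior in a temporary directory; `discover_runs_in` is a hypothetical variant that takes the directory as a parameter and is not part of `app.py`.

```python
import json
import os
import tempfile
from typing import Dict, List


def discover_runs_in(runs_dir: str) -> List[Dict]:
    """Same scan-and-sort idea as discover_available_runs, with the directory passed in."""
    runs = []
    for name in os.listdir(runs_dir):
        meta_path = os.path.join(runs_dir, name, "metadata.json")
        if os.path.isfile(meta_path):
            with open(meta_path, "r") as f:
                meta = json.load(f)
            meta["path"] = os.path.join(runs_dir, name)
            runs.append(meta)
    # ISO 8601 dates sort chronologically as plain strings; newest first.
    runs.sort(key=lambda m: m.get("date", ""), reverse=True)
    return runs


with tempfile.TemporaryDirectory() as tmp:
    for date in ("2025-04-25", "2025-08-14"):
        run_dir = os.path.join(tmp, f"run_{date}")
        os.makedirs(run_dir)
        with open(os.path.join(run_dir, "metadata.json"), "w") as f:
            json.dump({"run_id": f"run_{date}", "date": date}, f)
    # A directory without metadata.json is skipped silently.
    os.makedirs(os.path.join(tmp, "run_without_metadata"))
    ordered_ids = [run["run_id"] for run in discover_runs_in(tmp)]
    print(ordered_ids)
```

Two consequences follow from this design: a run whose metadata lacks `date` sorts last, and a date written in any other format (for example `14/08/2025`) would be ordered incorrectly.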
+def load_run_metadata(run_id: str) -> Dict:
+    """Load metadata for a specific run."""
+    runs = discover_available_runs()
+    for run in runs:
+        if run["run_id"] == run_id:
+            return run
+    return {}
+
+def get_run_file_path(run_path: str, filename: str) -> str:
+    """Get the full path to a data file for a specific run."""
+    return os.path.join(run_path, filename)
+
+
+def load_correlations(run_path: str) -> Dict:
+    """Load correlation data for a specific run."""
+    correlations_file = get_run_file_path(run_path, "correlations.json")
+    if os.path.exists(correlations_file):
+        try:
+            with open(correlations_file, 'r') as f:
+                return json.load(f)
+        except Exception as e:
+            print(f"Error loading correlations from {correlations_file}: {e}")
+    return {}
+
+
+def format_correlations_text(correlations_data: Dict) -> str:
+    """Format correlation data into a readable text string."""
+    if not correlations_data or 'correlations' not in correlations_data:
+        return ""
+
+    correlations = correlations_data['correlations']
+    if not correlations:
+        return ""
+
+    # Format the correlation text
+    correlation_parts = []
+    for benchmark, percentage in correlations.items():
+        correlation_parts.append(f"{percentage}% with {benchmark}")
+
+    if correlation_parts:
+        return f"**Benchmark Correlations:** AutoBench features " + ", ".join(correlation_parts) + "."
+    return ""
+
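To make the formatting concrete, here is a small worked example using the correlation values from the README's sample `correlations.json`. The function `format_correlations` is a compact stand-in written for this illustration, not the code from the commit.

```python
from typing import Dict


def format_correlations(correlations_data: Dict) -> str:
    """Compact equivalent of format_correlations_text, for illustration only."""
    correlations = correlations_data.get("correlations") or {}
    if not correlations:
        return ""
    parts = [f"{pct}% with {name}" for name, pct in correlations.items()]
    return "**Benchmark Correlations:** AutoBench features " + ", ".join(parts) + "."


# Values taken from the README's correlations.json example.
example = {
    "correlations": {
        "Chatbot Arena": 82.51,
        "Artificial Analysis Intelligence Index": 83.74,
        "MMLU": 71.51,
    }
}
text = format_correlations(example)
print(text)
```

The benchmarks appear in the order of the keys in the JSON file, since Python dictionaries preserve insertion order, and a missing or empty `correlations` key yields an empty string.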
+def load_run_data(run_id: str) -> Dict[str, pd.DataFrame]:
+    """Load all CSV data for a specific run."""
+    runs = discover_available_runs()
+    run_metadata = None
+
+    for run in runs:
+        if run["run_id"] == run_id:
+            run_metadata = run
+            break
+
+    if not run_metadata:
+        print(f"Run {run_id} not found")
+        return {}
+
+    run_path = run_metadata["path"]
+
+    # Load all data files
+    data = {}
+    file_mapping = {
+        "summary": "summary_data.csv",
+        "domain": "domain_ranks.csv",
+        "cost": "cost_data.csv",
+        "avg_latency": "avg_latency.csv",
+        "p99_latency": "p99_latency.csv"
+    }
+
+    for key, filename in file_mapping.items():
+        filepath = get_run_file_path(run_path, filename)
+        data[key] = load_data(filepath)
+
+    # Process the data (cost conversion, etc.)
+    data = process_run_data(data)
+
+    # Load correlations
+    correlations = load_correlations(run_path)
+    data["correlations"] = correlations
+
+    return data
+
+def process_run_data(data: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
+    """Process and clean the loaded data (cost conversion, sorting, etc.)."""
+    df_summary = data.get("summary", pd.DataFrame())
+    df_cost = data.get("cost", pd.DataFrame())
+
+    # Convert costs to USD cents (existing logic)
+    if not df_summary.empty and COST_COLUMN_SUMMARY in df_summary.columns:
+        df_summary[COST_COLUMN_SUMMARY] = (pd.to_numeric(df_summary[COST_COLUMN_SUMMARY], errors='coerce') * 100).round(3)
+        df_summary.rename(columns={COST_COLUMN_SUMMARY: NEW_COST_COLUMN_SUMMARY}, inplace=True)
+
+    # Convert cost breakdown data
+    if not df_cost.empty:
+        model_col_name = 'model_name'
+        cost_cols = [col for col in df_cost.columns if col != model_col_name]
+        for col in cost_cols:
+            df_cost[col] = (pd.to_numeric(df_cost[col], errors='coerce') * 100).round(3)
+
+    # Rename columns and create display dataframes
+    try:
+        df_summary = df_summary.rename(columns={'Model Name': 'Model'})
+
+        # Summary display table - include new columns if they exist
+        base_cols = ['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)', 'P99 Answer Duration (sec)']
+
+        # Add new columns at the end: Fail Rate % before Iterations
+        if 'Fail Rate %' in df_summary.columns:
+            base_cols.append('Fail Rate %')
+        if 'Iterations' in df_summary.columns:
+            base_cols.append('Iterations')
+
+        summary_cols_display = [col for col in base_cols if col in df_summary.columns]
+        df_summary_display = df_summary[summary_cols_display].copy()
+
+        # Benchmark display table - handle both old and new column names
+        benchmark_cols = ['Model', 'AutoBench']
+
+        # Handle different column name variations
+        chatbot_col = None
+        mmlu_col = None
+
+        for col in df_summary.columns:
+            if col in ['Chatbot Ar.', 'LMArena']:
+                chatbot_col = col
+            elif col in ['MMLU Index', 'MMLU-Pro']:
+                mmlu_col = col
+
+        if chatbot_col:
+            benchmark_cols.append(chatbot_col)
+        if 'AAI Index' in df_summary.columns:
+            benchmark_cols.append('AAI Index')
+        if mmlu_col:
+            benchmark_cols.append(mmlu_col)
+
+        benchmark_cols = [col for col in benchmark_cols if col in df_summary.columns]
+        df_benchmark_display = df_summary[benchmark_cols].copy()
+
+        # Sort by AutoBench score
+        for df in [df_summary_display, df_benchmark_display]:
+            if 'AutoBench' in df.columns:
+                df['AutoBench'] = pd.to_numeric(df['AutoBench'], errors='coerce')
+                df.sort_values(by='AutoBench', ascending=False, inplace=True)
+
+        data["summary_display"] = df_summary_display
+        data["benchmark_display"] = df_benchmark_display
+
+    except Exception as e:
+        print(f"Error processing display data: {e}")
+        data["summary_display"] = df_summary.copy()
+        data["benchmark_display"] = pd.DataFrame()
+
+    return data
|
204 |
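Two ideas in `process_run_data` are easy to check in isolation: the USD-to-cents conversion and the lookup of benchmark columns under alternative header names. The sketch below reproduces both in plain Python, without pandas, purely for illustration; `pick_column` and `usd_to_cents` are hypothetical names, and where the pandas code coerces bad values to `NaN`, this version returns `None`.

```python
from typing import Iterable, Optional

# Header variants that process_run_data accepts for each external benchmark.
CHATBOT_ALIASES = ("Chatbot Ar.", "LMArena")
MMLU_ALIASES = ("MMLU Index", "MMLU-Pro")


def pick_column(columns: Iterable[str], aliases: Iterable[str]) -> Optional[str]:
    """Return the last header in columns that matches one of aliases, else None."""
    found = None
    for col in columns:
        if col in aliases:
            found = col
    return found


def usd_to_cents(value) -> Optional[float]:
    """Convert a USD amount to cents rounded to 3 decimals; None if not numeric."""
    try:
        return round(float(value) * 100, 3)
    except (TypeError, ValueError):
        return None


# Illustrative header rows for an older and a newer run.
legacy_headers = ["Model", "AutoBench", "Chatbot Ar.", "AAI Index", "MMLU Index"]
newer_headers = ["Model", "AutoBench", "LMArena", "MMLU-Pro", "AAI Index"]

print(pick_column(legacy_headers, CHATBOT_ALIASES))
print(pick_column(newer_headers, MMLU_ALIASES))
print(usd_to_cents("0.0123"), usd_to_cents("n/a"))
```

As in the original loop, if a file carried both spellings of a benchmark column, the one appearing later in the header row would be used.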
|
205 |
# --- Helper Function to Load Data ---
|
206 |
def load_data(filepath, separator=','):
|
|
|
224 |
print(f"Error loading {filepath}: {e}")
|
225 |
return pd.DataFrame()
|
226 |
|
227 |
+
# --- Initialize Multi-Run System ---
|
228 |
+
print("Discovering available runs...")
|
229 |
+
available_runs = discover_available_runs()
|
230 |
+
if not available_runs:
|
231 |
+
print("No runs found! Please check the runs/ directory structure.")
|
232 |
+
raise SystemExit(1)
|
233 |
|
234 |
+
# Get the latest run as default
|
235 |
+
latest_run = available_runs[0]
|
236 |
+
print(f"Found {len(available_runs)} run(s). Latest: {latest_run['title']}")
|
237 |
|
238 |
+
# Initialize with latest run data
|
239 |
+
print("Loading latest run data...")
|
240 |
+
current_data = load_run_data(latest_run["run_id"])
|
241 |
+
print("Data loading complete.")
|
242 |
+
|
243 |
+
# --- Plotting Functions ---
|
244 |
+
def create_cost_scatter_plot(data: Dict[str, pd.DataFrame]) -> tuple:
|
245 |
+
"""Create the cost vs rank scatter plot."""
|
246 |
+
df_summary = data.get("summary", pd.DataFrame())
|
247 |
+
|
248 |
+
if df_summary.empty or 'AutoBench' not in df_summary.columns or NEW_COST_COLUMN_SUMMARY not in df_summary.columns:
|
249 |
+
return None, "_(Insufficient data for Rank vs Cost plot)_"
|
250 |
+
|
251 |
+
plot_df = df_summary.dropna(subset=['AutoBench', NEW_COST_COLUMN_SUMMARY, 'Model']).copy()
|
252 |
+
plot_df[NEW_COST_COLUMN_SUMMARY] = pd.to_numeric(plot_df[NEW_COST_COLUMN_SUMMARY], errors='coerce')
|
253 |
+
plot_df = plot_df.dropna(subset=[NEW_COST_COLUMN_SUMMARY])
|
254 |
+
|
255 |
+
if plot_df.empty:
|
256 |
+
return None, "_(No valid data for Rank vs Cost plot)_"
|
257 |
+
|
258 |
+
fig_cost = px.scatter(
|
259 |
+
plot_df,
|
260 |
+
x=NEW_COST_COLUMN_SUMMARY,
|
261 |
+
y="AutoBench",
|
262 |
+
text="Model",
|
263 |
+
log_x=True,
|
264 |
+
title="AutoBench Rank vs. Average Cost per Response ($ Cents - Log Scale)",
|
265 |
+
labels={'AutoBench': 'AutoBench Rank', NEW_COST_COLUMN_SUMMARY: 'Avg Cost ($ Cents) - Log Scale'},
|
266 |
+
hover_data=[c for c in ['Model', 'AutoBench', NEW_COST_COLUMN_SUMMARY, 'Avg Answer Duration (sec)'] if c in plot_df.columns]
|
267 |
+
)
|
268 |
+
fig_cost.update_traces(textposition='top center')
|
269 |
+
fig_cost.update_layout(
|
270 |
+
xaxis_title="Avg Cost ($ Cents) - Log Scale",
|
271 |
+
yaxis_title="AutoBench Rank",
|
272 |
+
width=1000,
|
273 |
+
height=800,
|
274 |
+
xaxis2=dict(
|
275 |
+
overlaying='x',
|
276 |
+
matches='x',
|
277 |
+
side='top',
|
278 |
+
showticklabels=True,
|
279 |
+
showline=True,
|
280 |
+
title=None
|
281 |
+
)
|
282 |
+
)
|
283 |
+
return fig_cost, ""
|
284 |
+
|
285 |
+
def create_avg_latency_plot(data: Dict[str, pd.DataFrame]) -> tuple:
|
286 |
+
"""Create the average latency vs rank scatter plot."""
|
287 |
+
df_summary = data.get("summary", pd.DataFrame())
|
288 |
+
|
289 |
+
if df_summary.empty or 'AutoBench' not in df_summary.columns or 'Avg Answer Duration (sec)' not in df_summary.columns:
|
290 |
+
return None, "_(Insufficient data for Rank vs Avg Latency plot)_"
|
291 |
+
|
292 |
+
plot_df = df_summary.dropna(subset=['AutoBench', 'Avg Answer Duration (sec)', 'Model']).copy()
|
293 |
+
plot_df['Avg Answer Duration (sec)'] = pd.to_numeric(plot_df['Avg Answer Duration (sec)'], errors='coerce')
|
294 |
+
plot_df = plot_df.dropna(subset=['Avg Answer Duration (sec)'])
|
295 |
+
|
296 |
+
if plot_df.empty:
|
297 |
+
return None, "_(No valid data for Rank vs Avg Latency plot)_"
|
298 |
+
|
299 |
+
fig_latency = px.scatter(
|
300 |
+
plot_df,
|
301 |
+
x="Avg Answer Duration (sec)",
|
302 |
+
y="AutoBench",
|
303 |
+
text="Model",
|
304 |
+
log_x=True,
|
305 |
+
title="AutoBench Rank vs. Average Latency (Log Scale)",
|
306 |
+
labels={'AutoBench': 'AutoBench Rank', 'Avg Answer Duration (sec)': 'Avg Latency (s) - Log Scale'},
|
307 |
+
hover_data=[c for c in ['Model', 'AutoBench', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY] if c in plot_df.columns]
|
308 |
+
)
|
309 |
+
fig_latency.update_traces(textposition='top center')
|
310 |
+
fig_latency.update_layout(
|
311 |
+
xaxis_title="Avg Latency (s) - Log Scale",
|
312 |
+
yaxis_title="AutoBench Rank",
|
313 |
+
width=1000,
|
314 |
+
height=800
|
315 |
+
)
|
316 |
+
return fig_latency, ""
|
317 |
+
|
318 |
+
def create_p99_latency_plot(data: Dict[str, pd.DataFrame]) -> tuple:
|
319 |
+
"""Create the P99 latency vs rank scatter plot."""
|
320 |
+
df_summary = data.get("summary", pd.DataFrame())
|
321 |
+
|
322 |
+
if df_summary.empty or 'AutoBench' not in df_summary.columns or 'P99 Answer Duration (sec)' not in df_summary.columns:
|
323 |
+
return None, "_(Insufficient data for Rank vs P99 Latency plot)_"
|
324 |
+
|
325 |
+
plot_df = df_summary.dropna(subset=['AutoBench', 'P99 Answer Duration (sec)', 'Model']).copy()
|
326 |
+
plot_df['P99 Answer Duration (sec)'] = pd.to_numeric(plot_df['P99 Answer Duration (sec)'], errors='coerce')
|
327 |
+
plot_df = plot_df.dropna(subset=['P99 Answer Duration (sec)'])
|
328 |
+
|
329 |
+
if plot_df.empty:
|
330 |
+
return None, "_(No valid data for Rank vs P99 Latency plot)_"
|
331 |
+
|
332 |
+
fig_p99 = px.scatter(
|
333 |
+
plot_df,
|
334 |
+
x="P99 Answer Duration (sec)",
|
335 |
+
y="AutoBench",
|
336 |
+
text="Model",
|
337 |
+
log_x=True,
|
338 |
+
title="AutoBench Rank vs. P99 Latency (Log Scale)",
|
339 |
+
labels={'AutoBench': 'AutoBench Rank', 'P99 Answer Duration (sec)': 'P99 Latency (s) - Log Scale'},
|
340 |
+
hover_data=[c for c in ['Model', 'AutoBench', 'P99 Answer Duration (sec)', 'Avg Answer Duration (sec)', NEW_COST_COLUMN_SUMMARY] if c in plot_df.columns]
|
341 |
+
)
|
342 |
+
fig_p99.update_traces(textposition='top center')
|
343 |
+
fig_p99.update_layout(
|
344 |
+
xaxis_title="P99 Latency (s) - Log Scale",
|
345 |
+
yaxis_title="AutoBench Rank",
|
346 |
+
width=1000,
|
347 |
+
height=800
|
348 |
+
)
|
349 |
+
return fig_p99, ""
|
350 |
+
|
351 |
+
def update_leaderboard_data(selected_run_id: str) -> tuple:
|
352 |
+
"""Update all leaderboard components when run selection changes."""
|
353 |
+
if not selected_run_id:
|
354 |
+
# Return empty/default values for all outputs
|
355 |
+
empty_df = pd.DataFrame()
|
356 |
+
return (
|
357 |
+
empty_df, empty_df, empty_df, empty_df, empty_df, empty_df, # DataFrames
|
358 |
+
None, "", None, "", None, "", # Plots and messages
|
359 |
+
"No run selected", "" # Info message, correlations text
|
360 |
+
)
|
361 |
+
|
362 |
+
# Load data for selected run
|
363 |
+
data = load_run_data(selected_run_id)
|
364 |
+
run_metadata = load_run_metadata(selected_run_id)
|
365 |
+
|
366 |
+
if not data:
|
367 |
+
empty_df = pd.DataFrame()
|
368 |
+
return (
|
369 |
+
empty_df, empty_df, empty_df, empty_df, empty_df, empty_df,
|
370 |
+
None, "Error loading data", None, "Error loading data", None, "Error loading data",
|
371 |
+
f"Error loading run: {selected_run_id}", ""
|
372 |
+
)
|
373 |
+
|
374 |
+
# Get DataFrames
|
375 |
+
summary_display = data.get("summary_display", pd.DataFrame())
|
376 |
+
benchmark_display = data.get("benchmark_display", pd.DataFrame())
|
377 |
+
cost_df = data.get("cost", pd.DataFrame())
|
378 |
+
avg_latency_df = data.get("avg_latency", pd.DataFrame())
|
379 |
+
p99_latency_df = data.get("p99_latency", pd.DataFrame())
|
380 |
+
domain_df = data.get("domain", pd.DataFrame())
|
381 |
+
|
382 |
+
# Create rank display (rename AutoBench to Rank for overall ranking tab)
|
383 |
+
overall_rank_display = summary_display.copy()
|
384 |
+
if 'AutoBench' in overall_rank_display.columns:
|
385 |
+
overall_rank_display.rename(columns={'AutoBench': 'Rank'}, inplace=True)
|
386 |
+
|
387 |
+
# Prepare cost and latency displays with model_name first
|
388 |
+
def prepare_table_display(df, model_col='model_name'):
|
389 |
+
if df.empty:
|
390 |
+
return df
|
391 |
+
if model_col in df.columns:
|
392 |
+
cols = [model_col] + [col for col in df.columns if col != model_col]
|
393 |
+
return df[cols]
|
394 |
+
return df
|
395 |
+
|
396 |
+
cost_display = prepare_table_display(cost_df)
|
397 |
+
avg_latency_display = prepare_table_display(avg_latency_df)
|
398 |
+
p99_latency_display = prepare_table_display(p99_latency_df)
|
399 |
+
|
400 |
+
# Prepare domain display
|
401 |
+
domain_display = domain_df.copy()
|
402 |
+
if 'Model Name' in domain_display.columns:
|
403 |
+
cols = ['Model Name'] + [col for col in domain_display.columns if col != 'Model Name']
|
404 |
+
domain_display = domain_display[cols]
|
405 |
+
|
406 |
+
# Create plots
|
407 |
+
cost_plot, cost_msg = create_cost_scatter_plot(data)
|
408 |
+
avg_latency_plot, avg_latency_msg = create_avg_latency_plot(data)
|
409 |
+
p99_latency_plot, p99_latency_msg = create_p99_latency_plot(data)
|
410 |
+
|
411 |
+
# Create info message
|
412 |
+
info_msg = f"**Current Run:** {run_metadata.get('title', 'Unknown')} ({run_metadata.get('date', 'Unknown date')})"
|
413 |
+
if 'model_count' in run_metadata:
|
414 |
+
info_msg += f" - {run_metadata['model_count']} models"
|
415 |
+
|
416 |
+
# Get correlation text
|
417 |
+
correlations_text = format_correlations_text(data.get("correlations", {}))
|
418 |
+
|
419 |
+
return (
|
420 |
+
overall_rank_display, benchmark_display, cost_display, avg_latency_display, p99_latency_display, domain_display,
|
421 |
+
cost_plot, cost_msg, avg_latency_plot, avg_latency_msg, p99_latency_plot, p99_latency_msg,
|
422 |
+
info_msg, correlations_text
|
423 |
+
)
|
424 |
|
425 |
# --- Build Gradio App ---
|
426 |
with gr.Blocks(theme=gr.themes.Soft()) as app:
|
427 |
gr.Markdown("# AutoBench LLM Leaderboard")
|
428 |
gr.Markdown(
|
429 |
"Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. "
|
430 |
+
"Includes performance, cost, and latency metrics. "
|
431 |
+
"Use the dropdown below to navigate between different benchmark runs."
|
432 |
+
)
|
433 |
+
|
434 |
+
# --- Navigation Section ---
|
435 |
+
with gr.Row():
|
436 |
+
with gr.Column(scale=3):
|
437 |
+
# Create dropdown choices
|
438 |
+
run_choices = [(f"{run['date']} - {run['title']}", run['run_id']) for run in available_runs]
|
439 |
+
run_selector = gr.Dropdown(
|
440 |
+
choices=run_choices,
|
441 |
+
value=latest_run["run_id"],
|
442 |
+
label="📊 Select AutoBench Run",
|
443 |
+
info="Choose a benchmark run to view its results"
|
444 |
+
)
|
445 |
+
with gr.Column(scale=2):
|
446 |
+
current_run_info = gr.Markdown(
|
447 |
+
f"**Current Run:** {latest_run['title']} ({latest_run['date']})" +
|
448 |
+
(f" - {latest_run['model_count']} models" if 'model_count' in latest_run else "")
|
449 |
+
)
|
450 |
+
|
451 |
+
gr.Markdown("---")
|
452 |
|
453 |
# --- Tab 1: Overall Ranking ---
|
454 |
with gr.Tab("Overall Ranking"):
|
455 |
gr.Markdown("## Overall Model Performance")
|
456 |
+
gr.Markdown("Models ranked by AutoBench score. Lower cost ($ Cents), latency (s), and fail rate (%) are better. Iterations shows the number of evaluations per model.")
|
457 |
+
|
458 |
+
# Add correlations display
|
459 |
+
initial_correlations = format_correlations_text(current_data.get("correlations", {}))
|
460 |
+
correlations_display = gr.Markdown(value=initial_correlations)
|
461 |
+
|
462 |
+
overall_ranking_table = gr.DataFrame(
|
463 |
+
current_data.get("summary_display", pd.DataFrame()).rename(columns={'AutoBench': 'Rank'}),
|
464 |
+
interactive=True,
|
465 |
+
label="Overall Rankings"
|
466 |
+
)
|
467 |
|
468 |
+
# --- Tab 2: Benchmark Comparison ---
|
469 |
with gr.Tab("Benchmark Comparison"):
|
470 |
gr.Markdown("## Benchmark Comparison")
|
471 |
gr.Markdown("Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.")
|
472 |
+
|
473 |
+
benchmark_comparison_table = gr.DataFrame(
|
474 |
+
current_data.get("benchmark_display", pd.DataFrame()),
|
475 |
+
interactive=True,
|
476 |
+
label="Benchmark Comparison"
|
477 |
+
)
|
478 |
|
479 |
+
# --- Tab 3: Performance Plots ---
|
480 |
with gr.Tab("Performance Plots"):
|
481 |
gr.Markdown("## Performance Visualizations")
|
482 |
gr.Markdown("Exploring relationships between AutoBench Rank, Latency, and Cost.")
|
483 |
|
484 |
+
# Scatter Plot 1: Cost vs Rank
|
485 |
gr.Markdown("### Rank vs. Average Cost")
|
486 |
+
initial_cost_plot, initial_cost_msg = create_cost_scatter_plot(current_data)
|
487 |
+
cost_plot = gr.Plot(value=initial_cost_plot)
|
488 |
+
cost_plot_msg = gr.Markdown(value=initial_cost_msg)
|
489 |
|
490 |
# Plot 2: Rank vs Average Latency
|
491 |
gr.Markdown("### Rank vs. Average Latency")
|
492 |
+
initial_avg_latency_plot, initial_avg_latency_msg = create_avg_latency_plot(current_data)
|
493 |
+
avg_latency_plot = gr.Plot(value=initial_avg_latency_plot)
|
494 |
+
avg_latency_plot_msg = gr.Markdown(value=initial_avg_latency_msg)
|
495 |
|
496 |
# Plot 3: Rank vs P99 Latency
|
497 |
gr.Markdown("### Rank vs. P99 Latency")
|
498 |
+
initial_p99_latency_plot, initial_p99_latency_msg = create_p99_latency_plot(current_data)
|
499 |
+
p99_latency_plot = gr.Plot(value=initial_p99_latency_plot)
|
500 |
+
p99_latency_plot_msg = gr.Markdown(value=initial_p99_latency_msg)
|
501 |
|
502 |
+
# --- Tab 4: Cost & Latency Analysis ---
|
503 |
with gr.Tab("Cost & Latency Analysis"):
|
504 |
gr.Markdown("## Performance vs. Cost/Latency Trade-offs")
|
505 |
|
506 |
# Cost Breakdown Table
|
507 |
+
gr.Markdown("### Cost Breakdown per Domain ($ Cents/Response)")
|
508 |
+
cost_df = current_data.get("cost", pd.DataFrame())
|
509 |
+
if not cost_df.empty and 'model_name' in cost_df.columns:
|
510 |
+
cols = ['model_name'] + [col for col in cost_df.columns if col != 'model_name']
|
511 |
+
initial_cost_display = cost_df[cols]
|
512 |
else:
|
513 |
+
initial_cost_display = cost_df
|
514 |
+
cost_breakdown_table = gr.DataFrame(
|
515 |
+
value=initial_cost_display,
|
516 |
+
interactive=True,
|
517 |
+
label="Cost Breakdown"
|
518 |
+
)
|
519 |
|
520 |
# Latency Breakdown Tables
|
521 |
gr.Markdown("### Average Latency Breakdown per Domain (Seconds)")
|
522 |
+
avg_latency_df = current_data.get("avg_latency", pd.DataFrame())
|
523 |
+
if not avg_latency_df.empty and 'model_name' in avg_latency_df.columns:
|
524 |
+
cols = ['model_name'] + [col for col in avg_latency_df.columns if col != 'model_name']
|
525 |
+
initial_avg_latency_display = avg_latency_df[cols]
|
526 |
else:
|
527 |
+
initial_avg_latency_display = avg_latency_df
|
528 |
+
avg_latency_breakdown_table = gr.DataFrame(
|
529 |
+
value=initial_avg_latency_display,
|
530 |
+
interactive=True,
|
531 |
+
label="Average Latency Breakdown"
|
532 |
+
)
|
533 |
|
534 |
gr.Markdown("### P99 Latency Breakdown per Domain (Seconds)")
|
535 |
+
p99_latency_df = current_data.get("p99_latency", pd.DataFrame())
|
536 |
+
if not p99_latency_df.empty and 'model_name' in p99_latency_df.columns:
|
537 |
+
cols = ['model_name'] + [col for col in p99_latency_df.columns if col != 'model_name']
|
538 |
+
initial_p99_latency_display = p99_latency_df[cols]
|
539 |
else:
|
540 |
+
initial_p99_latency_display = p99_latency_df
|
541 |
+
p99_latency_breakdown_table = gr.DataFrame(
|
542 |
+
value=initial_p99_latency_display,
|
543 |
+
interactive=True,
|
544 |
+
label="P99 Latency Breakdown"
|
545 |
+
)
|
546 |
|
547 |
|
548 |
+
# --- Tab 5: Domain Performance ---
|
549 |
with gr.Tab("Domain Performance"):
|
550 |
gr.Markdown("## Performance Across Different Domains")
|
551 |
gr.Markdown("Model ranks within specific knowledge or task areas. Higher is better.")
|
552 |
+
|
553 |
+
domain_df = current_data.get("domain", pd.DataFrame())
|
554 |
+
if not domain_df.empty and 'Model Name' in domain_df.columns:
|
555 |
+
cols = ['Model Name'] + [col for col in domain_df.columns if col != 'Model Name']
|
556 |
+
initial_domain_display = domain_df[cols]
|
557 |
else:
|
558 |
+
initial_domain_display = domain_df
|
559 |
+
domain_performance_table = gr.DataFrame(
|
560 |
+
value=initial_domain_display,
|
561 |
+
interactive=True,
|
562 |
+
label="Domain Performance"
|
563 |
+
)
|
564 |
|
565 |
# --- Tab 6: About ---
|
566 |
with gr.Tab("About AutoBench"):
|
|
|
592 |
|
593 |
**Disclaimer:** Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.
|
594 |
""")
|
595 |
+
|
596 |
+
# --- Event Handlers ---
|
597 |
+
# Set up reactive data loading when run selection changes
|
598 |
+
run_selector.change(
|
599 |
+
fn=update_leaderboard_data,
|
600 |
+
inputs=[run_selector],
|
601 |
+
outputs=[
|
602 |
+
overall_ranking_table,
|
603 |
+
benchmark_comparison_table,
|
604 |
+
cost_breakdown_table,
|
605 |
+
avg_latency_breakdown_table,
|
606 |
+
p99_latency_breakdown_table,
|
607 |
+
domain_performance_table,
|
608 |
+
cost_plot,
|
609 |
+
cost_plot_msg,
|
610 |
+
avg_latency_plot,
|
611 |
+
avg_latency_plot_msg,
|
612 |
+
p99_latency_plot,
|
613 |
+
p99_latency_plot_msg,
|
614 |
+
current_run_info,
|
615 |
+
correlations_display
|
616 |
+
]
|
617 |
+
)
|
618 |
+
|
619 |
+
# Note: Initial data is already loaded via value parameters above
|
620 |
|
621 |
# --- Launch the App ---
|
622 |
print("Launching Gradio app...")
|
623 |
+
app.launch(
|
624 |
+
favicon_path="static/manifest.json" if os.path.exists("static/manifest.json") else None,
|
625 |
+
show_error=True
|
626 |
+
)
|
627 |
print("Gradio app launched.")
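The benchmark-table code in app.py above accepts either the old or the new name for the same external benchmark ('Chatbot Ar.' or 'LMArena', 'MMLU Index' or 'MMLU-Pro'). The lookup can be sketched as a standalone helper. `first_present` and `benchmark_columns` are hypothetical names that do not appear in the commit, and a plain list stands in for `df_summary.columns`.

```python
from typing import Iterable, List, Optional, Sequence


def first_present(columns: Iterable[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate name that appears in columns, else None."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    return None


def benchmark_columns(columns: Sequence[str]) -> List[str]:
    """Build the benchmark table's column order, skipping anything missing."""
    ordered = [
        "Model",
        "AutoBench",
        first_present(columns, ["Chatbot Ar.", "LMArena"]),   # old name, new name
        "AAI Index",
        first_present(columns, ["MMLU Index", "MMLU-Pro"]),   # old name, new name
    ]
    return [c for c in ordered if c is not None and c in columns]


# Column names as they might appear in an older and a newer summary file.
print(benchmark_columns(["Model", "AutoBench", "Chatbot Ar.", "MMLU Index"]))
print(benchmark_columns(["Model", "AutoBench", "LMArena", "AAI Index", "MMLU-Pro"]))
```

One difference from the loop in the commit: if a file ever contained both spellings, this sketch prefers the first name in the candidate list, whereas the original loop keeps whichever matching column comes last in the DataFrame.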
|
{data → runs/run_2025-04-25}/avg_latency.csv
RENAMED
File without changes
|
runs/run_2025-04-25/correlations.json
ADDED
@@ -0,0 +1,8 @@
|
1 |
+
{
|
2 |
+
"correlations": {
|
3 |
+
"Chatbot Arena": 82.51,
|
4 |
+
"Artificial Analysis Intelligence Index": 83.74,
|
5 |
+
"MMLU-Plus": 71.51
|
6 |
+
},
|
7 |
+
"description": "Correlation percentages between AutoBench scores and other benchmark scores"
|
8 |
+
}
|
{data → runs/run_2025-04-25}/cost_data.csv
RENAMED
File without changes
|
{data → runs/run_2025-04-25}/domain_ranks.csv
RENAMED
File without changes
|
runs/run_2025-04-25/metadata.json
ADDED
@@ -0,0 +1,9 @@
|
1 |
+
{
|
2 |
+
"run_id": "run_2025-04-25",
|
3 |
+
"title": "AutoBench Run 2 - April 2025",
|
4 |
+
"date": "2025-04-25",
|
5 |
+
"description": "Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.",
|
6 |
+
"blog_url": "https://huggingface.co/blog/PeterKruger/autobench-2nd-run",
|
7 |
+
"model_count": 27,
|
8 |
+
"is_latest": false
|
9 |
+
}
|
{data → runs/run_2025-04-25}/p99_latency.csv
RENAMED
File without changes
|
{data → runs/run_2025-04-25}/summary_data.csv
RENAMED
File without changes
|
runs/run_2025-08-14/avg_latency.csv
ADDED
@@ -0,0 +1,34 @@
|
1 |
+
model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
|
2 |
+
claude-3.5-haiku,17.2096,9.5021,13.1592,11.2295,9.2722,13.7099,7.9069,9.5013,10.8395,11.2754,11.51902452
|
3 |
+
claude-opus-4-1,97.988,31.7501,44.9092,36.3513,30.166,53.8647,38.4643,32.1546,48.1318,52.2831,48.62490598
|
4 |
+
claude-sonnet-4,75.4533,18.119,34.977,23.5653,18.9687,35.1883,19.9602,24.1877,36.5889,34.5893,33.66639032
|
5 |
+
deepSeek-R1-0528,238.2024,38.1537,62.9681,48.2227,51.7076,61.9942,302.2976,271.5234,62.5406,66.1298,119.174235
|
6 |
+
deepSeek-V3-0324,55.0977,17.8165,32.0701,23.041,24.2812,31.001,88.9007,51.2641,29.3776,43.4225,40.30336432
|
7 |
+
gemini-2.5-flash,95.8841,16.7289,32.1003,18.3471,24.8943,29.1208,133.8567,52.6201,31.5185,36.1925,48.7078753
|
8 |
+
gemini-2.5-flash-lite,26.1777,2.8563,5.6249,3.6215,3.9845,6.8374,86.5258,41.5112,6.4054,9.5453,19.15509939
|
9 |
+
gemini-2.5-pro,83.393,29.2572,49.1594,36.6978,36.9932,47.2089,166.9207,93.4631,49.4694,52.5929,65.03115036
|
10 |
+
gemma-3-27b-it,62.2708,12.4686,27.5139,18.3155,19.2946,25.0176,24.3704,40.4719,35.3548,23.2773,29.7215
|
11 |
+
GLM-4.5,147.0394,20.9164,43.4854,31.8055,38.4103,42.5121,224.8967,165.5218,43.1608,49.7589,80.74437254
|
12 |
+
GLM-4.5-Air,105.9142,13.7172,31.2173,19.5371,45.6936,30.9208,206.3031,140.9465,38.1786,44.156,68.34050587
|
13 |
+
gpt-4.1,58.0164,11.4923,24.717,14.4706,15.9619,23.1579,80.4132,46.8127,21.9879,23.3983,32.86274006
|
14 |
+
gpt-5,120.9672,50.0373,78.8355,55.0956,56.6508,73.3966,156.5955,151.2006,83.2029,76.5455,89.99818067
|
15 |
+
gpt-5-mini,102.5153,25.7197,48.4212,31.5236,35.4217,44.406,159.7168,84.2748,52.1692,56.7156,65.89701176
|
16 |
+
gpt-5-nano,98.4218,38.2174,52.23,38.2242,48.2844,54.4789,136.6573,86.4832,48.3263,53.912,66.4959839
|
17 |
+
gpt-oss-120b,49.9796,11.3648,28.9885,18.2911,14.4024,22.6241,25.4515,36.6005,27.8547,28.8135,27.00733404
|
18 |
+
grok-3-mini,45.7707,12.9635,17.7188,13.6096,20.4385,20.4502,40.5232,32.54,25.8136,23.9,26.12147499
|
19 |
+
grok-4,92.6961,28.4581,49.7663,33.7181,41.4523,48.4394,138.4335,124.3858,48.7183,50.103,60.95525411
|
20 |
+
Kimi-K2-Instruct,69.4739,29.2635,65.4032,45.9564,46.6415,75.2439,114.3645,58.6696,72.5513,57.1711,65.0222057
|
21 |
+
llama-3_1-Nemotron-Ultra-253B-v1,88.5095,28.8905,29.2174,22.7454,32.2121,36.1473,174.681,134.8929,38.2931,32.3295,61.53657957
|
22 |
+
Llama-3_3-Nemotron-Super-49B-v1,46.3574,15.1811,28.7779,22.9057,21.5446,27.2551,67.4261,33.0439,31.8722,25.0215,32.63831081
|
23 |
+
llama-4-maverick,21.5707,4.76,7.8189,5.8389,6.1681,8.3187,21.0604,11.3275,8.7158,6.8338,10.65014104
|
24 |
+
llama-4-Scout-17B-16E-Instruct,20.2177,6.4339,9.1924,7.529,7.9851,9.8133,11.1171,13.0721,12.2545,8.2341,10.86684261
|
25 |
+
magistral-small-2506,11.4551,7.1952,7.5178,5.7532,6.1051,8.6988,79.2722,37.9617,7.0901,7.2786,17.53939687
|
26 |
+
mistral-large-2411,51.7739,14.2815,23.6025,17.3517,13.3736,25.8432,18.1355,24.6234,25.1484,21.9005,24.36368715
|
27 |
+
nova-lite-v1,7.1014,4.7882,5.846,4.7061,4.3402,5.5093,4.8806,4.861,4.9134,5.3275,5.288625128
|
28 |
+
nova-pro-v1,12.4833,7.5838,7.52,5.6658,6.5254,7.3418,5.8645,7.2712,6.6838,6.7792,7.528192069
|
29 |
+
o3,70.4202,25.9427,46.4619,29.613,26.5293,42.6644,194.8085,112.9362,41.2548,46.7826,63.89621339
|
30 |
+
o4-mini,56.98,16.3976,26.7274,19.8134,21.7084,23.2641,116.3436,41.5349,28.6513,26.4233,39.05469579
|
31 |
+
phi-4,10.8373,5.9498,6.7808,6.3085,5.9981,7.1457,7.7569,12.1669,7.0431,7.4096,7.744667446
|
32 |
+
Qwen3-14B,67.7342,19.9239,31.3204,32.178,32.2363,31.2024,197.4205,132.5492,40.1656,31.5221,61.11544056
|
33 |
+
Qwen3-235B-A22B-Thinking-2507,180.1429,33.7386,65.2237,45.3004,54.6603,53.7109,122.6611,138.2941,60.427,72.9058,78.79346155
|
34 |
+
Qwen3-30B-A3B,119.9895,27.907,34.7837,25.8461,38.8109,37.0577,204.2344,157.6709,38.1969,41.6341,72.64171253
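To illustrate how the per-domain latency file above can be read, here is a minimal sketch that finds each model's slowest domain using only the standard library. The three sample rows are copied from avg_latency.csv above. `slowest_domain` is a hypothetical helper, not part of app.py, which loads these files with pandas instead.

```python
import csv
import io

# Header and three rows copied verbatim from runs/run_2025-08-14/avg_latency.csv.
SAMPLE = """model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
claude-3.5-haiku,17.2096,9.5021,13.1592,11.2295,9.2722,13.7099,7.9069,9.5013,10.8395,11.2754,11.51902452
gemini-2.5-flash-lite,26.1777,2.8563,5.6249,3.6215,3.9845,6.8374,86.5258,41.5112,6.4054,9.5453,19.15509939
nova-lite-v1,7.1014,4.7882,5.846,4.7061,4.3402,5.5093,4.8806,4.861,4.9134,5.3275,5.288625128
"""


def slowest_domain(csv_text, summary_col="Average (All Topics)"):
    """Map each model to (domain, seconds) for its highest average latency."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        model = row.pop("model_name")
        row.pop(summary_col, None)  # the summary column is not a domain
        domain = max(row, key=lambda d: float(row[d]))
        result[model] = (domain, float(row[domain]))
    return result


for model, (domain, seconds) in slowest_domain(SAMPLE).items():
    print(f"{model}: slowest in {domain} ({seconds:.1f} s)")
```

The summary column is dropped before taking the maximum so that it is never reported as a domain.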
|
runs/run_2025-08-14/correlations.json
ADDED
@@ -0,0 +1,8 @@
|
1 |
+
{
|
2 |
+
"correlations": {
|
3 |
+
"LMArena": 86.85,
|
4 |
+
"Artificial Analysis Intelligence Index": 92.17,
|
5 |
+
"MMLU": 75.44
|
6 |
+
},
|
7 |
+
"description": "Correlation percentages between AutoBench scores and other benchmark scores"
|
8 |
+
}
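app.py passes each run's correlations to `format_correlations_text`, whose body is not shown in this part of the diff. The sketch below is one plausible way to turn a correlations.json payload like the one above into a Markdown line. `format_correlations_sketch` and its output wording are assumptions, not the commit's implementation.

```python
import json


def format_correlations_sketch(payload):
    """Render correlation percentages as one Markdown line ('' if none)."""
    if not isinstance(payload, dict):
        return ""
    # Accept either the whole file or just its inner "correlations" mapping.
    correlations = payload.get("correlations", payload)
    if not isinstance(correlations, dict):
        return ""
    parts = [
        f"{value:.2f}% with {name}"
        for name, value in correlations.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    ]
    return "**Correlations:** " + ", ".join(parts) if parts else ""


# Payload copied from runs/run_2025-08-14/correlations.json above.
RUN3 = json.loads("""
{
  "correlations": {
    "LMArena": 86.85,
    "Artificial Analysis Intelligence Index": 92.17,
    "MMLU": 75.44
  },
  "description": "Correlation percentages between AutoBench scores and other benchmark scores"
}
""")

print(format_correlations_sketch(RUN3))
```

Reading the values from the per-run file keeps the displayed figures in step with the selected run, which a hard-coded sentence cannot do.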
|
runs/run_2025-08-14/cost_data.csv
ADDED
@@ -0,0 +1,34 @@
|
1 |
+
model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
|
2 |
+
claude-3.5-haiku,0.01433364,0.00596023,0.00783403,0.00687656,0.00700846,0.00828981,0.00670457,0.0086506,0.00762436,0.00770914,0.008262832
|
3 |
+
claude-opus-4-1,0.1854432,0.058055,0.07953592,0.07093359,0.06338077,0.091515,0.07756427,0.08967857,0.08113929,0.08624571,0.091256434
|
4 |
+
claude-sonnet-4,0.03735534,0.00986831,0.01471113,0.01164281,0.01133654,0.01593981,0.01523071,0.0181211,0.01523195,0.01548664,0.017099466
|
5 |
+
deepSeek-R1-0528,0.01314631,0.00240864,0.00361531,0.0028925,0.00321982,0.00348657,0.01360484,0.01570098,0.00335462,0.00345558,0.006382309
|
6 |
+
deepSeek-V3-0324,0.00179803,0.00076659,0.0010843,0.00076717,0.0009645,0.00102303,0.00176716,0.00161885,0.00101519,0.00100663,0.00119639
|
7 |
+
gemini-2.5-flash,0.01009034,0.00149757,0.00386032,0.00235404,0.00327702,0.00373479,0.00399646,0.00651437,0.0042002,0.00425649,0.004512314
|
8 |
+
gemini-2.5-flash-lite,0.00162929,0.00022925,0.00043297,0.0002676,0.00036001,0.00074135,0.00284197,0.00300094,0.00046755,0.00076045,0.001052718
|
9 |
+
gemini-2.5-pro,0.0277487,0.0072651,0.01509773,0.01122996,0.01217529,0.01489145,0.01518476,0.02214276,0.0161067,0.01515628,0.015866994
|
10 |
+
gemma-3-27b-it,0.00044567,0.00017776,0.00028088,0.00018688,0.00021991,0.00027132,0.00025736,0.0004104,0.00026623,0.00026777,0.00028134
|
11 |
+
GLM-4.5,0.01215121,0.00224537,0.00373941,0.00300118,0.00374437,0.0037643,0.01209785,0.01441371,0.00379349,0.00407664,0.00629521
|
12 |
+
GLM-4.5-Air,0.0064693,0.00115415,0.00196991,0.0013692,0.00310372,0.00188536,0.00667506,0.00810732,0.00243774,0.00270236,0.003611243
|
13 |
+
gpt-4.1,0.0127812,0.00475815,0.00799795,0.00526612,0.00723564,0.00747097,0.01396449,0.01694627,0.007576,0.0074939,0.009144648
|
14 |
+
gpt-5,0.06007105,0.02747785,0.03786611,0.03067602,0.03328859,0.03684483,0.0619945,0.07589079,0.03760841,0.03823268,0.043676351
|
15 |
+
gpt-5-mini,0.00837399,0.00360921,0.00562431,0.00423233,0.00503336,0.00521421,0.00885311,0.01098604,0.00557968,0.00584673,0.006324841
|
16 |
+
gpt-5-nano,0.00322071,0.00188066,0.00179711,0.00158044,0.00258222,0.00217255,0.00350634,0.00381871,0.00171226,0.00193456,0.002414388
|
17 |
+
gpt-oss-120b,0.00204009,0.00075235,0.0013959,0.00103317,0.00088107,0.00124919,0.00158693,0.00206557,0.00120019,0.00136573,0.001361942
|
18 |
+
grok-3-mini,0.00135005,0.00053652,0.000744,0.00052427,0.00074544,0.00073632,0.00146137,0.00130681,0.00070784,0.00073552,0.000895661
|
19 |
+
grok-4,0.0510268,0.01488985,0.02296405,0.01615106,0.02757851,0.02201092,0.05122846,0.05969514,0.02105693,0.02263429,0.029202342
|
20 |
+
Kimi-K2-Instruct,0.0028794,0.00174574,0.00235794,0.00221802,0.00168557,0.00288834,0.00275612,0.00216027,0.00241402,0.00230324,0.002379296
|
21 |
+
llama-3_1-Nemotron-Ultra-253B-v1,0.00518958,0.00210383,0.00166994,0.00144186,0.00232614,0.00206072,0.00771408,0.0084011,0.00209152,0.00188066,0.003451924
|
22 |
+
Llama-3_3-Nemotron-Super-49B-v1,0.00061338,0.00029891,0.0003921,0.00035633,0.00037593,0.00042511,0.00067755,0.00059406,0.00041489,0.00037202,0.000455359
|
23 |
+
llama-4-maverick,0.00073583,0.0003091,0.00044116,0.00034576,0.0004487,0.00045704,0.00067601,0.00068581,0.00041993,0.00040759,0.000496574
|
24 |
+
llama-4-Scout-17B-16E-Instruct,0.00052871,0.00029487,0.00041163,0.00037054,0.00037521,0.00042884,0.00043854,0.00049328,0.00037896,0.00036888,0.000410484
|
25 |
+
magistral-small-2506,0.00209764,0.00099426,0.00100711,0.00073844,0.00105969,0.00101964,0.00613572,0.00519953,0.0010145,0.00099893,0.0019781
|
26 |
+
mistral-large-2411,0.00963771,0.004614,0.00608889,0.00459562,0.00456308,0.00608816,0.00568067,0.0072976,0.00577418,0.00586219,0.006101138
|
27 |
+
nova-lite-v1,0.00028518,0.00013034,0.0001725,0.00012972,0.00014658,0.00016856,0.00020718,0.0002494,0.00016274,0.00015855,0.00018322
|
28 |
+
nova-pro-v1,0.00286661,0.00146443,0.00169594,0.00126732,0.00182435,0.00161185,0.00167262,0.00247683,0.00151393,0.00145992,0.001800498
|
29 |
+
o3,0.01830136,0.00943369,0.01515805,0.01014862,0.01051015,0.0133993,0.03758871,0.05157964,0.01258745,0.01355057,0.018504711
|
30 |
+
o4-mini,0.01033712,0.00629079,0.00714936,0.00606561,0.00734834,0.00687444,0.01597855,0.01248218,0.0066754,0.00744339,0.008704364
|
31 |
+
phi-4,0.00034647,0.00019431,0.00021457,0.00017458,0.00019261,0.00022752,0.00024432,0.00038072,0.00021456,0.00020954,0.000240436
|
32 |
+
Qwen3-14B,0.00098626,0.00036762,0.00045077,0.00043898,0.00056254,0.00047318,0.0018592,0.00189101,0.0005008,0.00047842,0.000789184
|
33 |
+
Qwen3-235B-A22B-Thinking-2507,0.00447462,0.00278202,0.00439546,0.003571,0.0041131,0.00377739,0.00673892,0.00660976,0.00369867,0.00371842,0.00416518
|
34 |
+
Qwen3-30B-A3B,0.00120847,0.00039109,0.00046062,0.00038347,0.00054571,0.00046903,0.00150189,0.00185211,0.00046031,0.0004507,0.000763337
|
runs/run_2025-08-14/domain_ranks.csv
ADDED
@@ -0,0 +1,34 @@
|
1 |
+
model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
|
2 |
+
claude-3.5-haiku,3.4733,3.8572,3.7443,3.995,3.7371,3.8701,2.8162,2.7809,3.736,3.8134,3.586292962
|
3 |
+
claude-opus-4-1,4.2931,4.5071,4.3035,4.4302,4.2258,4.441,3.5738,3.5758,4.4164,4.4833,4.239909895
|
4 |
+
claude-sonnet-4,4.1894,4.3647,4.3026,4.3258,4.2532,4.3497,3.5475,3.4817,4.3953,4.3884,4.171968576
|
5 |
+
deepSeek-R1-0528,3.9481,4.3147,4.3493,4.4062,4.3139,4.4007,3.5649,3.6287,4.3876,4.4032,4.18112906
|
6 |
+
deepSeek-V3-0324,3.8756,4.1946,4.0724,4.0561,4.0467,4.0888,3.3667,3.4401,4.1442,4.0976,3.945669087
|
7 |
+
gemini-2.5-flash,4.4225,4.1694,4.3729,4.3287,4.3781,4.4165,4.0091,4.2283,4.4283,4.3877,4.32099389
|
8 |
+
gemini-2.5-flash-lite,4.1092,4.1468,4.0836,4.1655,4.0513,4.1563,3.3399,3.546,4.2419,4.1944,4.017202952
|
9 |
+
gemini-2.5-pro,4.5248,4.3916,4.4224,4.4873,4.4508,4.477,4.0868,4.2425,4.5154,4.516,4.416904571
|
10 |
+
gemma-3-27b-it,3.5655,4.2891,4.112,4.1677,3.973,4.1788,3.0395,3.0841,4.1903,4.1951,3.881640548
|
11 |
+
GLM-4.5,3.8921,4.2566,4.3827,4.4219,4.3206,4.4692,3.4848,3.4666,4.4781,4.4931,4.176558031
|
12 |
+
GLM-4.5-Air,3.8049,3.993,4.1921,4.193,4.0296,4.2921,3.4191,3.3372,4.2467,4.2717,3.98464018
|
13 |
+
gpt-4.1,4.2419,4.3243,4.2551,4.186,4.2089,4.2291,3.7372,3.7882,4.2677,4.3183,4.165890881
|
14 |
+
gpt-5,4.5821,4.5178,4.5866,4.6213,4.359,4.6423,4.2072,4.1718,4.6466,4.6634,4.511567341
|
15 |
+
gpt-5-mini,4.545,4.5442,4.4995,4.5239,4.442,4.5635,4.1788,4.2467,4.6203,4.6302,4.486571107
|
16 |
+
gpt-5-nano,4.4143,4.3848,4.3524,4.403,4.3042,4.4475,3.88,4.134,4.3676,4.5159,4.325926956
|
17 |
+
gpt-oss-120b,4.612,4.4161,4.5248,4.4229,4.4453,4.5703,4.162,4.2461,4.6282,4.634,4.479287977
|
18 |
+
grok-3-mini,4.0184,4.1848,4.1622,4.2055,4.1589,4.2079,3.5142,3.4923,4.2585,4.263,4.055940505
|
19 |
+
grok-4,4.3075,4.3543,4.3302,4.3823,4.3368,4.3983,4.0111,3.8461,4.4058,4.4033,4.308828831
|
20 |
+
Kimi-K2-Instruct,4.119,4.5362,4.2929,4.3457,4.1929,4.5009,3.3958,3.5199,4.4592,4.4092,4.177138663
|
21 |
+
llama-3_1-Nemotron-Ultra-253B-v1,3.7715,4.268,4.1844,4.2204,4.0862,4.2596,3.4729,3.434,4.2285,4.2351,4.020264345
|
22 |
+
Llama-3_3-Nemotron-Super-49B-v1,3.8343,4.0472,4.0449,4.1217,3.9872,4.1624,3.0436,3.2692,4.1175,4.1322,3.883310532
|
23 |
+
llama-4-maverick,3.5884,3.7355,3.7029,3.7833,3.8314,3.8241,3.1303,3.0961,3.8317,3.7879,3.640194992
|
24 |
+
llama-4-Scout-17B-16E-Instruct,3.3725,3.8585,3.6597,3.8316,3.8462,3.8386,3.0535,3.0033,3.8224,3.8368,3.614481399
|
25 |
+
magistral-small-2506,3.7448,3.2301,3.9232,3.8931,3.8409,3.9707,3.2159,3.2791,4.0028,3.941,3.713933337
|
26 |
+
mistral-large-2411,3.4967,3.9724,3.8286,3.9329,3.7992,3.919,3.1123,3.1132,3.9871,3.9484,3.714675671
|
27 |
+
nova-lite-v1,3.322,3.7767,3.7078,3.7683,3.5057,3.7565,2.9917,2.9507,3.8237,3.7503,3.538201832
|
28 |
+
nova-pro-v1,3.3633,3.8403,3.5455,3.723,3.5315,3.633,2.9588,2.8514,3.7492,3.6051,3.490835422
|
29 |
+
o3,4.4254,4.2963,4.5626,4.4951,4.3871,4.5722,3.9576,4.1618,4.579,4.6123,4.409851586
|
30 |
+
o4-mini,4.3056,4.2389,4.3587,4.3787,4.2565,4.3495,3.8986,3.8362,4.4838,4.5318,4.27410734
|
31 |
+
phi-4,3.4825,3.9651,3.7302,3.849,3.6624,3.8171,3.1286,3.1995,3.8704,3.8465,3.657791802
|
32 |
+
Qwen3-14B,3.833,4.2818,4.1127,4.0911,4.0235,4.149,3.4441,3.3376,4.2349,4.1606,3.976245179
|
33 |
+
Qwen3-235B-A22B-Thinking-2507,4.3112,4.4415,4.4798,4.5551,4.4452,4.5117,3.8366,3.9413,4.4376,4.5403,4.394399183
|
34 |
+
Qwen3-30B-A3B,3.8019,4.1652,4.0949,4.1569,3.9359,4.145,3.4857,3.3944,4.1322,4.1468,3.952481327
|
runs/run_2025-08-14/metadata.json
ADDED
@@ -0,0 +1,9 @@
1 |
+
{
|
2 |
+
"run_id": "run_2025-08-14",
|
3 |
+
"title": "AutoBench Run 3 - August 2025",
|
4 |
+
"date": "2025-08-14",
|
5 |
+
"description": "Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates",
|
6 |
+
"blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
|
7 |
+
"model_count": 34,
|
8 |
+
"is_latest": true
|
9 |
+
}
|
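The `is_latest` and `date` fields above suggest how a loader could choose a default run. The following is only a minimal sketch under that assumption: `pick_latest_run` is a hypothetical helper, not a function confirmed to exist in `app.py`, and the JSON is inlined here so the snippet runs on its own.

```python
import json

# Contents of runs/run_2025-08-14/metadata.json as added in this commit.
RUN_METADATA = """
{
  "run_id": "run_2025-08-14",
  "title": "AutoBench Run 3 - August 2025",
  "date": "2025-08-14",
  "description": "Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates",
  "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
  "model_count": 34,
  "is_latest": true
}
"""


def pick_latest_run(records):
    """Hypothetical helper: prefer a record flagged is_latest, else the newest date."""
    flagged = [r for r in records if r.get("is_latest")]
    if flagged:
        return flagged[0]
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    return max(records, key=lambda r: r["date"])


latest = pick_latest_run([json.loads(RUN_METADATA)])
print(latest["run_id"], latest["title"])
```

Falling back to the `date` field keeps the selection well defined if no run, or more than one, carries the `is_latest` flag.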
runs/run_2025-08-14/p99_latency.csv
ADDED
@@ -0,0 +1,34 @@
1 |
+
model_name,coding,creative writing,current news,general culture,grammar,history,logics,math,science,technology,Average (All Topics)
|
2 |
+
GLM-4.5,336.5838,46.1539,115.4247,61.5081,183.8738,94.3667,955.208,354.6674,147.6845,164.993,246.0464
|
3 |
+
GLM-4.5-Air,294.8103,49.5964,75.2865,47.3232,188.0317,122.7531,934.9063,326.1799,192.6793,173.4233,240.499
|
4 |
+
Kimi-K2-Instruct,328.1743,127.0559,498.215,155.9084,411.2174,374.8414,919.1317,342.6293,464.4898,283.0215,390.4685
|
5 |
+
Llama-3_3-Nemotron-Super-49B-v1,215.9947,28.1436,78.589,55.0476,69.6513,64.6465,672.344,82.6381,150.7472,96.6115,151.4413
|
6 |
+
Qwen3-14B,291.1851,58.0622,88.5603,117.7344,88.0704,115.429,952.7636,353.0349,203.2735,124.2363,239.235
|
7 |
+
Qwen3-235B-A22B-Thinking-2507,666.7428,62.6109,184.4234,141.1792,188.1752,130.8262,431.5208,488.4891,231.8124,312.6447,283.8425
|
8 |
+
Qwen3-30B-A3B,302.3182,64.1154,92.9827,77.8548,121.5989,100.3155,973.6302,352.3788,165.0523,180.4958,243.0743
|
9 |
+
claude-3.5-haiku,41.3124,14.9481,33.5752,18.5466,17.7297,36.6532,15.8231,40.0595,17.8959,16.9296,25.3473
|
10 |
+
claude-opus-4-1,411.8235,66.277,99.7769,67.8099,73.13,140.073,240.5076,85.4884,194.4168,172.176,155.1479
|
11 |
+
claude-sonnet-4,372.4893,48.2756,92.4428,50.91,54.3981,124.7965,53.7468,57.4402,181.6048,159.8677,119.5972
|
12 |
+
deepSeek-R1-0528,516.7523,74.2918,114.0859,72.9274,112.0078,127.0512,839.4571,432.4486,182.3988,184.1239,265.5545
|
13 |
+
deepSeek-V3-0324,186.8884,51.2986,88.1105,69.8525,65.7204,121.314,755.2438,202.0879,100.6333,355.9609,199.711
|
14 |
+
gemini-2.5-flash,712.3231,38.3328,103.7981,43.5623,74.993,117.5583,944.4005,122.8572,135.3642,147.9211,244.1111
|
15 |
+
gemini-2.5-flash-lite,190.8247,6.2927,19.3252,12.1569,13.2804,47.8915,602.4712,222.2597,49.1969,110.7355,127.4435
|
16 |
+
gemini-2.5-pro,240.7699,52.3828,97.3472,57.9532,71.3161,111.1321,714.4278,300.57,170.1939,177.3255,199.3419
|
17 |
+
gemma-3-27b-it,375.5933,26.3165,60.8123,46.1448,66.8804,99.0397,180.0912,228.1774,193.2272,68.8631,134.5146
|
18 |
+
gpt-4.1,373.0435,20.9727,103.6544,34.6418,45.1164,68.0267,580.148,268.4173,151.755,161.6432,180.7419
|
19 |
+
gpt-5,379.0229,104.1814,151.0171,109.2785,163.0493,141.3236,655.5064,536.5325,304.2283,232.5815,277.6722
|
20 |
+
gpt-5-mini,420.1856,55.5139,107.6471,65.403,94.1406,96.4831,710.367,304.143,221.4093,238.5054,231.3798
|
21 |
+
gpt-5-nano,452.6803,63.8721,123.3952,95.2713,98.3822,131.3145,649.7221,349.2375,145.5956,209.7373,231.9208
|
22 |
+
gpt-oss-120b,219.9099,40.9979,88.2543,59.7898,50.0398,66.35,154.4168,213.958,154.396,143.3909,119.1503
|
23 |
+
grok-3-mini,324.7266,28.8678,38.3573,27.4405,58.7259,56.7006,303.2076,79.7071,164.6423,78.6302,116.1006
|
24 |
+
grok-4,330.6722,73.9998,112.4368,75.3662,148.4834,118.3984,908.2874,484.3592,205.0356,168.1656,262.5205
|
25 |
+
llama-3_1-Nemotron-Ultra-253B-v1,299.2445,64.7177,80.6145,53.2406,111.7416,114.9641,677.2227,364.4893,179.9651,73.4696,201.967
|
26 |
+
llama-4-Scout-17B-16E-Instruct,119.4443,17.908,21.721,15.137,15.5893,21.4368,21.6394,35.6977,109.5036,18.1442,39.6221
|
27 |
+
llama-4-maverick,258.2904,12.3067,28.3245,14.1619,15.9541,23.4693,237.7464,50.0955,44.6007,26.4254,71.1375
|
28 |
+
magistral-small-2506,50.6671,23.6896,23.0028,17.8342,14.257,27.6318,461.2929,227.3066,22.4139,27.139,89.5235
|
29 |
+
mistral-large-2411,320.7227,28.5833,76.1094,50.5788,34.0307,104.2414,69.1314,52.9922,161.3657,71.1036,96.8859
|
30 |
+
nova-lite-v1,17.362,9.0387,11.6702,10.1896,8.0435,9.778,8.2672,7.7491,9.4719,11.1956,10.2766
|
31 |
+
nova-pro-v1,55.831,13.3815,14.2866,9.7714,24.3369,15.6141,14.2894,23.7601,15.2509,15.0352,20.1557
|
32 |
+
o3,370.6262,215.2039,126.7157,84.4048,96.7106,130.4733,970.1118,427.7559,179.8835,165.4601,276.7346
|
33 |
+
o4-mini,317.1998,49.1399,78.8689,45.2952,67.6997,52.9028,768.0834,246.0076,143.2397,86.9799,185.5417
|
34 |
+
phi-4,28.1176,10.3654,12.3853,13.4812,13.4604,12.3491,14.116,39.9468,13.6159,34.0316,19.1869
|
runs/run_2025-08-14/summary_data.csv
ADDED
@@ -0,0 +1,34 @@
1 |
+
Model,Iterations,AutoBench,LMArena,AAI Index,MMLU-Pro,Costs (USD),Avg Answer Duration (sec),P99 Answer Duration (sec),Fail Rate %
|
2 |
+
claude-3.5-haiku,393,3.586292962,1317,23326,0.634,0.008262832,11.51902452,17.98,4.15%
|
3 |
+
claude-opus-4-1,387,4.239909895,1446,58830,,0.091256434,48.62490598,32.86,5.61%
|
4 |
+
claude-sonnet-4,393,4.171968576,1399,61000,0.842,0.017099466,33.66639032,82.6,4.15%
|
5 |
+
deepSeek-R1-0528,385,4.18112906,1418,58740,0.849,0.006382309,119.174235,223.47,6.10%
|
6 |
+
deepSeek-V3-0324,392,3.945669087,1390,43990,0.819,0.00119639,40.30336432,106.53,4.39%
|
7 |
+
gemini-2.5-flash,387,4.32099389,1409,58430,0.759,0.004512314,48.7078753,140.54,5.61%
|
8 |
+
gemini-2.5-flash-lite,389,4.017202952,1351,44348,0.832,0.001052718,19.15509939,8.82,5.12%
|
9 |
+
gemini-2.5-pro,388,4.416904571,1458,64630,0.862,0.015866994,65.03115036,64.18,5.37%
|
10 |
+
gemma-3-27b-it,393,3.881640548,1363,25220,0.669,0.00028134,29.7215,79.12,4.15%
|
11 |
+
GLM-4.5,389,4.176558031,1414,56080,0.835,0.00629521,80.74437254,29.19,5.12%
|
12 |
+
GLM-4.5-Air,392,3.98464018,1379,49475,0.815,0.003611243,68.34050587,21.75,4.39%
|
13 |
+
gpt-4.1,392,4.165890881,1406,46770,0.806,0.009144648,32.86274006,23.32,4.39%
|
14 |
+
gpt-5,385,4.511567341,1481,68950,0.871,0.043676351,89.99818067,69.79,6.10%
|
15 |
+
gpt-5-mini,392,4.486571107,,63700,0.828,0.006324841,65.89701176,48.74,4.39%
|
16 |
+
gpt-5-nano,390,4.325926956,,53780,0.772,0.002414388,66.4959839,73.7,4.88%
|
17 |
+
gpt-oss-120b,388,4.479287977,1356,61340,0.808,0.001361942,27.00733404,94.45,5.37%
|
18 |
+
grok-3-mini,391,4.055940505,1360,58010,0.828,0.000895661,26.12147499,23.11,4.63%
|
19 |
+
grok-4,360,4.308828831,1430,67520,0.866,0.029202342,60.95525411,13.82,12.20%
|
20 |
+
Kimi-K2-Instruct,325,4.177138663,1420,48560,0.824,0.002379296,65.0222057,96.77,20.73%
|
21 |
+
llama-3_1-Nemotron-Ultra-253B-v1,391,4.020264345,1345,46420,0.825,0.003451924,61.53657957,29.62,4.63%
|
22 |
+
llama-3_3-Nemotron-Super-49B-v1,392,3.883310532,1324,40473,0.698,0.000455359,32.63831081,12.47,4.39%
|
23 |
+
llama-4-maverick,388,3.640194992,1330,41730,0.809,0.000496574,10.65014104,9.93,5.37%
|
24 |
+
llama-4-Scout,393,3.614481399,1318,33060,0.752,0.000410484,10.86684261,23.67,4.15%
|
25 |
+
magistral-small-2506,390,3.713933337,1347,35950,0.746,0.0019781,17.53939687,52.3,4.88%
|
26 |
+
mistral-large-2411,392,3.714675671,1313,27013,0.697,0.006101138,24.36368715,66.7,4.39%
|
27 |
+
nova-lite-v1,393,3.538201832,1262,24540,0.59,0.00018322,5.288625128,21.75,4.15%
|
28 |
+
nova-pro-v1,389,3.490835422,1289,28830,0.691,0.001800498,7.528192069,23.32,5.12%
|
29 |
+
o3,391,4.409851586,1451,67070,0.853,0.018504711,63.89621339,69.79,4.63%
|
30 |
+
o4-mini,393,4.27410734,1398,65050,0.832,0.008704364,39.05469579,48.74,4.15%
|
31 |
+
phi-4,392,3.657791802,1258,27950,0.714,0.000240436,7.744667446,73.7,4.39%
|
32 |
+
Qwen3-14B,392,3.976245179,,45235,0.774,0.000789184,61.11544056,94.45,4.39%
|
33 |
+
Qwen3-235B-A22B-Thinking-2507,331,4.394399183,1401,63590,0.843,0.00416518,78.79346155,23.11,19.27%
|
34 |
+
Qwen3-30B-A3B,390,3.952481327,1380,42340,0.777,0.000763337,72.64171253,13.82,4.88%
|
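Two details of `summary_data.csv` matter to anything that reads it: some cells are empty (for example `LMArena` for `gpt-5-mini` and `MMLU-Pro` for `claude-opus-4-1`), and `Fail Rate %` is stored as a string with a trailing percent sign. The sketch below shows one way to parse such rows with the standard library; `to_float` and `load_summary` are hypothetical names, and how `app.py` actually reads the file is not shown in this part of the diff.

```python
import csv
import io

# Header plus three rows copied from runs/run_2025-08-14/summary_data.csv.
SAMPLE = """Model,Iterations,AutoBench,LMArena,AAI Index,MMLU-Pro,Costs (USD),Avg Answer Duration (sec),P99 Answer Duration (sec),Fail Rate %
gpt-5,385,4.511567341,1481,68950,0.871,0.043676351,89.99818067,69.79,6.10%
gpt-5-mini,392,4.486571107,,63700,0.828,0.006324841,65.89701176,48.74,4.39%
claude-opus-4-1,387,4.239909895,1446,58830,,0.091256434,48.62490598,32.86,5.61%
"""


def to_float(cell):
    """Convert a CSV cell to float; empty cells become None, '4.39%' becomes 4.39."""
    cell = cell.strip()
    if not cell:
        return None
    if cell.endswith("%"):
        return float(cell[:-1])
    return float(cell)


def load_summary(text):
    """Parse summary CSV text into a list of dicts with numeric values."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {"Model": raw["Model"]}
        for key, value in raw.items():
            if key != "Model":
                row[key] = to_float(value)
        rows.append(row)
    return rows


rows = load_summary(SAMPLE)
for row in rows:
    print(row["Model"], row["LMArena"], row["Fail Rate %"])
```

Keeping missing scores as `None` rather than `0` avoids ranking a model last on a benchmark for which it simply has no entry.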
static/manifest.json
ADDED
@@ -0,0 +1,10 @@
1 |
+
{
|
2 |
+
"name": "AutoBench Leaderboard",
|
3 |
+
"short_name": "AutoBench",
|
4 |
+
"description": "Interactive leaderboard for AutoBench LLM evaluations",
|
5 |
+
"start_url": "/",
|
6 |
+
"display": "standalone",
|
7 |
+
"background_color": "#ffffff",
|
8 |
+
"theme_color": "#000000",
|
9 |
+
"icons": []
|
10 |
+
}
|