Commit b10121d by Jae-Won Chung (parent: 07a9e13)
New leaderboard prototype
This view is limited to 50 files because it contains too many changes.
- .github/workflows/push_spaces.yaml +2 -0
- .gitignore +2 -0
- README.md +15 -33
- _config.yml +9 -3
- app.py +735 -253
- benchmark/.gitignore +1 -0
- benchmark/README.md +14 -0
- benchmark/common/download_weights.sh +7 -0
- benchmark/common/start_nvml_container.sh +3 -0
- benchmark/diffusion/image-to-video/.dockerignore +1 -0
- benchmark/diffusion/image-to-video/Dockerfile +20 -0
- benchmark/diffusion/image-to-video/README.md +51 -0
- benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/kwargs.json +4 -0
- benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/revision.txt +1 -0
- benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/kwargs.json +4 -0
- benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/revision.txt +1 -0
- benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/kwargs.json +4 -0
- benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/revision.txt +1 -0
- benchmark/diffusion/image-to-video/pegasus/A100/hosts_1gpu.yaml +11 -0
- benchmark/diffusion/image-to-video/pegasus/A100/queue_1gpu.yaml +6 -0
- benchmark/diffusion/image-to-video/pegasus/H100/hosts_1gpu.yaml +11 -0
- benchmark/diffusion/image-to-video/pegasus/H100/queue_1gpu.yaml +6 -0
- benchmark/diffusion/image-to-video/requirements.txt +7 -0
- benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_data.py +38 -0
- benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_models.py +36 -0
- benchmark/diffusion/image-to-video/scripts/benchmark_one_datapoint.py +300 -0
- benchmark/diffusion/image-to-video/scripts/benchmark_one_model.py +84 -0
- benchmark/diffusion/image-to-video/sharegpt4video/.gitignore +1 -0
- benchmark/diffusion/image-to-video/sharegpt4video/README.md +32 -0
- benchmark/diffusion/image-to-video/sharegpt4video/extract_first_frame.py +21 -0
- benchmark/diffusion/image-to-video/sharegpt4video/sample.py +29 -0
- benchmark/diffusion/image-to-video/sharegpt4video/sharegpt4video_100.json +0 -0
- benchmark/diffusion/text-to-image/.dockerignore +1 -0
- benchmark/diffusion/text-to-image/Dockerfile +20 -0
- benchmark/diffusion/text-to-image/README.md +48 -0
- benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/kwargs.json +3 -0
- benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/revision.txt +1 -0
- benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/kwargs.json +3 -0
- benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/revision.txt +1 -0
- benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/kwargs.json +4 -0
- benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/revision.txt +1 -0
- benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/kwargs.json +3 -0
- benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/revision.txt +1 -0
- benchmark/diffusion/text-to-image/models/segmind/SSD-1B/kwargs.json +4 -0
- benchmark/diffusion/text-to-image/models/segmind/SSD-1B/revision.txt +1 -0
- benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/kwargs.json +4 -0
- benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/revision.txt +1 -0
- benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/kwargs.json +4 -0
- benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/revision.txt +1 -0
- benchmark/diffusion/text-to-image/models/stabilityai/stable-diffusion-2-1/kwargs.json +4 -0
.github/workflows/push_spaces.yaml
CHANGED
@@ -1,6 +1,7 @@
|
|
1 |
name: Deploy
|
2 |
|
3 |
on:
|
|
|
4 |
push:
|
5 |
branches:
|
6 |
- master
|
@@ -34,6 +35,7 @@ jobs:
|
|
34 |
env:
|
35 |
HF_TOKEN: ${{ secrets.HF_TOKEN }}
|
36 |
run: |
|
|
|
37 |
for i in 1 2 3 4 5; do
|
38 |
git push -f https://jaywonchung:$HF_TOKEN@huggingface.co/spaces/ml-energy/leaderboard master:main && break || sleep 5;
|
39 |
done
|
|
|
1 |
name: Deploy
|
2 |
|
3 |
on:
|
4 |
+
workflow_dispatch:
|
5 |
push:
|
6 |
branches:
|
7 |
- master
|
|
|
35 |
env:
|
36 |
HF_TOKEN: ${{ secrets.HF_TOKEN }}
|
37 |
run: |
|
38 |
+
git lfs install
|
39 |
for i in 1 2 3 4 5; do
|
40 |
git push -f https://jaywonchung:$HF_TOKEN@huggingface.co/spaces/ml-energy/leaderboard master:main && break || sleep 5;
|
41 |
done
|
.gitignore
CHANGED
@@ -12,7 +12,9 @@ pyrightconfig.json
|
|
12 |
# Python
|
13 |
*.egg-info
|
14 |
**/__pycache__
|
|
|
15 |
build/
|
|
|
16 |
|
17 |
# Data files
|
18 |
*.log
|
|
|
12 |
# Python
|
13 |
*.egg-info
|
14 |
**/__pycache__
|
15 |
+
**/.ipynb_checkpoints
|
16 |
build/
|
17 |
+
**.ipynb
|
18 |
|
19 |
# Data files
|
20 |
*.log
|
README.md
CHANGED
@@ -15,49 +15,31 @@ tags: ["energy", "leaderboard"]
|
|
15 |
[![Deploy](https://github.com/ml-energy/leaderboard/actions/workflows/push_spaces.yaml/badge.svg?branch=web)](https://github.com/ml-energy/leaderboard/actions/workflows/push_spaces.yaml)
|
16 |
[![Apache-2.0 License](https://custom-icon-badges.herokuapp.com/github/license/ml-energy/leaderboard?logo=law)](/LICENSE)
|
17 |
|
18 |
-
How much energy do LLMs consume?
|
19 |
|
20 |
This README focuses on explaining how to run the benchmark yourself.
|
21 |
The actual leaderboard is here: https://ml.energy/leaderboard.
|
22 |
|
23 |
-
##
|
24 |
-
|
25 |
-
We instrumented [Hugging Face TGI](https://github.com/huggingface/text-generation-inference) so that it measures and returns GPU energy consumption.
|
26 |
-
Then, our [controller](/spitfight/colosseum/controller) server receives user prompts from the [Gradio app](/app.py), selects two models randomly, and streams model responses back with energy consumption.
|
27 |
-
|
28 |
-
## Setup for benchmarking
|
29 |
-
|
30 |
-
### Model weights
|
31 |
-
|
32 |
-
- For models that are directly accessible in Hugging Face Hub, you don't need to do anything.
|
33 |
-
- For other models, convert them to Hugging Face format and put them in `/data/leaderboard/weights/lmsys/vicuna-13B`, for example. The last two path components (e.g., `lmsys/vicuna-13B`) are taken as the name of the model.
|
34 |
-
|
35 |
-
### Docker container
|
36 |
|
37 |
-
|
38 |
-
|
39 |
-
|
40 |
-
|
41 |
-
|
42 |
-
|
43 |
-
|
44 |
-
|
45 |
-
mlenergy/leaderboard:latest bash
|
46 |
```
|
47 |
|
48 |
-
|
49 |
-
If needed, the repository should be mounted to `/workspace/leaderboard` to override the copy of the repository inside the container.
|
50 |
-
|
51 |
-
## Running the benchmark
|
52 |
|
53 |
-
We
|
|
|
54 |
|
55 |
-
|
56 |
|
57 |
-
|
58 |
-
$ docker exec leaderboard0 python scripts/benchmark.py --model-path /data/leaderboard/weights/lmsys/vicuna-13B --input-file sharegpt/sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json
|
59 |
-
$ docker exec leaderboard0 python scripts/benchmark.py --model-path databricks/dolly-v2-12b --input-file sharegpt/sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json
|
60 |
-
```
|
61 |
|
62 |
## Citation
|
63 |
|
|
|
15 |
[![Deploy](https://github.com/ml-energy/leaderboard/actions/workflows/push_spaces.yaml/badge.svg?branch=web)](https://github.com/ml-energy/leaderboard/actions/workflows/push_spaces.yaml)
|
16 |
[![Apache-2.0 License](https://custom-icon-badges.herokuapp.com/github/license/ml-energy/leaderboard?logo=law)](/LICENSE)
|
17 |
|
18 |
+
How much energy do GenAI models like LLMs and Diffusion models consume?
|
19 |
|
20 |
This README focuses on explaining how to run the benchmark yourself.
|
21 |
The actual leaderboard is here: https://ml.energy/leaderboard.
|
22 |
|
23 |
+
## Repository Organization
|
24 |
|
25 |
+
```
|
26 |
+
leaderboard/
|
27 |
+
├── benchmark/ # Benchmark scripts & instructions
|
28 |
+
├── data/ # Benchmark results
|
29 |
+
├── deployment/ # Colosseum deployment files
|
30 |
+
├── spitfight/ # Python package for the Colosseum
|
31 |
+
├── app.py # Leaderboard Gradio app definition
|
32 |
+
└── index.html # Embeds the leaderboard HuggingFace Space
|
|
|
33 |
```
|
34 |
|
35 |
+
## Colosseum
|
|
|
|
|
|
|
36 |
|
37 |
+
We instrumented [Hugging Face TGI](https://github.com/huggingface/text-generation-inference) so that it measures and returns GPU energy consumption.
|
38 |
+
Then, our [controller](/spitfight/colosseum/controller) server receives user prompts from the [Gradio app](/app.py), selects two models randomly, and streams model responses back with energy consumption.
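The flow described above (pick two models at random, then stream both responses side by side) can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: `pick_two_models` and `interleave_streams` are hypothetical names, and the token streams are plain Python iterators standing in for the real streaming responses. Energy reporting is left out.

```python
import itertools
import random
from typing import Iterator


def pick_two_models(available: list[str], rng: random.Random) -> tuple[str, str]:
    """Choose two distinct models uniformly at random (hypothetical helper)."""
    model_a, model_b = rng.sample(available, 2)
    return model_a, model_b


def interleave_streams(
    stream_a: Iterator[str], stream_b: Iterator[str]
) -> Iterator[tuple[str, str]]:
    """Yield the accumulated text of both responses after each step.

    zip_longest keeps going until the longer stream ends; the shorter
    stream contributes None from then on, which is skipped.
    """
    text_a, text_b = "", ""
    for chunk_a, chunk_b in itertools.zip_longest(stream_a, stream_b):
        if chunk_a is not None:
            text_a += chunk_a
        if chunk_b is not None:
            text_b += chunk_b
        yield text_a, text_b
```

Using `itertools.zip_longest` rather than `zip` means the longer response keeps streaming after the shorter one has finished.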
|
39 |
|
40 |
+
## Running the Benchmark
|
41 |
|
42 |
+
We open-sourced the entire benchmark with instructions here: [`./benchmark`](./benchmark)
|
|
|
|
|
|
|
43 |
|
44 |
## Citation
|
45 |
|
_config.yml
CHANGED
@@ -1,6 +1,12 @@
|
|
1 |
exclude:
|
|
|
2 |
- deployment/
|
3 |
-
-
|
4 |
-
-
|
5 |
-
- sharegpt/
|
6 |
- tests/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
exclude:
|
2 |
+
- benchmark/
|
3 |
- deployment/
|
4 |
+
- spitfight/
|
5 |
+
- docs/
|
|
|
6 |
- tests/
|
7 |
+
- .gitignore
|
8 |
+
- app.py
|
9 |
+
- LICENSE
|
10 |
+
- README.md
|
11 |
+
- requirements.txt
|
12 |
+
- setup.py
|
app.py
CHANGED
@@ -1,5 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
from __future__ import annotations
|
2 |
|
|
|
3 |
import copy
|
4 |
import json
|
5 |
import random
|
@@ -9,16 +16,13 @@ import itertools
|
|
9 |
import contextlib
|
10 |
import argparse
|
11 |
import os
|
12 |
-
from
|
|
|
13 |
from dateutil import parser, tz
|
14 |
|
15 |
import numpy as np
|
16 |
import gradio as gr
|
17 |
import pandas as pd
|
18 |
-
import plotly.io as pio
|
19 |
-
import plotly.express as px
|
20 |
-
from pandas.api.types import is_numeric_dtype, is_float_dtype
|
21 |
-
pio.templates.default = "plotly_white"
|
22 |
|
23 |
from spitfight.colosseum.client import ControllerClient
|
24 |
|
@@ -28,8 +32,499 @@ COLOSSUMM_YOUTUBE_DEMO_EMBED_HTML = '<div style="width: 100%; min-width: 400px;"
|
|
28 |
|
29 |
|
30 |
class TableManager:
|
31 |
def __init__(self, data_dir: str) -> None:
|
32 |
-
"""Load leaderboard data from CSV files in data_dir.
|
33 |
|
34 |
Inside `data_dir`, there should be:
|
35 |
- `models.json`: a JSON file containing information about each model.
|
@@ -58,6 +553,7 @@ class TableManager:
|
|
58 |
f'<a style="text-decoration: underline; text-decoration-style: dotted" '
|
59 |
f'target="_blank" href="{url}">{nickname}</a>'
|
60 |
)
|
|
|
61 |
df["model"] = df["model"].apply(format_model_link)
|
62 |
|
63 |
# Sort by our 'energy efficiency' score.
|
@@ -110,63 +606,6 @@ class TableManager:
|
|
110 |
"""Formats into HTML that prints in Monospace font."""
|
111 |
return f"<pre style='font-family: monospace'>{text}</pre>"
|
112 |
|
113 |
-
def add_column(self, column_name: str, formula: str):
|
114 |
-
"""Create and add a new column with the given formula."""
|
115 |
-
# If the user did not provide the name of the new column,
|
116 |
-
# generate a unique name for them.
|
117 |
-
if not column_name:
|
118 |
-
counter = 1
|
119 |
-
while (column_name := f"custom{counter}") in self.full_df.columns:
|
120 |
-
counter += 1
|
121 |
-
|
122 |
-
# If the user did not provide a formula, return an error message.
|
123 |
-
if not formula:
|
124 |
-
return self.cur_df, self._format_msg("Please enter a formula.")
|
125 |
-
|
126 |
-
# If there is an equal sign in the formula, `df.eval` will
|
127 |
-
# return an entire DataFrame with the new column, instead of
|
128 |
-
# just the new column. This is not what we want, so we check
|
129 |
-
# for this case and return an error message.
|
130 |
-
if "=" in formula:
|
131 |
-
return self.cur_df, self._format_msg("Invalid formula: expr cannot contain '='.")
|
132 |
-
|
133 |
-
# The user may want to update an existing column.
|
134 |
-
verb = "Updated" if column_name in self.full_df.columns else "Added"
|
135 |
-
|
136 |
-
# Evaluate the formula and catch any error.
|
137 |
-
try:
|
138 |
-
# Give the users some helper functions that can be used in the formula
|
139 |
-
# like "@sum(response_length)". Also wipe out some global variables.
|
140 |
-
col = self.full_df.eval(
|
141 |
-
formula,
|
142 |
-
local_dict={"sum": sum, "len": len, "max": max, "min": min},
|
143 |
-
global_dict={"global_tbm": None},
|
144 |
-
)
|
145 |
-
except Exception as exc:
|
146 |
-
return self.cur_df, self._format_msg(f"Invalid formula: {exc}")
|
147 |
-
|
148 |
-
# If the result is a numeric scalar, make it a Series.
|
149 |
-
# We may have deleted some models (rows) form the full dataframe when we
|
150 |
-
# called dropna, so we need to query the maximum index instead of taking len.
|
151 |
-
if isinstance(col, (int, float)):
|
152 |
-
col = pd.Series([col] * (self.full_df.index.max() + 1))
|
153 |
-
# We only accept numeric columns.
|
154 |
-
if not is_numeric_dtype(col):
|
155 |
-
return self.cur_df, self._format_msg("Invalid formula: result must be numeric.")
|
156 |
-
# Round if it's floating point.
|
157 |
-
if is_float_dtype(col):
|
158 |
-
col = col.round(2)
|
159 |
-
|
160 |
-
# If the column already exists, update it.
|
161 |
-
if column_name in self.full_df.columns:
|
162 |
-
self.full_df[column_name] = col
|
163 |
-
else:
|
164 |
-
self.full_df.insert(len(self.schema) + 1, column_name, col)
|
165 |
-
|
166 |
-
# If adding a column succeeded, `self.cur_df` should also be updated.
|
167 |
-
self.cur_df = self.full_df.loc[self.cur_index]
|
168 |
-
return self.cur_df, self._format_msg(f"{verb} column '{column_name}'.")
|
169 |
-
|
170 |
def get_dropdown(self):
|
171 |
columns = self.full_df.columns.tolist()[1:]
|
172 |
return [
|
@@ -196,51 +635,40 @@ class TableManager:
|
|
196 |
self.cur_index = index
|
197 |
return self.cur_df
|
198 |
|
199 |
-
def
|
200 |
-
|
201 |
-
|
202 |
-
|
203 |
-
|
204 |
-
|
205 |
-
|
206 |
-
|
207 |
-
|
208 |
-
|
209 |
-
|
210 |
-
|
211 |
-
|
212 |
-
|
213 |
-
|
214 |
-
|
215 |
-
|
216 |
-
|
217 |
-
text = self.cur_df["model"].apply(lambda x: x.split(">")[1].split("<")[0])
|
218 |
-
# Hide model names since they clutter the plots, and only show them on hover.
|
219 |
-
if z is None or z == "None" or z == "":
|
220 |
-
fig = px.scatter(self.cur_df, x=x, y=y, hover_name=text)
|
221 |
-
else:
|
222 |
-
fig = px.scatter_3d(self.cur_df, x=x, y=y, z=z, hover_name=text)
|
223 |
-
fig.update_traces(marker=dict(size=12, line=dict(width=2, color="DarkSlateGrey")))
|
224 |
-
fig.update_layout(width=width, height=height)
|
225 |
|
226 |
-
return fig, width, height, ""
|
227 |
|
228 |
# The global instance of the TableManager should only be used when
|
229 |
# initializing components in the Gradio interface. If the global instance
|
230 |
# is mutated while handling user sessions, the change will be reflected
|
231 |
# in every user session. Instead, the instance provided by gr.State should
|
232 |
# be used.
|
233 |
-
|
234 |
-
|
235 |
-
|
236 |
-
|
237 |
-
|
238 |
-
|
239 |
-
|
240 |
-
|
241 |
-
|
242 |
-
current_datetime = parser.parse(resp.json()["commit"]["author"]["date"])
|
243 |
-
current_date = current_datetime.astimezone(tz.gettz("US/Eastern")).strftime("%Y-%m-%d")
|
244 |
|
245 |
# Custom JS.
|
246 |
# XXX: This is a hack to make the model names clickable.
|
@@ -254,11 +682,14 @@ else:
|
|
254 |
dataframe_update_js = f"""
|
255 |
function format_model_link() {{
|
256 |
// Iterate over the cells of the first column of the leaderboard table.
|
257 |
-
|
258 |
-
|
259 |
-
|
260 |
-
|
261 |
-
);
|
|
|
|
|
|
|
262 |
|
263 |
// If nothing was found, it likely means that now the visible table has less rows
|
264 |
// than the full table. This happens when the user filters the table. In this case,
|
@@ -282,6 +713,7 @@ function format_model_link() {{
|
|
282 |
// Replace the innerHTML of the cell with the interpreted HTML.
|
283 |
cell.replaceChildren(model_anchor);
|
284 |
}}
|
|
|
285 |
|
286 |
// Return all arguments as is.
|
287 |
return arguments
|
@@ -365,25 +797,26 @@ table th:first-child {
|
|
365 |
}
|
366 |
"""
|
367 |
|
368 |
-
intro_text = """
|
369 |
-
<h2>How much energy do modern Large Language Models (LLMs) consume for inference?</h2>
|
370 |
-
|
371 |
-
<p style="font-size: 16px">We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source LLMs in terms of how much time and energy they consume for inference.
|
372 |
-
Time and energy are of course not the only things we care about -- so we also benchmarked all of the models on a variety of NLP datasets,
|
373 |
-
including the ARC Challenge (reasoning), HellaSwag (common sense), and TruthfulQA (truthfulness).</p>
|
374 |
-
|
375 |
-
<p style="font-size: 16px">For more detailed information, please take a look at the <b>About</b> tab.
|
376 |
-
Every benchmark is limited in some sense -- Before you interpret the results, please take a look at the <b>Limitations</b> section there, too.</p>
|
377 |
-
"""
|
378 |
-
|
379 |
# The app will not start without a controller address set.
|
380 |
controller_addr = os.environ.get("COLOSSEUM_CONTROLLER_ADDR")
|
381 |
if controller_addr is None:
|
382 |
COLOSSEUM_UP = False
|
383 |
-
COLOSSEUM_DOWN_MESSAGE = "<br/><h2 style='text-align: center'>
|
384 |
controller_addr = "localhost"
|
385 |
global_controller_client = ControllerClient(controller_addr=controller_addr, timeout=15)
|
386 |
|
387 |
# Load the list of models. To reload, the app should be restarted.
|
388 |
RANDOM_MODEL_NAME = "Random"
|
389 |
RANDOM_USER_PREFERENCE = "Two random models"
|
@@ -392,12 +825,19 @@ model_name_to_user_pref = {model: f"One is {model}" for model in global_availabl
|
|
392 |
model_name_to_user_pref[RANDOM_MODEL_NAME] = RANDOM_USER_PREFERENCE
|
393 |
user_pref_to_model_name = {v: k for k, v in model_name_to_user_pref.items()}
|
394 |
|
|
|
395 |
# Colosseum helper functions.
|
396 |
-
def enable_interact():
|
397 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
398 |
|
399 |
-
def disable_interact():
|
400 |
-
return [gr.update(interactive=False)] * 2
|
401 |
|
402 |
def consumed_less_energy_message(energy_a, energy_b):
|
403 |
"""Return a message that indicates that the user chose the model that consumed less energy.
|
@@ -410,6 +850,7 @@ def consumed_less_energy_message(energy_a, energy_b):
|
|
410 |
how_much = f"{1 / factor:.1f}x" if factor <= 0.5 else f"{100 - factor * 100:.1f}%"
|
411 |
return f"<h2>That response also <span class='green-text'>consumed {how_much} less energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J)!</h2>"
|
412 |
|
|
|
413 |
def consumed_more_energy_message(energy_a, energy_b):
|
414 |
"""Return a message that indicates that the user chose the model that consumed more energy.
|
415 |
|
@@ -421,14 +862,23 @@ def consumed_more_energy_message(energy_a, energy_b):
|
|
421 |
how_much = f"{factor:.1f}x" if factor >= 2.0 else f"{factor * 100 - 100:.1f}%"
|
422 |
return f"<h2>That response <span class='red-text'>consumed {how_much} more energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J).</h2>"
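The two message helpers above switch between a multiple ("2.0x") and a percentage ("25.0%") depending on how far apart the two energy readings are. A minimal standalone sketch of that rule follows. It assumes `factor` is `energy_a / energy_b`, which the visible hunks do not show, and `describe_energy_gap` is a hypothetical name rather than a function in `app.py`.

```python
def describe_energy_gap(energy_a: float, energy_b: float) -> str:
    """Describe how energy_a compares with energy_b in words.

    Assumptions: both energies are positive, and factor is
    energy_a / energy_b (not shown in the diff).
    """
    factor = energy_a / energy_b
    if factor <= 1.0:
        # Big savings read better as a multiple, small ones as a percentage.
        how_much = f"{1 / factor:.1f}x" if factor <= 0.5 else f"{100 - factor * 100:.1f}%"
        return f"consumed {how_much} less energy"
    # Same idea in the other direction: at least 2x becomes a multiple.
    how_much = f"{factor:.1f}x" if factor >= 2.0 else f"{factor * 100 - 100:.1f}%"
    return f"consumed {how_much} more energy"
```

For example, 75 J against 100 J reads as a percentage saving, while 300 J against 100 J reads as a multiple.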
|
423 |
|
|
|
424 |
# Colosseum event handlers
|
425 |
def on_load():
|
426 |
"""Initialize the dataframe, shuffle the model preference dropdown choices."""
|
427 |
-
dataframe =
|
|
|
428 |
available_models = copy.deepcopy(global_available_models)
|
429 |
random.shuffle(available_models)
|
430 |
available_models.insert(0, RANDOM_MODEL_NAME)
|
431 |
-
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
432 |
|
433 |
def add_prompt_disable_submit(prompt, history_a, history_b):
|
434 |
"""Add the user's prompt to the two model's history and disable further submission."""
|
@@ -442,12 +892,17 @@ def add_prompt_disable_submit(prompt, history_a, history_b):
|
|
442 |
client,
|
443 |
]
|
444 |
|
|
|
445 |
def generate_responses(client: ControllerClient, user_preference, history_a, history_b):
|
446 |
"""Generate responses for the two models."""
|
447 |
model_preference = user_pref_to_model_name[user_preference]
|
448 |
for resp_a, resp_b in itertools.zip_longest(
|
449 |
-
client.prompt(
|
450 |
-
|
|
|
|
|
|
|
|
|
451 |
):
|
452 |
if resp_a is not None:
|
453 |
history_a[-1][1] += resp_a
|
@@ -455,8 +910,10 @@ def generate_responses(client: ControllerClient, user_preference, history_a, his
|
|
455 |
history_b[-1][1] += resp_b
|
456 |
yield [history_a, history_b]
|
457 |
|
|
|
458 |
def make_resp_vote_func(victory_index: Literal[0, 1]):
|
459 |
"""Return a function that will be called when the user clicks on response preference vote buttons."""
|
|
|
460 |
def resp_vote_func(client: ControllerClient):
|
461 |
vote_response = client.response_vote(victory_index=victory_index)
|
462 |
model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
|
@@ -491,10 +948,13 @@ def make_resp_vote_func(victory_index: Literal[0, 1]):
|
|
491 |
# Keep the reset button disabled
|
492 |
gr.Button.update(visible=False, interactive=False),
|
493 |
]
|
|
|
494 |
return resp_vote_func
|
495 |
|
|
|
496 |
def make_energy_vote_func(is_worth: bool):
|
497 |
"""Return a function that will be called when the user clicks on energy vote buttons."""
|
|
|
498 |
def energy_vote_func(client: ControllerClient, energy_message: str):
|
499 |
vote_response = client.energy_vote(is_worth=is_worth)
|
500 |
model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
|
@@ -508,8 +968,10 @@ def make_energy_vote_func(is_worth: bool):
|
|
508 |
# Append to the energy comparison message
|
509 |
energy_message[:-5] + (" Fair enough.</h2>" if is_worth else " Wasn't worth it.</h2>"),
|
510 |
]
|
|
|
511 |
return energy_vote_func
|
512 |
|
|
|
513 |
def play_again():
|
514 |
available_models = copy.deepcopy(global_available_models)
|
515 |
random.shuffle(available_models)
|
@@ -524,11 +986,16 @@ def play_again():
|
|
524 |
# Hide energy vote buttons and message
|
525 |
gr.Button.update(visible=False), gr.Button.update(visible=False), gr.Markdown.update(visible=False),
|
526 |
# Enable model preference dropdown and shuffle choices
|
527 |
-
gr.Dropdown.update(
|
|
|
|
|
|
|
|
|
528 |
# Disable reset button
|
529 |
gr.Button.update(interactive=False, visible=False),
|
530 |
]
|
531 |
|
|
|
532 |
focus_prompt_input_js = """
|
533 |
function() {
|
534 |
for (let textarea of document.getElementsByTagName("textarea")) {
|
@@ -541,13 +1008,17 @@ function() {
|
|
541 |
"""
|
542 |
|
543 |
with gr.Blocks(css=custom_css) as block:
|
544 |
-
tbm = gr.State(
|
|
|
|
|
545 |
with gr.Box():
|
546 |
-
gr.HTML(
|
|
|
|
|
547 |
|
548 |
with gr.Tabs():
|
549 |
# Tab: Colosseum.
|
550 |
-
with gr.
|
551 |
if COLOSSEUM_UP:
|
552 |
gr.Markdown(open("docs/colosseum_top.md").read())
|
553 |
else:
|
@@ -587,32 +1058,64 @@ with gr.Blocks(css=custom_css) as block:
|
|
587 |
resp_vote_btn_list: list[gr.component.Component] = []
|
588 |
with gr.Column():
|
589 |
with gr.Row():
|
590 |
-
masked_model_names.append(
|
|
|
|
|
591 |
with gr.Row():
|
592 |
-
chatbots.append(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
593 |
with gr.Row():
|
594 |
-
left_resp_vote_btn = gr.Button(
|
|
|
|
|
595 |
resp_vote_btn_list.append(left_resp_vote_btn)
|
596 |
|
597 |
with gr.Column():
|
598 |
with gr.Row():
|
599 |
-
masked_model_names.append(
|
|
|
|
|
600 |
with gr.Row():
|
601 |
-
chatbots.append(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
602 |
with gr.Row():
|
603 |
-
right_resp_vote_btn = gr.Button(
|
|
|
|
|
604 |
resp_vote_btn_list.append(right_resp_vote_btn)
|
605 |
|
606 |
with gr.Row():
|
607 |
energy_comparison_message = gr.HTML(visible=False)
|
608 |
|
609 |
with gr.Row():
|
610 |
-
worth_energy_vote_btn = gr.Button(
|
611 |
-
|
612 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
613 |
|
614 |
with gr.Row():
|
615 |
-
play_again_btn = gr.Button(
|
|
|
|
|
616 |
|
617 |
gr.Markdown(open("docs/colosseum_bottom.md").read())
|
618 |
|
@@ -622,11 +1125,11 @@ with gr.Blocks(css=custom_css) as block:
|
|
622 |
(prompt_input
|
623 |
.submit(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
|
624 |
.then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
|
625 |
-
.then(enable_interact, None, resp_vote_btn_list, queue=False))
|
626 |
(prompt_submit_btn
|
627 |
.click(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
|
628 |
.then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
|
629 |
-
.then(enable_interact, None, resp_vote_btn_list, queue=False))
|
630 |
|
631 |
left_resp_vote_btn.click(
|
632 |
make_resp_vote_func(victory_index=0),
|
@@ -663,128 +1166,100 @@ with gr.Blocks(css=custom_css) as block:
|
|
663 |
)
|
664 |
.then(None, _js=focus_prompt_input_js, queue=False))
|
665 |
|
666 |
|
667 |
-
|
668 |
-
|
669 |
with gr.Box():
|
670 |
-
gr.HTML(
|
671 |
|
672 |
# Block: Checkboxes to select benchmarking parameters.
|
673 |
with gr.Row():
|
674 |
with gr.Box():
|
675 |
gr.Markdown("### Benchmark results to show")
|
676 |
checkboxes: list[gr.CheckboxGroup] = []
|
677 |
-
for key, choices in
|
678 |
# Specifying `value` makes everything checked by default.
|
679 |
-
checkboxes.append(
|
|
|
|
|
|
|
|
|
680 |
|
681 |
# Block: Leaderboard table.
|
682 |
with gr.Row():
|
683 |
-
dataframe = gr.Dataframe(
|
|
|
|
|
684 |
# Make sure the models have clickable links.
|
685 |
dataframe.change(None, None, None, _js=dataframe_update_js, queue=False)
|
686 |
# Table automatically updates when users check or uncheck any checkbox.
|
687 |
for checkbox in checkboxes:
|
688 |
-
checkbox.change(
|
689 |
-
|
690 |
-
|
691 |
-
|
692 |
-
gr.Markdown("### Add custom columns to the table")
|
693 |
-
with gr.Row():
|
694 |
-
with gr.Column(scale=3):
|
695 |
-
with gr.Row():
|
696 |
-
colname_input = gr.Textbox(lines=1, label="Custom column name")
|
697 |
-
formula_input = gr.Textbox(lines=1, label="Formula (@sum, @len, @max, and @min are supported)")
|
698 |
-
with gr.Column(scale=1):
|
699 |
-
with gr.Row():
|
700 |
-
add_col_btn = gr.Button("Add to table (⏎)", elem_classes=["btn-submit"])
|
701 |
-
with gr.Row():
|
702 |
-
clear_input_btn = gr.Button("Clear")
|
703 |
-
with gr.Row():
|
704 |
-
add_col_message = gr.HTML("")
|
705 |
-
gr.Examples(
|
706 |
-
examples=[
|
707 |
-
["power", "energy / latency"],
|
708 |
-
["token_per_joule", "response_length / energy"],
|
709 |
-
["verbose", "response_length > @sum(response_length) / @len(response_length)"],
|
710 |
-
],
|
711 |
-
inputs=[colname_input, formula_input],
|
712 |
-
)
|
713 |
-
colname_input.submit(
|
714 |
-
TableManager.add_column,
|
715 |
-
inputs=[tbm, colname_input, formula_input],
|
716 |
-
outputs=[dataframe, add_col_message],
|
717 |
-
queue=False,
|
718 |
-
)
|
719 |
-
formula_input.submit(
|
720 |
-
TableManager.add_column,
|
721 |
-
inputs=[tbm, colname_input, formula_input],
|
722 |
-
outputs=[dataframe, add_col_message],
|
723 |
-
queue=False,
|
724 |
-
)
|
725 |
-
add_col_btn.click(
|
726 |
-
TableManager.add_column,
|
727 |
-
inputs=[tbm, colname_input, formula_input],
|
728 |
-
outputs=[dataframe, add_col_message],
|
729 |
-
queue=False,
|
730 |
-
)
|
731 |
-
clear_input_btn.click(
|
732 |
-
lambda: (None, None, None),
|
733 |
-
inputs=None,
|
734 |
-
outputs=[colname_input, formula_input, add_col_message],
|
735 |
-
queue=False,
|
736 |
-
)
|
737 |
-
|
738 |
-
# Block: Allow users to plot 2D and 3D scatter plots.
|
739 |
-
with gr.Box():
|
740 |
-
gr.Markdown("### Scatter plot (Hover over marker to show model name)")
|
741 |
-
with gr.Row():
|
742 |
-
with gr.Column(scale=3):
|
743 |
-
with gr.Row():
|
744 |
-
# Initialize the dropdown choices with the global TableManager with just the original columns.
|
745 |
-
axis_dropdowns = global_tbm.get_dropdown()
|
746 |
-
with gr.Column(scale=1):
|
747 |
-
with gr.Row():
|
748 |
-
plot_btn = gr.Button("Plot", elem_classes=["btn-submit"])
|
749 |
-
with gr.Row():
|
750 |
-
clear_plot_btn = gr.Button("Clear")
|
751 |
-
with gr.Accordion("Plot size (600 x 600 by default)", open=False):
|
752 |
-
with gr.Row():
|
753 |
-
plot_width_input = gr.Textbox("600", lines=1, label="Width (px)")
|
754 |
-
plot_height_input = gr.Textbox("600", lines=1, label="Height (px)")
|
755 |
-
with gr.Row():
|
756 |
-
plot = gr.Plot(value=global_tbm.plot_scatter(
|
757 |
-
plot_width_input.value,
|
758 |
-
plot_height_input.value,
|
759 |
-
x=axis_dropdowns[0].value,
|
760 |
-
y=axis_dropdowns[1].value,
|
761 |
-
z=axis_dropdowns[2].value,
|
762 |
-
)[0]) # type: ignore
|
763 |
-
with gr.Row():
|
764 |
-
plot_message = gr.HTML("")
|
765 |
-
add_col_btn.click(TableManager.update_dropdown, inputs=tbm, outputs=axis_dropdowns, queue=False) # type: ignore
|
766 |
-
plot_width_input.submit(
|
767 |
-
TableManager.plot_scatter,
|
768 |
-
inputs=[tbm, plot_width_input, plot_height_input, *axis_dropdowns],
|
769 |
-
outputs=[plot, plot_width_input, plot_height_input, plot_message],
|
770 |
-
queue=False,
|
771 |
-
)
|
772 |
-
plot_height_input.submit(
|
773 |
-
TableManager.plot_scatter,
|
774 |
-
inputs=[tbm, plot_width_input, plot_height_input, *axis_dropdowns],
|
775 |
-
outputs=[plot, plot_width_input, plot_height_input, plot_message],
|
776 |
-
queue=False,
|
777 |
-
)
|
778 |
-
plot_btn.click(
|
779 |
-
TableManager.plot_scatter,
|
780 |
-
inputs=[tbm, plot_width_input, plot_height_input, *axis_dropdowns],
|
781 |
-
outputs=[plot, plot_width_input, plot_height_input, plot_message],
|
782 |
-
queue=False,
|
783 |
-
)
|
784 |
-
clear_plot_btn.click(
|
785 |
-
lambda: (None,) * 7,
|
786 |
-
None,
|
787 |
-
outputs=[*axis_dropdowns, plot, plot_width_input, plot_height_input, plot_message],
|
788 |
queue=False,
|
789 |
)
|
790 |
|
@@ -794,8 +1269,7 @@ with gr.Blocks(css=custom_css) as block:
|
|
794 |
|
795 |
# Tab: About page.
|
796 |
with gr.Tab("About"):
|
797 |
-
|
798 |
-
gr.Markdown(open("docs/leaderboard.md").read())
|
799 |
|
800 |
# Citation
|
801 |
with gr.Accordion("📚 Citation", open=False, elem_id="citation-header"):
|
@@ -809,13 +1283,21 @@ with gr.Blocks(css=custom_css) as block:
|
|
809 |
)
|
810 |
|
811 |
# Load the table on page load.
|
812 |
-
block.load(
|
813 |
|
814 |
|
815 |
if __name__ == "__main__":
|
816 |
parser = argparse.ArgumentParser()
|
817 |
-
parser.add_argument(
|
|
|
|
|
818 |
parser.add_argument("--concurrency", type=int, default=50)
|
819 |
|
820 |
args = parser.parse_args()
|
821 |
-
block.queue(concurrency_count=args.concurrency, api_open=False).launch(
|
1 |
+
"""Gradio app for the ML.ENERGY leaderboard.
|
2 |
+
|
3 |
+
Everything is in a single file. Search for `gr.Blocks` to find the place
|
4 |
+
where UI elements are actually defined.
|
5 |
+
"""
|
6 |
+
|
7 |
from __future__ import annotations
|
8 |
|
9 |
+
from abc import abstractmethod
|
10 |
import copy
|
11 |
import json
|
12 |
import random
|
|
|
16 |
import contextlib
|
17 |
import argparse
|
18 |
import os
|
19 |
+
from pathlib import Path
|
20 |
+
from typing import Literal, Any
|
21 |
from dateutil import parser, tz
|
22 |
|
23 |
import numpy as np
|
24 |
import gradio as gr
|
25 |
import pandas as pd
|
26 |
|
27 |
from spitfight.colosseum.client import ControllerClient
|
28 |
|
|
|
32 |
|
33 |
|
34 |
class TableManager:
|
35 |
+
"""Manages the data for the leaderboard tables for tasks."""
|
36 |
+
|
37 |
+
def __init__(self, data_dir: str) -> None:
|
38 |
+
"""Load leaderboard data from files in `data_dir`.
|
39 |
+
|
40 |
+
Expected directory structure: `data_dir/gpu_model`.
|
41 |
+
Inside the innermost (GPU) directory, there should be:
|
42 |
+
- `models.json`: JSON file that maps huggingface model IDs to model info.
|
43 |
+
Some models listed in this file may not have benchmark results.
|
44 |
+
- `model_org/model_name/*.json`: JSON files containing the benchmark results.
|
45 |
+
"""
|
46 |
+
self.data_dir = Path(data_dir)
|
47 |
+
|
48 |
+
def __str__(self) -> str:
|
49 |
+
return f"{self.__class__.__name__}(data_dir={self.data_dir})"
|
50 |
+
|
51 |
+
def _wrap_model_name(self, url: str, model_name: str) -> str:
|
52 |
+
"""Wrap the model name in an HTML anchor."""
|
53 |
+
return f'<a style="text-decoration: underline; text-decoration-style: dotted" target="_blank" href="{url}">{model_name}</a>'
|
54 |
+
|
55 |
+
def _unwrap_model_name(self, model_name: str) -> str:
|
56 |
+
"""Unwrap the model name from an HTML anchor."""
|
57 |
+
return model_name.split(">")[1].split("<")[0]
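The two helpers above are inverses of each other for ordinary model names. A minimal standalone sketch of that round trip, assuming a placeholder URL and nickname (both hypothetical, not taken from `models.json`):

```python
def wrap_model_name(url: str, model_name: str) -> str:
    """Wrap a model name in an HTML anchor, mirroring `_wrap_model_name`."""
    return (
        '<a style="text-decoration: underline; text-decoration-style: dotted" '
        f'target="_blank" href="{url}">{model_name}</a>'
    )


def unwrap_model_name(anchor: str) -> str:
    """Recover the text between the first '>' and the following '<'."""
    return anchor.split(">")[1].split("<")[0]


# Hypothetical URL and nickname, for illustration only.
anchor = wrap_model_name("https://example.com/model", "Example-7B")
name = unwrap_model_name(anchor)
```

Because the unwrapping is plain string splitting, a nickname that itself contains `<` or `>` would come back truncated.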
|
58 |
+
|
59 |
+
@abstractmethod
|
60 |
+
def get_tab_name(self) -> str:
|
61 |
+
"""Return the name of the leaderboard."""
|
62 |
+
|
63 |
+
@abstractmethod
|
64 |
+
def get_intro_text(self) -> tuple[str, str]:
|
65 |
+
"""Return the type of the introduction text and the introduction text."""
|
66 |
+
|
67 |
+
@abstractmethod
|
68 |
+
def get_detail_text(self) -> tuple[str, str]:
|
69 |
+
"""Return the type of the detail text and the detail text."""
|
70 |
+
|
71 |
+
def get_benchmark_checkboxes(self) -> dict[str, list[str]]:
|
72 |
+
"""Return data for the benchmark selection checkboxes."""
|
73 |
+
return {}
|
74 |
+
|
75 |
+
def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
|
76 |
+
"""Return data for the benchmark selection sliders.
|
77 |
+
|
78 |
+
Dictionary values are tuples of the form (min, max, step, default).
|
79 |
+
"""
|
80 |
+
return {}
|
81 |
+
|
82 |
+
@abstractmethod
|
83 |
+
def get_all_models(self) -> list[str]:
|
84 |
+
"""Return all available models."""
|
85 |
+
|
86 |
+
@abstractmethod
|
87 |
+
def set_filter_get_df(self, *filters) -> pd.DataFrame:
|
88 |
+
"""Set the current set of filters and return the filtered DataFrame."""
|
89 |
+
|
90 |
+
|
91 |
+
class LLMTableManager(TableManager):
|
92 |
+
def __init__(self, data_dir: str, task_name: str) -> None:
|
93 |
+
"""Load leaderboard data from files in `data_dir`.
|
94 |
+
|
95 |
+
Under `data_dir`, there should be:
|
96 |
+
- `models.json`: JSON file that maps huggingface model IDs to model info.
|
97 |
+
Some models listed in this file may not have benchmark results.
|
98 |
+
- `schema.yaml`: YAML file containing the schema of the benchmark.
|
99 |
+
|
100 |
+
Then, benchmark data files are nested under `data_dir` according to the schema.
|
101 |
+
One directory hierarchy for each choice in the schema and then two more -- the
|
102 |
+
model's HuggingFace hub organization and the model name.
|
103 |
+
"""
|
104 |
+
super().__init__(data_dir)
|
105 |
+
|
106 |
+
self.task_name = task_name
|
107 |
+
|
108 |
+
# Read in the data into a Pandas DataFrame.
|
109 |
+
# Important: The ordering of `self.schema` determines the directory structure.
|
110 |
+
self.schema = yaml.safe_load(open(self.data_dir / "schema.yaml"))
|
111 |
+
models: dict[str, dict[str, Any]] = json.load(
|
112 |
+
open(self.data_dir / "models.json")
|
113 |
+
)
|
114 |
+
res_df = pd.DataFrame()
|
115 |
+
for choice in itertools.product(*self.schema.values()):
|
116 |
+
result_dir = self.data_dir / "/".join(choice)
|
117 |
+
with contextlib.suppress(FileNotFoundError):
|
118 |
+
for model_id, model_info in models.items():
|
119 |
+
for file in (result_dir / model_id).glob("*.json"):
|
120 |
+
model_df = pd.DataFrame([json.load(open(file))])
|
121 |
+
# Sanity checks and standardization of schema values.
|
122 |
+
assert model_df["Model"].iloc[0] == model_id
|
123 |
+
for key, val in zip(self.schema.keys(), choice):
|
124 |
+
assert (
|
125 |
+
str(val).lower() in str(model_df[key].iloc[0]).lower()
|
126 |
+
)
|
127 |
+
model_df[key] = val
|
128 |
+
# Format the model name as an HTML anchor.
|
129 |
+
model_df["Model"] = self._wrap_model_name(model_info["url"], model_info["nickname"])
|
130 |
+
model_df["Params"] = model_info["params"]
|
131 |
+
res_df = pd.concat([res_df, model_df])
|
132 |
+
|
133 |
+
if res_df.empty:
|
134 |
+
raise ValueError(
|
135 |
+
f"No benchmark JSON files were read from {self.data_dir=}."
|
136 |
+
)
|
137 |
+
|
138 |
+
# Order columns
|
139 |
+
columns = res_df.columns.to_list()
|
140 |
+
cols_to_order = ["Model", "Params"]
|
141 |
+
cols_to_order.extend(self.schema.keys())
|
142 |
+
columns = cols_to_order + [col for col in columns if col not in cols_to_order]
|
143 |
+
res_df = res_df[columns]
|
144 |
+
|
145 |
+
# Order rows
|
146 |
+
res_df = res_df.sort_values(by=["Model", *self.schema.keys(), "Energy/req (J)"])
|
147 |
+
|
148 |
+
self.cur_df = self.full_df = res_df.round(2)
|
149 |
+
|
150 |
+
# We need to set the default view separately when `gr.State` is forked.
|
151 |
+
self.set_filter_get_df()
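The directory walk in this constructor can be sketched on its own: each combination of schema choices becomes one nested directory, followed by the model's organization and name. The schema keys, choices, data directory, and model ID below are hypothetical stand-ins for what `schema.yaml` and `models.json` would provide.

```python
import itertools
from pathlib import PurePosixPath

# Hypothetical schema; the real one is loaded from `schema.yaml`.
# Key order matters because it fixes the nesting order of directories.
schema = {"gpu": ["A100", "H100"], "task": ["chat", "code"]}
data_dir = PurePosixPath("data/example")  # hypothetical
model_id = "example-org/example-model"  # hypothetical

# One result directory per combination of schema choices.
result_dirs = [
    data_dir / "/".join(choice) / model_id
    for choice in itertools.product(*schema.values())
]

for path in result_dirs:
    print(path)
```

Note that `"/".join(choice)` only works when every schema choice is a string, so numeric choices would have to be quoted in the YAML file.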
|
152 |
+
|
153 |
+
def get_benchmark_checkboxes(self) -> dict[str, list[str]]:
|
154 |
+
return self.schema
|
155 |
+
|
156 |
+
def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
|
157 |
+
return {"Target Time Per Output Token (TPOT) (s)": (0.0, 0.5, 0.01, 0.2)}
|
158 |
+
|
159 |
+
def get_all_models(self) -> list[str]:
|
160 |
+
return self.full_df["Model"].apply(self._unwrap_model_name).unique().tolist()
|
161 |
+
|
162 |
+
def set_filter_get_df(self, *filters) -> pd.DataFrame:
|
163 |
+
"""Set the current set of filters and return the filtered DataFrame.
|
164 |
+
|
165 |
+
Filters can either be completely empty, or be a concatenated list of
|
166 |
+
choices from all checkboxes and all sliders.
|
167 |
+
"""
|
168 |
+
# If the filter is empty, we default to the first choice for each checkbox.
|
169 |
+
if not filters:
|
170 |
+
checkboxes = [choices[:1] for choices in self.schema.values()]
|
171 |
+
sliders = [slider[3] for slider in self.get_benchmark_sliders().values()]
|
172 |
+
filters = checkboxes + sliders
|
173 |
+
|
174 |
+
index = np.full(len(self.full_df), True)
|
175 |
+
# Checkboxes
|
176 |
+
for setup, choice in zip(self.schema, filters):
|
177 |
+
index = index & self.full_df[setup].isin(choice)
|
178 |
+
self.cur_df = self.full_df.loc[index]
|
179 |
+
|
180 |
+
# Sliders (We just have TPOT for now.)
|
181 |
+
# For each `Model`, we want to first filter out rows whose `Avg TPOT (s)` is greater than the slider value.
|
182 |
+
# Then, for each `Model`, keep only the row whose `Energy/req (J)` is the smallest.
|
183 |
+
tpot_slo = filters[-1]
|
184 |
+
self.cur_df = (
|
185 |
+
self.cur_df
|
186 |
+
.groupby("Model")[self.cur_df.columns]
|
187 |
+
.apply(lambda x: x[x["Avg TPOT (s)"] <= tpot_slo], include_groups=True)
|
188 |
+
.sort_values(by="Energy/req (J)")
|
189 |
+
.reset_index(drop=True)
|
190 |
+
.groupby("Model")
|
191 |
+
.head(1)
|
192 |
+
)
|
193 |
+
|
194 |
+
return self.cur_df
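The slider logic above amounts to "drop rows that miss the TPOT target, then keep the lowest-energy row per model". A dependency-free sketch of that selection, using plain dictionaries in place of the DataFrame and made-up numbers (the rows and the `select_rows` helper are hypothetical, not part of the app):

```python
# Made-up benchmark rows, for illustration only.
rows = [
    {"Model": "A", "Avg TPOT (s)": 0.10, "Energy/req (J)": 500.0},
    {"Model": "A", "Avg TPOT (s)": 0.18, "Energy/req (J)": 300.0},
    {"Model": "A", "Avg TPOT (s)": 0.30, "Energy/req (J)": 200.0},
    {"Model": "B", "Avg TPOT (s)": 0.25, "Energy/req (J)": 150.0},
]


def select_rows(rows, tpot_slo):
    """Keep, per model, the lowest-energy row whose TPOT meets the target."""
    best = {}
    for row in rows:
        if row["Avg TPOT (s)"] > tpot_slo:
            continue  # This configuration misses the latency target.
        current = best.get(row["Model"])
        if current is None or row["Energy/req (J)"] < current["Energy/req (J)"]:
            best[row["Model"]] = row
    # Return rows ordered by energy, lowest first.
    return sorted(best.values(), key=lambda r: r["Energy/req (J)"])


selected = select_rows(rows, tpot_slo=0.2)
```

With a target of 0.2 s, model B has no qualifying row and drops out of the result entirely, which matches how a model can disappear from the table when the slider is tightened.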
|
195 |
+
|
196 |
+
def get_detail_text(self) -> tuple[str, str]:
|
197 |
+
text = """
|
198 |
+
Columns
|
199 |
+
- **Model**: The name of the model.
|
200 |
+
- **GPU**: Name of the GPU model used for benchmarking.
|
201 |
+
- **Params**: Number of parameters in the model.
|
202 |
+
- **TP**: Tensor parallelism degree.
|
203 |
+
- **PP**: Pipeline parallelism degree. (TP * PP is the total number of GPUs used.)
|
204 |
+
- **Energy/req (J)**: Energy consumed per request in Joules.
|
205 |
+
- **Avg TPOT (s)**: Average time per output token in seconds.
|
206 |
+
- **Token tput (toks/s)**: Average number of tokens generated by the engine per second.
|
207 |
+
- **Avg Output Tokens**: Average number of output tokens in the LLM's response.
|
208 |
+
- **Avg BS**: Average batch size of the serving engine over time.
|
209 |
+
- **Max BS**: Maximum batch size configuration of the serving engine.
|
210 |
+
|
211 |
+
For more detailed information, please take a look at the **About** tab.
|
212 |
+
"""
|
213 |
+
return "markdown", text
|
214 |
+
|
215 |
+
|
216 |
+
class LLMChatTableManager(LLMTableManager):
|
217 |
+
"""LLM table manager for chat tasks."""
|
218 |
+
|
219 |
+
def get_tab_name(self) -> str:
|
220 |
+
return "LLM Chat"
|
221 |
+
|
222 |
+
def get_intro_text(self) -> tuple[str, str]:
|
223 |
+
text = """
|
224 |
+
<h2>How much energy do GenAI models consume?</h2>
|
225 |
+
|
226 |
+
<h3>LLM chatbot response generation</h3>
|
227 |
+
|
228 |
+
<p style="font-size: 16px">
|
229 |
+
We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various instruction-tuned LLMs in terms of how much time and energy they consume for inference.
|
230 |
+
</p>
|
231 |
+
|
232 |
+
<p style="font-size: 16px">
|
233 |
+
An average Time Per Output Token (TPOT) of 0.20 seconds roughly corresponds to a person reading at 240 words per minute and 1.3 tokens per word.
|
234 |
+
</p>
|
235 |
+
"""
|
236 |
+
return "html", text
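The 0.20-second figure quoted in the intro text follows from simple arithmetic on the two numbers it names. A short worked sketch:

```python
words_per_minute = 240  # Reading speed stated in the intro text.
tokens_per_word = 1.3  # Tokens-per-word ratio stated in the intro text.

# Tokens a reader consumes per second at that speed.
tokens_per_second = words_per_minute * tokens_per_word / 60

# Time available per output token to keep up with the reader.
tpot_seconds = 1 / tokens_per_second

print(f"{tokens_per_second:.1f} tokens/s, {tpot_seconds:.3f} s per token")
```

This gives about 5.2 tokens per second, or roughly 0.19 seconds per token, which the text rounds to 0.20.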
|
237 |
+
|
238 |
+
|
239 |
+
class LLMCodeTableManager(LLMTableManager):
|
240 |
+
"""LLM table manager for coding tasks."""
|
241 |
+
|
242 |
+
def get_tab_name(self) -> str:
|
243 |
+
return "LLM Code"
|
244 |
+
|
245 |
+
def get_intro_text(self) -> tuple[str, str]:
|
246 |
+
text = """
|
247 |
+
<h2>How much energy do GenAI models consume?</h2>
|
248 |
+
|
249 |
+
<h3>LLM code generation</h3>
|
250 |
+
|
251 |
+
<p style="font-size: 16px">
|
252 |
+
We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various LLMs specialized for coding in terms of how much time and energy they consume for inference.
|
253 |
+
</p>
|
254 |
+
|
255 |
+
<p style="font-size: 16px">
|
256 |
+
An average Time Per Output Token (TPOT) of 0.20 seconds roughly corresponds to a person reading at 240 words per minute and 1.3 tokens per word.
|
257 |
+
</p>
|
258 |
+
"""
|
259 |
+
return "html", text
|
260 |
+
|
261 |
+
|
262 |
+
class VLMChatTableManager(LLMTableManager):
|
263 |
+
"""VLM table manager for chat tasks."""
|
264 |
+
|
265 |
+
def get_tab_name(self) -> str:
|
266 |
+
return "VLM Visual Chat"
|
267 |
+
|
268 |
+
def get_intro_text(self) -> tuple[str, str]:
|
269 |
+
text = """
|
270 |
+
<h2>How much energy do GenAI models consume?</h2>
|
271 |
+
|
272 |
+
<h3>VLM visual chatbot response generation</h3>
|
273 |
+
|
274 |
+
<p style="font-size: 16px">
|
275 |
+
We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various Vision Language Models (VLMs) in terms of how much time and energy they consume for inference.
|
276 |
+
</p>
|
277 |
+
|
278 |
+
<p style="font-size: 16px">
|
279 |
+
A Time Per Output Token (TPOT) of 0.2 seconds roughly corresponds to a person reading at 240 words per minute and 1.3 tokens per word.
|
280 |
+
</p>
|
281 |
+
"""
|
282 |
+
return "html", text
|
283 |
+
|
284 |
+
|
285 |
+
class DiffusionTableManager(TableManager):
|
286 |
+
def __init__(self, data_dir: str, task_name: str) -> None:
|
287 |
+
"""Load leaderboard data from files in `data_dir`.
|
288 |
+
|
289 |
+
Under `data_dir`, there should be:
|
290 |
+
- `models.json`: JSON file that maps huggingface model IDs to model info.
|
291 |
+
Some models listed in this file may not have benchmark results.
|
292 |
+
- `schema.yaml`: YAML file containing the schema of the benchmark.
|
293 |
+
|
294 |
+
Then, benchmark data files are nested under `data_dir` according to the schema.
|
295 |
+
One directory hierarchy for each choice in the schema and then two more -- the
|
296 |
+
model's HuggingFace hub organization and the model name.
|
297 |
+
"""
|
298 |
+
super().__init__(data_dir)
|
299 |
+
|
300 |
+
self.task_name = task_name
|
301 |
+
|
302 |
+
if "to video" in task_name.lower():
|
303 |
+
self.energy_col = "Energy/video (J)"
|
304 |
+
elif "to image" in task_name.lower():
|
305 |
+
self.energy_col = "Energy/image (J)"
|
306 |
+
else:
|
307 |
+
raise ValueError(f"Unknown task name: {task_name=}")
|
308 |
+
|
309 |
+
# Read in the data into a Pandas DataFrame.
|
310 |
+
# Important: The ordering of `self.schema` determines the directory structure.
|
311 |
+
self.schema = yaml.safe_load(open(self.data_dir / "schema.yaml"))
|
312 |
+
models: dict[str, dict[str, Any]] = json.load(
|
313 |
+
open(self.data_dir / "models.json")
|
314 |
+
)
|
315 |
+
res_df = pd.DataFrame()
|
316 |
+
for choice in itertools.product(*self.schema.values()):
|
317 |
+
result_dir = self.data_dir / "/".join(choice)
|
318 |
+
with contextlib.suppress(FileNotFoundError):
|
319 |
+
for model_id, model_info in models.items():
|
320 |
+
for file in (result_dir / model_id).glob("*.json"):
|
321 |
+
model_df = pd.DataFrame([json.load(open(file))])
|
322 |
+
# Sanity checks and standardization of schema values.
|
323 |
+
assert model_df["Model"].iloc[0] == model_id
|
324 |
+
for key, val in zip(self.schema.keys(), choice):
|
325 |
+
assert (
|
326 |
+
str(val).lower() in str(model_df[key].iloc[0]).lower()
|
327 |
+
)
|
328 |
+
model_df[key] = val
|
329 |
+
# Format the model name as an HTML anchor.
|
330 |
+
model_df["Model"] = self._wrap_model_name(model_info["url"], model_info["nickname"])
|
331 |
+
model_df["Total params"] = model_info["total_params"]
|
332 |
+
model_df["Denoising params"] = model_info["denoising_params"]
|
333 |
+
model_df["Resolution"] = model_info["resolution"]
|
334 |
+
res_df = pd.concat([res_df, model_df])
|
335 |
+
|
336 |
+
if res_df.empty:
|
337 |
+
raise ValueError(
|
338 |
+
f"No benchmark JSON files were read from {self.data_dir=}."
|
339 |
+
)
|
340 |
+
|
341 |
+
# Order columns
|
342 |
+
columns = res_df.columns.to_list()
|
343 |
+
cols_to_order = ["Model", "Denoising params", "Total params"]
|
344 |
+
cols_to_order.extend(self.schema.keys())
|
345 |
+
columns = cols_to_order + [col for col in columns if col not in cols_to_order]
|
346 |
+
res_df = res_df[columns]
|
347 |
+
|
348 |
+
# Order rows
|
349 |
+
res_df = res_df.sort_values(by=["Model", *self.schema.keys(), self.energy_col])
|
350 |
+
|
351 |
+
self.cur_df = self.full_df = res_df.round(2)
|
352 |
+
|
353 |
+
# We need to set the default view separately when `gr.State` is forked.
|
354 |
+
self.set_filter_get_df()
|
355 |
+
|
356 |
+
def get_benchmark_checkboxes(self) -> dict[str, list[str]]:
|
357 |
+
return self.schema
|
358 |
+
|
359 |
+
def get_all_models(self) -> list[str]:
|
360 |
+
return self.full_df["Model"].apply(self._unwrap_model_name).unique().tolist()
|
361 |
+
|
362 |
+
def set_filter_get_df(self, *filters) -> pd.DataFrame:
|
363 |
+
"""Set the current set of filters and return the filtered DataFrame.
|
364 |
+
|
365 |
+
Filters can either be completely empty, or be a concatenated list of
|
366 |
+
choices from all checkboxes and all sliders.
|
367 |
+
"""
|
368 |
+
# If the filter is empty, we default to the first choice for each key.
|
369 |
+
if not filters:
|
370 |
+
checkboxes = [choices[:1] for choices in self.schema.values()]
|
371 |
+
sliders = [slider[3] for slider in self.get_benchmark_sliders().values()]
|
372 |
+
filters = checkboxes + sliders
|
373 |
+
|
374 |
+
index = np.full(len(self.full_df), True)
|
375 |
+
# Checkboxes
|
376 |
+
for setup, choice in zip(self.schema, filters):
|
377 |
+
index = index & self.full_df[setup].isin(choice)
|
378 |
+
self.cur_df = self.full_df.loc[index]
|
379 |
+
|
380 |
+
# Sliders (We just have Batch latency for now.)
|
381 |
+
# For each `Model`, we want to first filter out rows whose `Batch latency (s)` is greater than the slider value.
|
382 |
+
# Then, for each `Model`, keep only the row whose `Energy/image (J)` or `Energy/video (J)` is the smallest.
|
383 |
+
batch_latency = filters[-1]
|
384 |
+
self.cur_df = (
|
385 |
+
self.cur_df
|
386 |
+
.groupby("Model")[self.cur_df.columns]
|
387 |
+
.apply(
|
388 |
+
lambda x: x[x["Batch latency (s)"] <= batch_latency],
|
389 |
+
include_groups=True,
|
390 |
+
)
|
391 |
+
.sort_values(by=self.energy_col)
|
392 |
+
.reset_index(drop=True)
|
393 |
+
.groupby("Model")
|
394 |
+
.head(1)
|
395 |
+
)
|
396 |
+
|
397 |
+
return self.cur_df
|
398 |
+
|
399 |
+
|
400 |
+
class DiffusionT2ITableManager(DiffusionTableManager):
|
401 |
+
"""Diffusion table manager for text-to-image tasks."""
|
402 |
+
|
403 |
+
def get_tab_name(self) -> str:
|
404 |
+
return "Diffusion Text to image"
|
405 |
+
|
406 |
+
def get_intro_text(self) -> tuple[str, str]:
|
407 |
+
text = """
|
408 |
+
<h2>Diffusion text-to-image generation</h2><br/>
|
409 |
+
|
410 |
+
<p style="font-size: 16px">
|
411 |
+
We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source diffusion models in terms of how much time and energy they consume for inference.
|
412 |
+
</p>
|
413 |
+
|
414 |
+
<p style="font-size: 16px">
|
415 |
+
The time and energy consumption of Diffusion models are affected by not only the size of the model, but also the number of denoising steps and the resolution of the generated images.
|
416 |
+
</p>
|
417 |
+
"""
|
418 |
+
return "html", text
|
419 |
+
|
420 |
+
def get_detail_text(self) -> tuple[str, str]:
|
421 |
+
text = """
|
422 |
+
Columns
|
423 |
+
- **Model**: The name of the model.
|
424 |
+
- **Denoising params**: Number of parameters in the denoising module (e.g., UNet, Transformer).
|
425 |
+
- **Total params**: Total number of parameters in the model, including encoders and decoders.
|
426 |
+
- **GPU**: Name of the GPU model used for benchmarking.
|
427 |
+
- **Energy/image (J)**: Energy consumed per generated image in Joules.
|
428 |
+
- **Batch latency (s)**: Time taken to generate a batch of images in seconds.
|
429 |
+
- **Batch size**: Number of prompts/images in a batch.
|
430 |
+
- **Denoising steps**: Number of denoising steps used for the diffusion model.
|
431 |
+
- **Resolution**: Resolution of the generated image.
|
432 |
+
|
433 |
+
For more detailed information, please take a look at the **About** tab.
|
434 |
+
"""
|
435 |
+
return "markdown", text
|
436 |
+
|
437 |
+
def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
|
438 |
+
return {"Batch latency (s)": (0.0, 60.0, 1.0, 10.0)}
|
439 |
+
|
440 |
+
|
441 |
+
class DiffusionT2VTableManager(DiffusionTableManager):
|
442 |
+
"""Diffusion table manager for text-to-video tasks."""
|
443 |
+
|
444 |
+
def get_tab_name(self) -> str:
|
445 |
+
return "Diffusion Text to video"
|
446 |
+
|
447 |
+
def get_intro_text(self) -> tuple[str, str]:
|
448 |
+
text = """
|
449 |
+
<h2>Diffusion text-to-video generation</h2><br/>
|
450 |
+
|
451 |
+
<p style="font-size: 16px">
|
452 |
+
We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source diffusion models in terms of how much time and energy they consume for inference.
|
453 |
+
</p>
|
454 |
+
|
455 |
+
<p style="font-size: 16px">
|
456 |
+
The time and energy consumption of Diffusion models are affected by not only the size of the model, but also the number of denoising steps, the resolution of the generated video, and the total number of frames in the video.
|
457 |
+
</p>
|
458 |
+
"""
|
459 |
+
return "html", text
|
460 |
+
|
461 |
+
def get_detail_text(self) -> tuple[str, str]:
|
462 |
+
text = """
|
463 |
+
Columns
|
464 |
+
- **Model**: The name of the model.
|
465 |
+
- **Denoising params**: Number of parameters in the denoising module (e.g., UNet, Transformer).
|
466 |
+
- **Total params**: Total number of parameters in the model, including encoders and decoders.
|
467 |
+
- **GPU**: Name of the GPU model used for benchmarking.
|
468 |
+
- **Energy/video (J)**: Energy consumed per generated video in Joules.
|
469 |
+
- **Batch latency (s)**: Time taken to generate a batch of videos in seconds.
|
470 |
+
- **Batch size**: Number of prompts/videos in a batch.
|
471 |
+
- **Denoising steps**: Number of denoising steps used for the diffusion model.
|
472 |
+
- **Frames**: Number of frames in the generated video.
|
473 |
+
- **Resolution**: Resolution of the generated video.
|
474 |
+
|
475 |
+
For more detailed information, please take a look at the **About** tab.
|
476 |
+
"""
|
477 |
+
return "markdown", text
|
478 |
+
|
479 |
+
def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
|
480 |
+
return {"Batch latency (s)": (0.0, 60.0, 1.0, 10.0)}
|
481 |
+
|
482 |
+
|
483 |
+
class DiffusionI2VTableManager(DiffusionTableManager):
|
484 |
+
"""Diffusion table manager for image-to-video tasks."""
|
485 |
+
|
486 |
+
def get_tab_name(self) -> str:
|
487 |
+
return "Diffusion Image to video"
|
488 |
+
|
489 |
+
def get_intro_text(self) -> tuple[str, str]:
|
490 |
+
text = """
|
491 |
+
<h2>Diffusion image-to-video generation</h2><br/>
|
492 |
+
|
493 |
+
<p style="font-size: 16px">
|
494 |
+
We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source diffusion models in terms of how much time and energy they consume for inference.
|
495 |
+
</p>
|
496 |
+
|
497 |
+
<p style="font-size: 16px">
|
498 |
+
The time and energy consumption of Diffusion models are affected by not only the size of the model, but also the number of denoising steps, the resolution of the generated video, and the total number of frames in the video.
|
499 |
+
</p>
|
500 |
+
"""
|
501 |
+
return "html", text
|
502 |
+
|
503 |
+
def get_detail_text(self) -> tuple[str, str]:
|
504 |
+
text = """
|
505 |
+
Columns
|
506 |
+
- **Model**: The name of the model.
|
507 |
+
- **Denoising params**: Number of parameters in the denoising module (e.g., UNet, Transformer).
|
508 |
+
- **Total params**: Total number of parameters in the model, including encoders and decoders.
|
509 |
+
- **GPU**: Name of the GPU model used for benchmarking.
|
510 |
+
- **Energy/video (J)**: Energy consumed per generated video in Joules.
|
511 |
+
- **Batch latency (s)**: Time taken to generate a batch of videos in seconds.
|
512 |
+
- **Batch size**: Number of prompts/videos in a batch.
|
513 |
+
- **Denoising steps**: Number of denoising steps used for the diffusion model.
|
514 |
+
- **Frames**: Number of frames in the generated video.
|
515 |
+
- **Resolution**: Resolution of the generated video.
|
516 |
+
|
517 |
+
For more detailed information, please take a look at the **About** tab.
|
518 |
+
"""
|
519 |
+
return "markdown", text
|
520 |
+
|
521 |
+
def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
|
522 |
+
return {"Batch latency (s)": (0.0, 120.0, 1.0, 45.0)}
|
523 |
+
|
524 |
+
|
525 |
+
class LegacyTableManager:
|
526 |
def __init__(self, data_dir: str) -> None:
|
527 |
+
"""Load the legacy LLM leaderboard data from CSV files in data_dir.
|
528 |
|
529 |
Inside `data_dir`, there should be:
|
530 |
- `models.json`: a JSON file containing information about each model.
|
|
|
553 |
f'<a style="text-decoration: underline; text-decoration-style: dotted" '
|
554 |
f'target="_blank" href="{url}">{nickname}</a>'
|
555 |
)
|
556 |
+
|
557 |
df["model"] = df["model"].apply(format_model_link)
|
558 |
|
559 |
# Sort by our 'energy efficiency' score.
|
|
|
606 |
"""Formats into HTML that prints in Monospace font."""
|
607 |
return f"<pre style='font-family: monospace'>{text}</pre>"
|
608 |
|
609 |
def get_dropdown(self):
|
610 |
columns = self.full_df.columns.tolist()[1:]
|
611 |
return [
|
|
|
635 |
self.cur_index = index
|
636 |
return self.cur_df
|
637 |
|
638 |
+
def get_intro_text(self) -> str:
|
639 |
+
"""Return the leaderboard's introduction text in HTML."""
|
640 |
+
return """
|
641 |
+
<div align="center">
|
642 |
+
<h2 style="color: #23d175">This is the legacy ML.ENERGY LLM leaderboard. This will be removed by the end of the year.</h2>
|
643 |
+
</div>
|
644 |
+
|
645 |
+
<h3>How much energy do modern Large Language Models (LLMs) consume for inference?</h3>
|
646 |
+
|
647 |
+
<p style="font-size: 16px">
|
648 |
+
We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source LLMs in terms of how much time and energy they consume for inference.
|
649 |
+
</p>
|
650 |
+
|
651 |
+
<p style="font-size: 16px">
|
652 |
+
For more detailed information, please take a look at the <b>About</b> tab.
|
653 |
+
Every benchmark is limited in some sense -- Before you interpret the results, please take a look at the <b>Limitations</b> section there, too.
|
654 |
+
</p>
|
655 |
+
"""
|
656 |
|
|
|
657 |
|
658 |
# The global instance of the TableManager should only be used when
|
659 |
# initializing components in the Gradio interface. If the global instance
|
660 |
# is mutated while handling user sessions, the change will be reflected
|
661 |
# in every user session. Instead, the instance provided by gr.State should
|
662 |
# be used.
|
663 |
+
global_ltbm = LegacyTableManager("data/legacy")
|
664 |
+
global_tbms = [
|
665 |
+
LLMChatTableManager("data/llm_text_generation/chat", "Chat"),
|
666 |
+
LLMCodeTableManager("data/llm_text_generation/code", "Code"),
|
667 |
+
VLMChatTableManager("data/mllm_text_generation/chat", "Visual chat"),
|
668 |
+
DiffusionT2ITableManager("data/diffusion/text-to-image", "Text to image"),
|
669 |
+
DiffusionT2VTableManager("data/diffusion/text-to-video", "Text to video"),
|
670 |
+
DiffusionI2VTableManager("data/diffusion/image-to-video", "Image to video"),
|
671 |
+
]
|
|
|
|
|
672 |
|
673 |
# Custom JS.
|
674 |
# XXX: This is a hack to make the model names clickable.
|
|
|
682 |
dataframe_update_js = f"""
|
683 |
function format_model_link() {{
|
684 |
// Iterate over the cells of the first column of the leaderboard table.
|
685 |
+
var table_element = document.querySelectorAll(".tab-leaderboard");
|
686 |
+
for (var table of table_element) {{
|
687 |
+
for (let index = 1; index <= {len(global_ltbm.full_df) + sum(len(tbm.full_df) for tbm in global_tbms)}; index++) {{
|
688 |
+
// Get the cell from `table`.
|
689 |
+
var cell = table.querySelector(`div > div > div > table > tbody > tr:nth-child(${{index}}) > td:nth-child(1) > div > span`);
|
690 |
+
// var cell = document.querySelector(
|
691 |
+
// `.tab-leaderboard > div > div > div > table > tbody > tr:nth-child(${{index}}) > td:nth-child(1) > div > span`
|
692 |
+
// );
|
693 |
|
694 |
// If nothing was found, it likely means that the visible table now has fewer rows
|
695 |
// than the full table. This happens when the user filters the table. In this case,
|
|
|
713 |
// Replace the innerHTML of the cell with the interpreted HTML.
|
714 |
cell.replaceChildren(model_anchor);
|
715 |
}}
|
716 |
+
}}
|
717 |
|
718 |
// Return all arguments as is.
|
719 |
return arguments
|
|
|
797 |
}
|
798 |
"""
|
799 |
|
800 |
# The app will not start without a controller address set.
|
801 |
controller_addr = os.environ.get("COLOSSEUM_CONTROLLER_ADDR")
|
802 |
if controller_addr is None:
|
803 |
COLOSSEUM_UP = False
|
804 |
+
COLOSSEUM_DOWN_MESSAGE = "<br/><h2 style='text-align: center'>Local testing mode. Colosseum disabled.</h2>"
|
805 |
controller_addr = "localhost"
|
806 |
global_controller_client = ControllerClient(controller_addr=controller_addr, timeout=15)
|
807 |
|
808 |
+
# Fetch the latest update date of the leaderboard repository.
|
809 |
+
resp = requests.get("https://api.github.com/repos/ml-energy/leaderboard/commits/master")
|
810 |
+
if resp.status_code != 200:
|
811 |
+
current_date = "[Failed to fetch]"
|
812 |
+
print("Failed to fetch the latest release date of the leaderboard repository.")
|
813 |
+
print(resp.json())
|
814 |
+
else:
|
815 |
+
current_datetime = parser.parse(resp.json()["commit"]["author"]["date"])
|
816 |
+
current_date = current_datetime.astimezone(tz.gettz("US/Eastern")).strftime(
|
817 |
+
"%Y-%m-%d"
|
818 |
+
)
|
819 |
+
|
820 |
# Load the list of models. To reload, the app should be restarted.
|
821 |
RANDOM_MODEL_NAME = "Random"
|
822 |
RANDOM_USER_PREFERENCE = "Two random models"
|
|
|
825 |
model_name_to_user_pref[RANDOM_MODEL_NAME] = RANDOM_USER_PREFERENCE
|
826 |
user_pref_to_model_name = {v: k for k, v in model_name_to_user_pref.items()}
|
827 |
|
828 |
+
|
829 |
# Colosseum helper functions.
|
830 |
+
def enable_interact(num: int):
|
831 |
+
def inner():
|
832 |
+
return [gr.update(interactive=True)] * num
|
833 |
+
return inner
|
834 |
+
|
835 |
+
|
836 |
+
def disable_interact(num: int):
|
837 |
+
def inner():
|
838 |
+
return [gr.update(interactive=False)] * num
|
839 |
+
return inner
|
840 |
|
|
|
|
|
841 |
|
842 |
def consumed_less_energy_message(energy_a, energy_b):
|
843 |
"""Return a message that indicates that the user chose the model that consumed less energy.
|
|
|
850 |
how_much = f"{1 / factor:.1f}x" if factor <= 0.5 else f"{100 - factor * 100:.1f}%"
|
851 |
return f"<h2>That response also <span class='green-text'>consumed {how_much} less energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J)!</h2>"
|
852 |
|
853 |
+
|
854 |
def consumed_more_energy_message(energy_a, energy_b):
|
855 |
"""Return a message that indicates that the user chose the model that consumed more energy.
|
856 |
|
|
|
862 |
how_much = f"{factor:.1f}x" if factor >= 2.0 else f"{factor * 100 - 100:.1f}%"
|
863 |
return f"<h2>That response <span class='red-text'>consumed {how_much} more energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J).</h2>"
|
864 |
|
865 |
+
|
866 |
# Colosseum event handlers
|
867 |
def on_load():
|
868 |
"""Intialize the dataframe, shuffle the model preference dropdown choices."""
|
869 |
+
dataframe = global_ltbm.set_filter_get_df()
|
870 |
+
dataframes = [global_tbm.set_filter_get_df() for global_tbm in global_tbms]
|
871 |
available_models = copy.deepcopy(global_available_models)
|
872 |
random.shuffle(available_models)
|
873 |
available_models.insert(0, RANDOM_MODEL_NAME)
|
874 |
+
return (
|
875 |
+
dataframe,
|
876 |
+
*dataframes,
|
877 |
+
gr.Dropdown.update(
|
878 |
+
choices=[model_name_to_user_pref[model] for model in available_models]
|
879 |
+
),
|
880 |
+
)
|
881 |
+
|
882 |
|
883 |
def add_prompt_disable_submit(prompt, history_a, history_b):
|
884 |
"""Add the user's prompt to the two model's history and disable further submission."""
|
|
|
892 |
client,
|
893 |
]
|
894 |
|
895 |
+
|
896 |
def generate_responses(client: ControllerClient, user_preference, history_a, history_b):
|
897 |
"""Generate responses for the two models."""
|
898 |
model_preference = user_pref_to_model_name[user_preference]
|
899 |
for resp_a, resp_b in itertools.zip_longest(
|
900 |
+
client.prompt(
|
901 |
+
prompt=history_a[-1][0], index=0, model_preference=model_preference
|
902 |
+
),
|
903 |
+
client.prompt(
|
904 |
+
prompt=history_b[-1][0], index=1, model_preference=model_preference
|
905 |
+
),
|
906 |
):
|
907 |
if resp_a is not None:
|
908 |
history_a[-1][1] += resp_a
|
|
|
910 |
history_b[-1][1] += resp_b
|
911 |
yield [history_a, history_b]
|
912 |
|
913 |
+
|
914 |
def make_resp_vote_func(victory_index: Literal[0, 1]):
|
915 |
"""Return a function that will be called when the user clicks on response preference vote buttons."""
|
916 |
+
|
917 |
def resp_vote_func(client: ControllerClient):
|
918 |
vote_response = client.response_vote(victory_index=victory_index)
|
919 |
model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
|
|
|
948 |
# Keep the reset button disabled
|
949 |
gr.Button.update(visible=False, interactive=False),
|
950 |
]
|
951 |
+
|
952 |
return resp_vote_func
|
953 |
|
954 |
+
|
955 |
def make_energy_vote_func(is_worth: bool):
|
956 |
"""Return a function that will be called when the user clicks on energy vote buttons."""
|
957 |
+
|
958 |
def energy_vote_func(client: ControllerClient, energy_message: str):
|
959 |
vote_response = client.energy_vote(is_worth=is_worth)
|
960 |
model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
|
|
|
968 |
# Append to the energy comparison message
|
969 |
energy_message[:-5] + (" Fair enough.</h2>" if is_worth else " Wasn't worth it.</h2>"),
|
970 |
]
|
971 |
+
|
972 |
return energy_vote_func
|
973 |
|
974 |
+
|
975 |
def play_again():
|
976 |
available_models = copy.deepcopy(global_available_models)
|
977 |
random.shuffle(available_models)
|
|
|
986 |
# Hide energy vote buttons and message
|
987 |
gr.Button.update(visible=False), gr.Button.update(visible=False), gr.Markdown.update(visible=False),
|
988 |
# Enable model preference dropdown and shuffle choices
|
989 |
+
gr.Dropdown.update(
|
990 |
+
value=RANDOM_USER_PREFERENCE,
|
991 |
+
choices=[model_name_to_user_pref[model] for model in available_models],
|
992 |
+
interactive=True,
|
993 |
+
),
|
994 |
# Disable reset button
|
995 |
gr.Button.update(interactive=False, visible=False),
|
996 |
]
|
997 |
|
998 |
+
|
999 |
focus_prompt_input_js = """
|
1000 |
function() {
|
1001 |
for (let textarea of document.getElementsByTagName("textarea")) {
|
|
|
1008 |
"""
|
1009 |
|
1010 |
with gr.Blocks(css=custom_css) as block:
|
1011 |
+
tbm = gr.State(global_ltbm) # type: ignore
|
1012 |
+
local_tbms: list[TableManager] = [gr.State(global_tbm) for global_tbm in global_tbms] # type: ignore
|
1013 |
+
|
1014 |
with gr.Box():
|
1015 |
+
gr.HTML(
|
1016 |
+
"<h1><a href='https://ml.energy' class='text-logo'>ML.ENERGY</a> Leaderboard</h1>"
|
1017 |
+
)
|
1018 |
|
1019 |
with gr.Tabs():
|
1020 |
# Tab: Colosseum.
|
1021 |
+
with gr.Tab("Colosseum ⚔️️"):
|
1022 |
if COLOSSEUM_UP:
|
1023 |
gr.Markdown(open("docs/colosseum_top.md").read())
|
1024 |
else:
|
|
|
1058 |
resp_vote_btn_list: list[gr.component.Component] = []
|
1059 |
with gr.Column():
|
1060 |
with gr.Row():
|
1061 |
+
masked_model_names.append(
|
1062 |
+
gr.Markdown(visible=False, elem_classes=["model-name-text"])
|
1063 |
+
)
|
1064 |
with gr.Row():
|
1065 |
+
chatbots.append(
|
1066 |
+
gr.Chatbot(
|
1067 |
+
label="Model A",
|
1068 |
+
elem_id="chatbot",
|
1069 |
+
height=400,
|
1070 |
+
elem_classes=None if COLOSSEUM_UP else ["greyed-out"],
|
1071 |
+
)
|
1072 |
+
)
|
1073 |
with gr.Row():
|
1074 |
+
left_resp_vote_btn = gr.Button(
|
1075 |
+
value="👈 Model A is better", interactive=False
|
1076 |
+
)
|
1077 |
resp_vote_btn_list.append(left_resp_vote_btn)
|
1078 |
|
1079 |
with gr.Column():
|
1080 |
with gr.Row():
|
1081 |
+
masked_model_names.append(
|
1082 |
+
gr.Markdown(visible=False, elem_classes=["model-name-text"])
|
1083 |
+
)
|
1084 |
with gr.Row():
|
1085 |
+
chatbots.append(
|
1086 |
+
gr.Chatbot(
|
1087 |
+
label="Model B",
|
1088 |
+
elem_id="chatbot",
|
1089 |
+
height=400,
|
1090 |
+
elem_classes=None if COLOSSEUM_UP else ["greyed-out"],
|
1091 |
+
)
|
1092 |
+
)
|
1093 |
with gr.Row():
|
1094 |
+
right_resp_vote_btn = gr.Button(
|
1095 |
+
value="👉 Model B is better", interactive=False
|
1096 |
+
)
|
1097 |
resp_vote_btn_list.append(right_resp_vote_btn)
|
1098 |
|
1099 |
with gr.Row():
|
1100 |
energy_comparison_message = gr.HTML(visible=False)
|
1101 |
|
1102 |
with gr.Row():
|
1103 |
+
worth_energy_vote_btn = gr.Button(
|
1104 |
+
value="The better response was worth 👍 the extra energy.",
|
1105 |
+
visible=False,
|
1106 |
+
)
|
1107 |
+
notworth_energy_vote_btn = gr.Button(
|
1108 |
+
value="Not really worth that much more. 👎", visible=False
|
1109 |
+
)
|
1110 |
+
energy_vote_btn_list: list[gr.component.Component] = [
|
1111 |
+
worth_energy_vote_btn,
|
1112 |
+
notworth_energy_vote_btn,
|
1113 |
+
]
|
1114 |
|
1115 |
with gr.Row():
|
1116 |
+
play_again_btn = gr.Button(
|
1117 |
+
"Play again!", visible=False, elem_classes=["btn-submit"]
|
1118 |
+
)
|
1119 |
|
1120 |
gr.Markdown(open("docs/colosseum_bottom.md").read())
|
1121 |
|
|
|
1125 |
(prompt_input
|
1126 |
.submit(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
|
1127 |
.then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
|
1128 |
+
.then(enable_interact(2), None, resp_vote_btn_list, queue=False))
|
1129 |
(prompt_submit_btn
|
1130 |
.click(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
|
1131 |
.then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
|
1132 |
+
.then(enable_interact(2), None, resp_vote_btn_list, queue=False))
|
1133 |
|
1134 |
left_resp_vote_btn.click(
|
1135 |
make_resp_vote_func(victory_index=0),
|
|
|
1166 |
)
|
1167 |
.then(None, _js=focus_prompt_input_js, queue=False))
|
1168 |
|
1169 |
+
# Tab: Leaderboards.
|
1170 |
+
dataframes = []
|
1171 |
+
for global_tbm, local_tbm in zip(global_tbms, local_tbms):
|
1172 |
+
with gr.Tab(global_tbm.get_tab_name()):
|
1173 |
+
# Box: Introduction text.
|
1174 |
+
with gr.Box():
|
1175 |
+
intro_text_type, intro_text = global_tbm.get_intro_text()
|
1176 |
+
if intro_text_type not in ["markdown", "html"]:
|
1177 |
+
raise ValueError(f"Invalid text type '{intro_text_type}' from {local_tbm}")
|
1178 |
+
if intro_text_type == "markdown":
|
1179 |
+
gr.Markdown(intro_text)
|
1180 |
+
else:
|
1181 |
+
gr.HTML(intro_text)
|
1182 |
+
|
1183 |
+
# Block: Checkboxes and sliders to select benchmarking parameters.
|
1184 |
+
with gr.Row():
|
1185 |
+
checkboxes: list[gr.CheckboxGroup] = []
|
1186 |
+
for key, choices in global_tbm.get_benchmark_checkboxes().items():
|
1187 |
+
# Check the first element by default.
|
1188 |
+
checkboxes.append(gr.CheckboxGroup(choices=choices, value=choices[:1], label=key))
|
1189 |
+
|
1190 |
+
sliders: list[gr.Slider] = []
|
1191 |
+
for key, (min_val, max_val, step, default) in global_tbm.get_benchmark_sliders().items():
|
1192 |
+
sliders.append(gr.Slider(minimum=min_val, maximum=max_val, value=default, step=step, label=key))
|
1193 |
+
|
1194 |
+
# Block: Leaderboard table.
|
1195 |
+
with gr.Row():
|
1196 |
+
dataframe = gr.Dataframe(
|
1197 |
+
type="pandas",
|
1198 |
+
elem_classes=["tab-leaderboard"],
|
1199 |
+
interactive=False,
|
1200 |
+
)
|
1201 |
+
dataframes.append(dataframe)
|
1202 |
|
1203 |
+
# Make sure the models have clickable links.
|
1204 |
+
dataframe.change(
|
1205 |
+
None, None, None, _js=dataframe_update_js, queue=False
|
1206 |
+
)
|
1207 |
+
# Table automatically updates when users check or uncheck any checkbox or move any slider.
|
1208 |
+
for element in [*checkboxes, *sliders]:
|
1209 |
+
element.change(
|
1210 |
+
global_tbm.__class__.set_filter_get_df,
|
1211 |
+
inputs=[local_tbm, *checkboxes, *sliders],
|
1212 |
+
outputs=dataframe,
|
1213 |
+
queue=False,
|
1214 |
+
)
|
1215 |
+
|
1216 |
+
# Block: More details about the leaderboard.
|
1217 |
+
with gr.Box():
|
1218 |
+
detail_text_type, detail_text = global_tbm.get_detail_text()
|
1219 |
+
if detail_text_type not in ["markdown", "html"]:
|
1220 |
+
raise ValueError(f"Invalid text type '{detail_text_type}' from {local_tbm}")
|
1221 |
+
if detail_text_type == "markdown":
|
1222 |
+
gr.Markdown(detail_text)
|
1223 |
+
else:
|
1224 |
+
gr.HTML(detail_text)
|
1225 |
+
|
1226 |
+
# Block: Leaderboard date.
|
1227 |
+
with gr.Row():
|
1228 |
+
gr.HTML(
|
1229 |
+
f"<h3 style='color: gray'>Last updated: {current_date}</h3>"
|
1230 |
+
)
|
1231 |
+
|
1232 |
+
# Tab: Legacy leaderboard.
|
1233 |
+
with gr.Tab("LLM Leaderboard (legacy)"):
|
1234 |
with gr.Box():
|
1235 |
+
gr.HTML(global_ltbm.get_intro_text())
|
1236 |
|
1237 |
# Block: Checkboxes to select benchmarking parameters.
|
1238 |
with gr.Row():
|
1239 |
with gr.Box():
|
1240 |
gr.Markdown("### Benchmark results to show")
|
1241 |
checkboxes: list[gr.CheckboxGroup] = []
|
1242 |
+
for key, choices in global_ltbm.schema.items():
|
1243 |
# Specifying `value` checks only the first choice by default.
|
1244 |
+
checkboxes.append(
|
1245 |
+
gr.CheckboxGroup(
|
1246 |
+
choices=choices, value=choices[:1], label=key
|
1247 |
+
)
|
1248 |
+
)
|
1249 |
|
1250 |
# Block: Leaderboard table.
|
1251 |
with gr.Row():
|
1252 |
+
dataframe = gr.Dataframe(
|
1253 |
+
type="pandas", elem_classes=["tab-leaderboard"], interactive=False
|
1254 |
+
)
|
1255 |
# Make sure the models have clickable links.
|
1256 |
dataframe.change(None, None, None, _js=dataframe_update_js, queue=False)
|
1257 |
# Table automatically updates when users check or uncheck any checkbox.
|
1258 |
for checkbox in checkboxes:
|
1259 |
+
checkbox.change(
|
1260 |
+
LegacyTableManager.set_filter_get_df,
|
1261 |
+
inputs=[tbm, *checkboxes],
|
1262 |
+
outputs=dataframe,
|
|
1263 |
queue=False,
|
1264 |
)
|
1265 |
|
|
|
1269 |
|
1270 |
# Tab: About page.
|
1271 |
with gr.Tab("About"):
|
1272 |
+
gr.Markdown(open("docs/about.md").read())
|
|
|
1273 |
|
1274 |
# Citation
|
1275 |
with gr.Accordion("📚 Citation", open=False, elem_id="citation-header"):
|
|
|
1283 |
)
|
1284 |
|
1285 |
# Load the table on page load.
|
1286 |
+
block.load(
|
1287 |
+
on_load,
|
1288 |
+
outputs=[dataframe, *dataframes, model_preference_dropdown],
|
1289 |
+
queue=False,
|
1290 |
+
)
|
1291 |
|
1292 |
|
1293 |
if __name__ == "__main__":
|
1294 |
parser = argparse.ArgumentParser()
|
1295 |
+
parser.add_argument(
|
1296 |
+
"--share", action="store_true", help="Specify if sharing is enabled"
|
1297 |
+
)
|
1298 |
parser.add_argument("--concurrency", type=int, default=50)
|
1299 |
|
1300 |
args = parser.parse_args()
|
1301 |
+
block.queue(concurrency_count=args.concurrency, api_open=False).launch(
|
1302 |
+
share=args.share, show_error=True
|
1303 |
+
)
|
benchmark/.gitignore
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
**/results/
|
benchmark/README.md
ADDED
@@ -0,0 +1,14 @@
|
1 |
+
# ML.ENERGY Leaderboard Benchmark Suite
|
2 |
+
|
3 |
+
```
|
4 |
+
benchmark/
|
5 |
+
├── common/
|
6 |
+
├── diffusion/
|
7 |
+
│ └── text-to-image/
|
8 |
+
└── llm_text_generation/
|
9 |
+
├── chat/
|
10 |
+
└── code/
|
11 |
+
```
|
12 |
+
|
13 |
+
The `common` directory is for utilities that are common to all benchmarking tasks.
|
14 |
+
Every other top-level directory corresponds to one type of model, with subdirectories for more specific tasks.
|
benchmark/common/download_weights.sh
ADDED
@@ -0,0 +1,7 @@
|
1 |
+
#!/usr/bin/env bash
|
2 |
+
|
3 |
+
QUEUE_FILE="$1"
|
4 |
+
|
5 |
+
for model in $(tail -n +4 "$QUEUE_FILE" | awk '{print $2}'); do
|
6 |
+
HF_HOME=/data/leaderboard/hfcache huggingface-cli download $model --revision $(cat models/$model/revision.txt)
|
7 |
+
done
|
benchmark/common/start_nvml_container.sh
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
#!/usr/bin/env bash
|
2 |
+
|
3 |
+
docker run -dit --gpus all --cap-add SYS_ADMIN --name nvml nvidia/cuda:12.3.1-base-ubuntu22.04 bash
|
benchmark/diffusion/image-to-video/.dockerignore
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
README.md
|
benchmark/diffusion/image-to-video/Dockerfile
ADDED
@@ -0,0 +1,20 @@
|
1 |
+
FROM nvidia/cuda:12.1.0-base-ubuntu22.04
|
2 |
+
|
3 |
+
# Basic installs
|
4 |
+
ARG DEBIAN_FRONTEND=noninteractive
|
5 |
+
ENV TZ='America/Detroit'
|
6 |
+
RUN apt-get update -qq \
|
7 |
+
&& apt-get -y --no-install-recommends install python3-pip \
|
8 |
+
&& apt-get clean all \
|
9 |
+
&& rm -r /var/lib/apt/lists/*
|
10 |
+
|
11 |
+
# HuggingFace cache dir
|
12 |
+
ENV HF_HOME=/root/.cache/huggingface
|
13 |
+
|
14 |
+
# Copy over benchmark suite and install dependencies
|
15 |
+
ADD . /workspace/image-to-video
|
16 |
+
WORKDIR /workspace/image-to-video
|
17 |
+
RUN pip install -r requirements.txt
|
18 |
+
|
19 |
+
# Benchmark script to run
|
20 |
+
ENTRYPOINT ["python3", "scripts/benchmark_one_datapoint.py"]
|
benchmark/diffusion/image-to-video/README.md
ADDED
@@ -0,0 +1,51 @@
|
1 |
+
# Diffusion model (Image to Video)
|
2 |
+
|
3 |
+
This suite benchmarks diffusion models on the image-to-video task.
|
4 |
+
|
5 |
+
## Setup
|
6 |
+
|
7 |
+
### Docker images
|
8 |
+
|
9 |
+
```sh
|
10 |
+
docker build -t mlenergy/leaderboard:diffusion-i2v .
|
11 |
+
```
|
12 |
+
|
13 |
+
### HuggingFace cache directory
|
14 |
+
|
15 |
+
The scripts assume the HuggingFace cache directory will be under `/data/leaderboard/hfcache` on the node that runs this benchmark.
|
16 |
+
|
17 |
+
## Benchmarking
|
18 |
+
|
19 |
+
### Obtaining one datapoint
|
20 |
+
|
21 |
+
The Docker image built above runs `python3 scripts/benchmark_one_datapoint.py` as its `ENTRYPOINT`.
|
22 |
+
|
23 |
+
```sh
|
24 |
+
docker run \
|
25 |
+
--gpus '"device=0"' \
|
26 |
+
--cap-add SYS_ADMIN \
|
27 |
+
-v /data/leaderboard/hfcache:/root/.cache/huggingface \
|
28 |
+
-v $(pwd):/workspace/image-to-video \
|
29 |
+
mlenergy/leaderboard:diffusion-i2v \
|
30 |
+
--result-root results \
|
31 |
+
--batch-size 2 \
|
32 |
+
--power-limit 300 \
|
33 |
+
--save-every 5 \
|
34 |
+
--model ali-vilab/i2vgen-xl \
|
35 |
+
--dataset-path sharegpt4video/sharegpt4video_100.json \
|
36 |
+
--add-text-prompt \
|
37 |
+
--num-frames 16 \
|
38 |
+
--fps 16 \
|
39 |
+
--huggingface-token $HF_TOKEN
|
40 |
+
```
|
41 |
+
|
42 |
+
### Obtaining all datapoints for a single model
|
43 |
+
|
44 |
+
Export your HuggingFace Hub token as the environment variable `HF_TOKEN`.
|
45 |
+
|
46 |
+
Run `scripts/benchmark_one_model.py`.
|
47 |
+
|
48 |
+
### Running the entire suite with Pegasus
|
49 |
+
|
50 |
+
You can use [`pegasus`](https://github.com/jaywonchung/pegasus) to run the entire benchmark suite.
|
51 |
+
Queue and host files are in [`./pegasus`](./pegasus).
|
benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/kwargs.json
ADDED
@@ -0,0 +1,4 @@
|
1 |
+
{
|
2 |
+
"torch_dtype": "torch.float16",
|
3 |
+
"variant": "fp16"
|
4 |
+
}
|
benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/revision.txt
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
39e1979ea27be737b0278c06755e321f2b4360d5
|
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/kwargs.json
ADDED
@@ -0,0 +1,4 @@
|
1 |
+
{
|
2 |
+
"torch_dtype": "torch.float16",
|
3 |
+
"variant": "fp16"
|
4 |
+
}
|
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/revision.txt
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
9e43909513c6714f1bc78bcb44d96e733cd242aa
|
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/kwargs.json
ADDED
@@ -0,0 +1,4 @@
|
1 |
+
{
|
2 |
+
"torch_dtype": "torch.float16",
|
3 |
+
"variant": "fp16"
|
4 |
+
}
|
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/revision.txt
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
9cf024d5bfa8f56622af86c884f26a52f6676f2e
|
benchmark/diffusion/image-to-video/pegasus/A100/hosts_1gpu.yaml
ADDED
@@ -0,0 +1,11 @@
|
1 |
+
- hostname:
|
2 |
+
- localhost
|
3 |
+
gpu:
|
4 |
+
- 0
|
5 |
+
- 1
|
6 |
+
- 2
|
7 |
+
- 3
|
8 |
+
- 4
|
9 |
+
- 5
|
10 |
+
- 6
|
11 |
+
- 7
|
benchmark/diffusion/image-to-video/pegasus/A100/queue_1gpu.yaml
ADDED
@@ -0,0 +1,6 @@
|
1 |
+
- command:
|
2 |
+
- "python scripts/benchmark_one_model.py {{ model }} --result-root results/joule --dataset-path sharegpt4video/sharegpt4video_100.json --gpu-ids {{ gpu }} --batch-sizes 8 4 2 1 --power-limits 400 --num-inference-steps 25"
|
3 |
+
model:
|
4 |
+
- '--model ali-vilab/i2vgen-xl --num-frames 16 --add-text-prompt'
|
5 |
+
- '--model stabilityai/stable-video-diffusion-img2vid --num-frames 14'
|
6 |
+
- '--model stabilityai/stable-video-diffusion-img2vid-xt --num-frames 25'
|
benchmark/diffusion/image-to-video/pegasus/H100/hosts_1gpu.yaml
ADDED
@@ -0,0 +1,11 @@
|
1 |
+
- hostname:
|
2 |
+
- localhost
|
3 |
+
gpu:
|
4 |
+
- 0
|
5 |
+
- 1
|
6 |
+
- 2
|
7 |
+
- 3
|
8 |
+
- 4
|
9 |
+
- 5
|
10 |
+
- 6
|
11 |
+
- 7
|
benchmark/diffusion/image-to-video/pegasus/H100/queue_1gpu.yaml
ADDED
@@ -0,0 +1,6 @@
|
1 |
+
- command:
|
2 |
+
- "python scripts/benchmark_one_model.py {{ model }} --result-root results/joule --dataset-path sharegpt4video/sharegpt4video_700.json --gpu-ids {{ gpu }} --batch-sizes 64 32 16 8 4 2 1 --power-limits 700 --num-inference-steps 25"
|
3 |
+
model:
|
4 |
+
- '--model ali-vilab/i2vgen-xl --num-frames 16 --add-text-prompt'
|
5 |
+
- '--model stabilityai/stable-video-diffusion-img2vid --num-frames 14'
|
6 |
+
- '--model stabilityai/stable-video-diffusion-img2vid-xt --num-frames 25'
|
benchmark/diffusion/image-to-video/requirements.txt
ADDED
@@ -0,0 +1,7 @@
|
1 |
+
torch
|
2 |
+
diffusers==0.29.2
|
3 |
+
accelerate
|
4 |
+
transformers
|
5 |
+
pillow
|
6 |
+
nvidia-ml-py
|
7 |
+
zeus-ml
|
benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_data.py
ADDED
@@ -0,0 +1,38 @@
|
1 |
+
import json
|
2 |
+
from glob import glob
|
3 |
+
from pathlib import Path
|
4 |
+
|
5 |
+
import tyro
|
6 |
+
|
7 |
+
|
8 |
+
FIELDS = {
|
9 |
+
"model": "Model",
|
10 |
+
"gpu_model": "GPU",
|
11 |
+
"energy_per_video": "Energy/video (J)",
|
12 |
+
"average_batch_latency": "Batch latency (s)",
|
13 |
+
"batch_size": "Batch size",
|
14 |
+
"num_inference_steps": "Denoising steps",
|
15 |
+
"num_frames": "Frames",
|
16 |
+
}
|
17 |
+
|
18 |
+
def main(results_dir: Path, output_dir: Path) -> None:
|
19 |
+
print(f"{results_dir} -> {output_dir}")
|
20 |
+
|
21 |
+
for model_dir in sorted(glob(f"{results_dir}/*/*")):
|
22 |
+
model_name = "/".join(model_dir.split("/")[-2:])
|
23 |
+
print(f" {model_name}")
|
24 |
+
(output_dir / model_name).mkdir(parents=True, exist_ok=True)
|
25 |
+
for file in sorted(glob(f"{model_dir}/bs*+results.json")):
|
26 |
+
raw_data = json.load(open(file))
|
27 |
+
raw_data["energy_per_video"] = raw_data["average_batch_energy"] / raw_data["batch_size"]
|
28 |
+
|
29 |
+
data = {}
|
30 |
+
for field1, field2 in FIELDS.items():
|
31 |
+
data[field2] = raw_data.pop(field1)
|
32 |
+
|
33 |
+
filename = f"bs{data['Batch size']}+steps{data['Denoising steps']}+frames{data['Frames']}.json"
|
34 |
+
json.dump(data, open(output_dir / model_name / filename, "w"), indent=2)
|
35 |
+
|
36 |
+
|
37 |
+
if __name__ == "__main__":
|
38 |
+
tyro.cli(main)
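As a minimal sketch of what the aggregation script above does to a single results file, the snippet below derives energy per video and renames the raw keys to leaderboard column names. The `FIELDS` mapping is copied from the script; `to_leaderboard_row` is a hypothetical helper name, and all numbers in `example` are made up for illustration rather than measured.

```python
FIELDS = {
    "model": "Model",
    "gpu_model": "GPU",
    "energy_per_video": "Energy/video (J)",
    "average_batch_latency": "Batch latency (s)",
    "batch_size": "Batch size",
    "num_inference_steps": "Denoising steps",
    "num_frames": "Frames",
}


def to_leaderboard_row(raw: dict) -> dict:
    """Convert one raw results dict into a leaderboard row."""
    raw = dict(raw)  # Work on a copy so the caller's dict is not modified.
    # Energy per video is the average batch energy split evenly across the batch.
    raw["energy_per_video"] = raw["average_batch_energy"] / raw["batch_size"]
    # Keep only the fields the leaderboard shows, under their display names.
    return {display: raw[key] for key, display in FIELDS.items()}


# Illustrative values only (not benchmark results).
example = {
    "model": "ali-vilab/i2vgen-xl",
    "gpu_model": "NVIDIA A100-SXM4-40GB",
    "average_batch_energy": 8000.0,
    "average_batch_latency": 20.0,
    "batch_size": 4,
    "num_inference_steps": 25,
    "num_frames": 16,
}
row = to_leaderboard_row(example)
print(row["Energy/video (J)"])
```

Dividing by batch size attributes the same energy to every video in a batch, which is an averaging convention rather than a per-sample measurement.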
|
benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_models.py
ADDED
@@ -0,0 +1,36 @@
|
1 |
+
import json
|
2 |
+
from glob import glob
|
3 |
+
from pathlib import Path
|
4 |
+
|
5 |
+
import tyro
|
6 |
+
|
7 |
+
def raw_params_to_readable(params: int) -> str:
|
8 |
+
return f"{params/1e9:.1f}B"
|
9 |
+
|
10 |
+
def main(results_dir: Path, output_file: Path) -> None:
|
11 |
+
output_file.parent.mkdir(parents=True, exist_ok=True)
|
12 |
+
print(f"{results_dir} -> {output_file}")
|
13 |
+
|
14 |
+
models = {}
|
15 |
+
for model_dir in sorted(glob(f"{results_dir}/*/*")):
|
16 |
+
model_name = "/".join(model_dir.split("/")[-2:])
|
17 |
+
print(f" {model_name}")
|
18 |
+
result_file_cand = glob(f"{model_dir}/bs1+*+results.json")
|
19 |
+
assert len(result_file_cand) == 1, model_name
|
20 |
+
results_data = json.load(open(result_file_cand[0]))
|
21 |
+
denoising_module_name = "unet" if "unet" in results_data["num_parameters"] else "transformer"
|
22 |
+
model_info = dict(
|
23 |
+
url=f"https://huggingface.co/{model_name}",
|
24 |
+
nickname=model_name.split("/")[-1].replace("-", " ").title(),
|
25 |
+
total_params=raw_params_to_readable(sum(results_data["num_parameters"].values())),
|
26 |
+
denoising_params=raw_params_to_readable(results_data["num_parameters"][denoising_module_name]),
|
27 |
+
resolution="NA",
|
28 |
+
)
|
29 |
+
assert model_name not in models
|
30 |
+
models[model_name] = model_info
|
31 |
+
|
32 |
+
json.dump(models, open(output_file, "w"), indent=2)
|
33 |
+
|
34 |
+
|
35 |
+
if __name__ == "__main__":
|
36 |
+
tyro.cli(main)
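The following sketch shows what the two string transformations in the script above produce. `raw_params_to_readable` is copied from the script; `nickname` is a hypothetical wrapper around the inline expression used for the `nickname` field, and the parameter count is an arbitrary example value.

```python
def raw_params_to_readable(params: int) -> str:
    """Format a raw parameter count in billions with one decimal place."""
    return f"{params / 1e9:.1f}B"


def nickname(model_name: str) -> str:
    """Derive a display name from a Hub ID such as 'org/model-name'."""
    return model_name.split("/")[-1].replace("-", " ").title()


print(raw_params_to_readable(1_500_000_000))
print(nickname("stabilityai/stable-video-diffusion-img2vid-xt"))
print(nickname("ali-vilab/i2vgen-xl"))
```

Note that `str.title()` capitalizes any letter that follows a non-letter, including digits, so `i2vgen-xl` becomes `I2Vgen Xl` rather than `I2VGen-XL`. A hand-maintained nickname table would be needed if exact branding matters.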
|
benchmark/diffusion/image-to-video/scripts/benchmark_one_datapoint.py
ADDED
@@ -0,0 +1,300 @@
|
1 |
+
from __future__ import annotations
|
2 |
+
|
3 |
+
import json
|
4 |
+
import inspect
|
5 |
+
import argparse
|
6 |
+
from pprint import pprint
|
7 |
+
from pathlib import Path
|
8 |
+
from contextlib import suppress
|
9 |
+
from dataclasses import dataclass, field, asdict
|
10 |
+
from typing import Any
|
11 |
+
|
12 |
+
import torch
|
13 |
+
import pynvml
|
14 |
+
import numpy as np
|
15 |
+
from PIL import Image
|
16 |
+
from transformers.trainer_utils import set_seed
|
17 |
+
from diffusers import ModelMixin, DiffusionPipeline # type: ignore
|
18 |
+
from diffusers.utils import load_image, export_to_gif # pyright: reportPrivateImportUsage=false
|
19 |
+
from zeus.monitor import ZeusMonitor
|
20 |
+
|
21 |
+
# Disable torch gradients globally
|
22 |
+
torch.set_grad_enabled(False)
|
23 |
+
|
24 |
+
|
25 |
+
@dataclass
|
26 |
+
class Results:
|
27 |
+
model: str
|
28 |
+
num_parameters: dict[str, int]
|
29 |
+
gpu_model: str
|
30 |
+
num_infernece_steps: int
|
31 |
+
num_frames: int
|
32 |
+
power_limit: int
|
33 |
+
batch_size: int
|
34 |
+
num_prompts: int
|
35 |
+
total_runtime: float = 0.0
|
36 |
+
total_energy: float = 0.0
|
37 |
+
average_batch_latency: float = 0.0
|
38 |
+
average_images_per_second: float = 0.0
|
39 |
+
average_batch_energy: float = 0.0
|
40 |
+
average_power_consumption: float = 0.0
|
41 |
+
peak_memory: float = 0.0
|
42 |
+
results: list[Result] = field(default_factory=list, repr=False)
|
43 |
+
|
44 |
+
|
45 |
+
@dataclass
|
46 |
+
class ResultIntermediateBatched:
|
47 |
+
prompts: list[str]
|
48 |
+
images: list[Image.Image]
|
49 |
+
batch_latency: float = 0.0
|
50 |
+
batch_energy: float = 0.0
|
51 |
+
frames: np.ndarray | list[list[Image.Image]] = np.empty(0)
|
52 |
+
|
53 |
+
|
54 |
+
@dataclass
|
55 |
+
class Result:
|
56 |
+
batch_latency: float
|
57 |
+
sample_energy: float
|
58 |
+
prompt: str
|
59 |
+
video_path: str | None
|
60 |
+
|
61 |
+
|
62 |
+
def get_pipeline(model_id: str):
|
63 |
+
"""Instantiate a Diffusers pipeline from a model's HuggingFace Hub ID."""
|
64 |
+
# Load args to give to `from_pretrained` from the model's kwargs.json file
|
65 |
+
kwargs = json.load(open(f"models/{model_id}/kwargs.json"))
|
66 |
+
with suppress(KeyError):
|
67 |
+
kwargs["torch_dtype"] = eval(kwargs["torch_dtype"])
|
68 |
+
|
69 |
+
# Add additional args
|
70 |
+
kwargs["safety_checker"] = None
|
71 |
+
kwargs["revision"] = open(f"models/{model_id}/revision.txt").read().strip()
|
72 |
+
|
73 |
+
pipeline = DiffusionPipeline.from_pretrained(model_id, **kwargs).to("cuda:0")
|
74 |
+
print("\nInstantiated pipeline via DiffusionPipeline:\n", pipeline)
|
75 |
+
|
76 |
+
return pipeline
|
77 |
+
|
78 |
+
|
79 |
+
def load_text_image_prompts(
|
80 |
+
path: str,
|
81 |
+
batch_size: int,
|
82 |
+
num_batches: int | None = None,
|
83 |
+
) -> tuple[int, list[tuple[list[str], list[Image.Image]]]]:
|
84 |
+
"""Load the dataset to feed the model and return it as a list of batches of prompts.
|
85 |
+
|
86 |
+
Depending on the batch size, the final batch may not be full. The final batch
|
87 |
+
is dropped in that case. If `num_batches` is not None, only that many batches
|
88 |
+
are returned. If `num_batches` is None, all batches are returned.
|
89 |
+
|
90 |
+
Returns:
|
91 |
+
Total number of prompts and a list of batches of prompts.
|
92 |
+
"""
|
93 |
+
dataset = json.load(open(path))
|
94 |
+
assert len(dataset["caption"]) == len(dataset["video_id"])
|
95 |
+
|
96 |
+
if num_batches is not None:
|
97 |
+
if len(dataset["caption"]) < num_batches * batch_size:
|
98 |
+
raise ValueError("Not enough data for the requested number of batches.")
|
99 |
+
dataset["caption"] = dataset["caption"][: num_batches * batch_size]
|
100 |
+
dataset["video_id"] = dataset["video_id"][: num_batches * batch_size]
|
101 |
+
|
102 |
+
image_path = Path(path).parent / "first_frame"
|
103 |
+
dataset["first_frame"] = [
|
104 |
+
load_image(str(image_path / f"{video_id}.jpg")) for video_id in dataset["video_id"]
|
105 |
+
]
|
106 |
+
|
107 |
+
batched = [
|
108 |
+
(dataset["caption"][i : i + batch_size], dataset["first_frame"][i : i + batch_size])
|
109 |
+
for i in range(0, len(dataset["caption"]), batch_size)
|
110 |
+
]
|
111 |
+
if len(batched[-1][0]) < batch_size:
|
112 |
+
batched.pop()
|
113 |
+
|
114 |
+
return len(batched) * batch_size, batched
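A minimal sketch of the batching and drop-last behavior that the docstring of `load_text_image_prompts` describes, using plain strings as stand-ins for captions and images. `batch_drop_last` is a hypothetical name and the inputs are illustrative, so this is not the script's actual loader.

```python
def batch_drop_last(captions: list, images: list, batch_size: int) -> tuple[int, list]:
    """Pair captions with images in fixed-size batches, dropping a partial final batch."""
    assert len(captions) == len(images)
    batches = [
        (captions[i : i + batch_size], images[i : i + batch_size])
        for i in range(0, len(captions), batch_size)
    ]
    # Each batch is a (captions, images) tuple, so the batch's real size is the
    # length of one of its members, not the length of the tuple itself.
    if batches and len(batches[-1][0]) < batch_size:
        batches.pop()
    return len(batches) * batch_size, batches


num_prompts, batches = batch_drop_last(list("abcde"), list("vwxyz"), batch_size=2)
print(num_prompts, len(batches))
```

Dropping the partial batch keeps every measured batch the same size, so per-batch latency and energy stay comparable across the run.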
|
115 |
+
|
116 |
+
|
117 |
+
def count_parameters(pipeline) -> dict[str, int]:
|
118 |
+
"""Count the number of parameters in the given pipeline."""
|
119 |
+
num_params = {}
|
120 |
+
for name, attr in vars(pipeline).items():
|
121 |
+
if isinstance(attr, ModelMixin):
|
122 |
+
num_params[name] = attr.num_parameters(only_trainable=False, exclude_embeddings=True)
|
123 |
+
elif isinstance(attr, torch.nn.Module):
|
124 |
+
num_params[name] = sum(p.numel() for p in attr.parameters())
|
125 |
+
return num_params
|
126 |
+
|
127 |
+
|
128 |
+
def benchmark(args: argparse.Namespace) -> None:
|
129 |
+
if args.model.startswith("models/"):
|
130 |
+
args.model = args.model[len("models/") :]
|
131 |
+
if args.model.endswith("/"):
|
132 |
+
args.model = args.model[:-1]
|
133 |
+
|
134 |
+
set_seed(args.seed)
|
135 |
+
|
136 |
+
results_dir = Path(args.result_root) / args.model
|
137 |
+
results_dir.mkdir(parents=True, exist_ok=True)
|
138 |
+
benchmark_name = str(results_dir / f"bs{args.batch_size}+pl{args.power_limit}")
|
139 |
+
video_dir = results_dir / f"bs{args.batch_size}+pl{args.power_limit}+generated"
|
140 |
+
video_dir.mkdir(exist_ok=True)
|
141 |
+
|
142 |
+
arg_out_filename = f"{benchmark_name}+args.json"
|
143 |
+
with open(arg_out_filename, "w") as f:
|
144 |
+
f.write(json.dumps(vars(args), indent=2))
|
145 |
+
print(args)
|
146 |
+
print("Benchmark args written to", arg_out_filename)
|
147 |
+
|
148 |
+
zeus_monitor = ZeusMonitor()
|
149 |
+
|
150 |
+
pynvml.nvmlInit()
|
151 |
+
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
|
152 |
+
gpu_model = pynvml.nvmlDeviceGetName(handle)
|
153 |
+
pynvml.nvmlDeviceSetPersistenceMode(handle, pynvml.NVML_FEATURE_ENABLED)
|
154 |
+
pynvml.nvmlDeviceSetPowerManagementLimit(handle, args.power_limit * 1000)
|
155 |
+
pynvml.nvmlShutdown()
|
156 |
+
|
157 |
+
num_prompts, batched_prompts = load_text_image_prompts(args.dataset_path, args.batch_size, args.num_batches)
|
158 |
+
|
159 |
+
pipeline = get_pipeline(args.model)
|
160 |
+
|
161 |
+
# Warmup
|
162 |
+
print("Warming up with two batches...")
|
163 |
+
for i in range(2):
|
164 |
+
params: dict[str, Any] = dict(
|
165 |
+
image=batched_prompts[i][1],
|
166 |
+
num_frames=args.num_frames,
|
167 |
+
num_inference_steps=args.num_inference_steps,
|
168 |
+
)
|
169 |
+
if args.add_text_prompt:
|
170 |
+
params["prompt"] = batched_prompts[i][0]
|
171 |
+
|
172 |
+
_ = pipeline(**params)
|
173 |
+
|
174 |
+
rng = torch.manual_seed(args.seed)
|
175 |
+
|
176 |
+
# Some models require a text prompt alongside the image (e.g., I2VGen-XL).
|
177 |
+
# For models that do not, `prompts` will not be passed to the model.
|
178 |
+
intermediates: list[ResultIntermediateBatched] = [
|
179 |
+
ResultIntermediateBatched(prompts=text, images=image) for text, image in batched_prompts
|
180 |
+
]
|
181 |
+
|
182 |
+
# Different pipelines use different names for the FPS parameter
|
183 |
+
gen_signature = inspect.signature(pipeline.__call__)
|
184 |
+
fps_param_name_candidates = list(filter(lambda x: "fps" in x, gen_signature.parameters))
|
185 |
+
if not fps_param_name_candidates:
|
186 |
+
raise ValueError("No parameter with 'fps' in its name found in the pipeline's signature.")
|
187 |
+
if len(fps_param_name_candidates) > 1:
|
188 |
+
raise ValueError("Multiple parameters with 'fps' in their name found in the pipeline's signature.")
|
189 |
+
fps_param_name = fps_param_name_candidates[0]
|
190 |
+
|
191 |
+
torch.cuda.reset_peak_memory_stats(device="cuda:0")
|
192 |
+
zeus_monitor.begin_window("benchmark", sync_cuda=False)
|
193 |
+
|
194 |
+
# Build common parameter dict for all batches
|
195 |
+
params: dict[str, Any] = dict(
|
196 |
+
num_frames=args.num_frames,
|
197 |
+
num_inference_steps=args.num_inference_steps,
|
198 |
+
generator=rng,
|
199 |
+
)
|
200 |
+
params[fps_param_name] = args.fps
|
201 |
+
if args.height is not None:
|
202 |
+
params["height"] = args.height
|
203 |
+
if args.width is not None:
|
204 |
+
params["width"] = args.width
|
205 |
+
|
206 |
+
for ind, intermediate in enumerate(intermediates):
|
207 |
+
print(f"Batch {ind + 1}/{len(intermediates)}")
|
208 |
+
|
209 |
+
params["image"] = intermediate.images
|
210 |
+
if args.add_text_prompt:
|
211 |
+
params["prompt"] = intermediate.prompts
|
212 |
+
|
213 |
+
zeus_monitor.begin_window("batch", sync_cuda=False)
|
214 |
+
frames = pipeline(**params).frames
|
215 |
+
batch_measurements = zeus_monitor.end_window("batch", sync_cuda=False)
|
216 |
+
|
217 |
+
intermediate.frames = frames
|
218 |
+
intermediate.batch_latency = batch_measurements.time
|
219 |
+
intermediate.batch_energy = batch_measurements.total_energy
|
220 |
+
|
221 |
+
measurements = zeus_monitor.end_window("benchmark", sync_cuda=False)
|
222 |
+
peak_memory = torch.cuda.max_memory_allocated(device="cuda:0")
|
223 |
+
|
224 |
+
results: list[Result] = []
|
225 |
+
ind = 0
|
226 |
+
for intermediate in intermediates:
|
227 |
+
# Some pipelines just return a giant numpy array for all frames.
|
228 |
+
# In that case, scale frames to uint8 [0, 256] and convert to PIL.Image
|
229 |
+
if isinstance(intermediate.frames, np.ndarray):
|
230 |
+
frames = []
|
231 |
+
for video in intermediate.frames:
|
232 |
+
frames.append(
|
233 |
+
[Image.fromarray((frame * 255).astype(np.uint8)) for frame in video]
|
234 |
+
)
|
235 |
+
intermediate.frames = frames
|
236 |
+
|
237 |
+
for frames, prompt in zip(intermediate.frames, intermediate.prompts, strict=True):
|
238 |
+
if ind % args.save_every == 0:
|
239 |
+
video_path = str(video_dir / f"{prompt[:200]}.gif")
|
240 |
+
export_to_gif(frames, video_path, fps=args.fps)
|
241 |
+
else:
|
242 |
+
video_path = None
|
243 |
+
|
244 |
+
results.append(
|
245 |
+
Result(
|
246 |
+
batch_latency=intermediate.batch_latency,
|
247 |
+
sample_energy=intermediate.batch_energy / len(intermediate.prompts),
|
248 |
+
prompt=prompt,
|
249 |
+
video_path=video_path,
|
250 |
+
)
|
251 |
+
)
|
252 |
+
ind += 1
|
253 |
+
|
254 |
+
final_results = Results(
|
255 |
+
model=args.model,
|
256 |
+
num_parameters=count_parameters(pipeline),
|
257 |
+
gpu_model=gpu_model,
|
258 |
+
num_infernece_steps=args.num_inference_steps,
|
259 |
+
num_frames=args.num_frames,
|
260 |
+
power_limit=args.power_limit,
|
261 |
+
batch_size=args.batch_size,
|
262 |
+
num_prompts=num_prompts,
|
263 |
+
total_runtime=measurements.time,
|
264 |
+
total_energy=measurements.total_energy,
|
265 |
+
average_batch_latency=measurements.time / len(batched_prompts),
|
266 |
+
average_images_per_second=num_prompts / measurements.time,
|
267 |
+
average_batch_energy=measurements.total_energy / len(batched_prompts),
|
268 |
+
average_power_consumption=measurements.total_energy / measurements.time,
|
269 |
+
peak_memory=peak_memory,
|
270 |
+
results=results,
|
271 |
+
)
|
272 |
+
|
273 |
+
with open(f"{benchmark_name}+results.json", "w") as f:
|
274 |
+
f.write(json.dumps(asdict(final_results), indent=2))
|
275 |
+
print("Benchmark results written to", f"{benchmark_name}+results.json")
|
276 |
+
|
277 |
+
print("Benchmark results:")
|
278 |
+
pprint(final_results)
|
279 |
+
|
280 |
+
|
281 |
+
if __name__ == "__main__":
|
282 |
+
parser = argparse.ArgumentParser()
|
283 |
+
parser.add_argument("--model", type=str, required=True, help="The model to benchmark.")
|
284 |
+
parser.add_argument("--dataset-path", type=str, required=True, help="Path to the dataset to use.")
|
285 |
+
parser.add_argument("--add-text-prompt", action="store_true", help="Input text prompt alongside image.")
|
286 |
+
parser.add_argument("--result-root", type=str, help="The root directory to save results to.")
|
287 |
+
parser.add_argument("--batch-size", type=int, default=1, help="The size of each batch of prompts.")
|
288 |
+
parser.add_argument("--power-limit", type=int, default=300, help="The power limit to set for the GPU in Watts.")
|
289 |
+
parser.add_argument("--num-inference-steps", type=int, default=50, help="The number of denoising steps.")
|
290 |
+
parser.add_argument("--num-frames", type=int, default=1, help="The number of frames to generate.")
|
291 |
+
parser.add_argument("--fps", type=int, default=16, help="Frames per second for micro-conditioning.")
|
292 |
+
parser.add_argument("--height", type=int, help="Height of the generated video.")
|
293 |
+
parser.add_argument("--width", type=int, help="Width of the generated video.")
|
294 |
+
parser.add_argument("--num-batches", type=int, default=None, help="The number of batches to use from the dataset.")
|
295 |
+
parser.add_argument("--save-every", type=int, default=10, help="Save generations to file every N prompts.")
|
296 |
+
parser.add_argument("--seed", type=int, default=0, help="The seed to use for the RNG.")
|
297 |
+
parser.add_argument("--huggingface-token", type=str, help="The HuggingFace token to use.")
|
298 |
+
args = parser.parse_args()
|
299 |
+
|
300 |
+
benchmark(args)
|
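The FPS-parameter lookup in the script above can be tried in isolation. The following is a minimal sketch, not part of the commit: `find_fps_param_name`, `pipeline_a`, and `pipeline_b` are hypothetical names, and the two stand-in functions only imitate pipelines whose `__call__` signatures name the FPS argument differently.

```python
import inspect


def find_fps_param_name(fn) -> str:
    """Return the single parameter of `fn` whose name contains 'fps'."""
    candidates = [name for name in inspect.signature(fn).parameters if "fps" in name]
    if not candidates:
        raise ValueError("No parameter with 'fps' in its name.")
    if len(candidates) > 1:
        raise ValueError(f"Ambiguous FPS parameters: {candidates}")
    return candidates[0]


# Hypothetical stand-ins for two pipelines' __call__ methods.
def pipeline_a(image, num_frames=16, fps=7): ...
def pipeline_b(image, prompt, num_frames=16, target_fps=16): ...


# Build a shared parameter dict and set the FPS under whichever name applies.
params = {"num_frames": 8}
params[find_fps_param_name(pipeline_b)] = 16
```

Raising on zero or multiple matches, as the script does, keeps a silently wrong FPS setting from skewing the measurements.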
benchmark/diffusion/image-to-video/scripts/benchmark_one_model.py
ADDED
@@ -0,0 +1,84 @@
+from __future__ import annotations
+
+import os
+import argparse
+import subprocess
+
+
+def print_and_write(outfile, line: str, flush: bool = False):
+    print(line, end="", flush=flush)
+    outfile.write(line)
+    if flush:
+        outfile.flush()
+
+
+def main(args: argparse.Namespace) -> None:
+    assert len(args.gpu_ids) == 1
+
+    hf_token = os.environ["HF_TOKEN"]
+
+    if args.model.startswith("models/"):
+        outdir = f"{args.result_root}/{args.model[len('models/'):]}"
+    else:
+        outdir = f"{args.result_root}/{args.model}"
+    os.makedirs(outdir, exist_ok=True)
+
+    outfile = open(f"{outdir}/gpus{''.join(args.gpu_ids)}.out.txt", "w")
+
+    print_and_write(outfile, f"Benchmarking {args.model}\n")
+    print_and_write(outfile, f"Batch sizes: {args.batch_sizes}\n")
+    print_and_write(outfile, f"Power limits: {args.power_limits}\n")
+
+    for batch_size in args.batch_sizes:
+        for power_limit in args.power_limits:
+            print_and_write(outfile, f"{batch_size=}, {power_limit=}\n", flush=True)
+            with subprocess.Popen(
+                args=[
+                    "docker", "run",
+                    "--gpus", '"device=' + ','.join(args.gpu_ids) + '"',
+                    "--cap-add", "SYS_ADMIN",
+                    "--name", f"leaderboard-i2v-{''.join(args.gpu_ids)}",
+                    "--rm",
+                    "-v", "/data/leaderboard/hfcache:/root/.cache/huggingface",
+                    "-v", f"{os.getcwd()}:/workspace/image-to-video",
+                    "mlenergy/leaderboard:diffusion-i2v",
+                    "--dataset-path", args.dataset_path,
+                    "--result-root", args.result_root,
+                    "--batch-size", batch_size,
+                    "--num-batches", "10",
+                    "--power-limit", power_limit,
+                    "--model", args.model,
+                    "--huggingface-token", hf_token,
+                    "--num-frames", args.num_frames,
+                    "--num-inference-steps", args.num_inference_steps,
+                ] + (["--add-text-prompt"] if args.add_text_prompt else []),
+                stdout=subprocess.PIPE,
+                stderr=subprocess.STDOUT,
+                text=True,
+            ) as proc:
+                if proc.stdout:
+                    i = 0
+                    for line in proc.stdout:
+                        print_and_write(outfile, line, flush=i % 50 == 0)
+                        i += 1
+
+            # If proc exited with non-zero status, it's probably an OOM.
+            # Move on to the next batch size.
+            if proc.returncode != 0:
+                break
+
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", type=str, help="ID of the model to benchmark")
+    parser.add_argument("--result-root", type=str, help="Root directory to store the results")
+    parser.add_argument("--gpu-ids", type=str, nargs="+", help="GPU IDs to use")
+    parser.add_argument("--batch-sizes", type=str, nargs="+", default=["8", "4", "2", "1"], help="Batch sizes to benchmark")
+    parser.add_argument("--power-limits", type=str, nargs="+", default=["400", "300", "200"], help="Power limits to benchmark")
+    parser.add_argument("--num-frames", type=str, help="Number of frames to generate")
+    parser.add_argument("--num-inference-steps", type=str, help="Number of denoising steps")
+    parser.add_argument("--add-text-prompt", action="store_true", help="Input text prompt alongside image.")
+    parser.add_argument("--dataset-path", type=str, help="Path to the dataset JSON file.")
+    args = parser.parse_args()
+    main(args)
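The control flow of the sweep above (stream the child's output, then skip the remaining power limits for a batch size once a run fails) can be sketched without Docker. This is an illustrative sketch under stated assumptions: `run_and_stream`, `sweep`, and `fake_cmd` are hypothetical names, and `fake_cmd` stands in for the real `docker run` command by failing whenever the batch size is 8.

```python
import subprocess
import sys


def run_and_stream(cmd: list[str]) -> tuple[int, list[str]]:
    """Run `cmd`, collect merged stdout/stderr line by line, return (exit code, lines)."""
    lines: list[str] = []
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            lines.append(line)
    # Leaving the `with` block waits for the process, so `returncode` is set here.
    return proc.returncode, lines


def sweep(batch_sizes, power_limits, make_cmd):
    """Try each (batch size, power limit); after a failure, go to the next batch size."""
    completed = []
    for batch_size in batch_sizes:
        for power_limit in power_limits:
            code, _ = run_and_stream(make_cmd(batch_size, power_limit))
            if code != 0:
                break  # e.g., out of memory: larger batches are not retried
            completed.append((batch_size, power_limit))
    return completed


# Hypothetical stand-in for the Docker command: exits with status 1 for batch size 8.
def fake_cmd(batch_size: str, power_limit: str) -> list[str]:
    script = (
        f"import sys; print('bs={batch_size} pl={power_limit}'); "
        f"sys.exit(1 if {batch_size} == 8 else 0)"
    )
    return [sys.executable, "-c", script]


done = sweep(["8", "4"], ["400", "300"], fake_cmd)
```

Iterating batch sizes from largest to smallest, as the script's defaults do, means an out-of-memory failure costs only one run before the sweep falls back to a smaller batch.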
benchmark/diffusion/image-to-video/sharegpt4video/.gitignore
ADDED
@@ -0,0 +1 @@
+first_frame/
benchmark/diffusion/image-to-video/sharegpt4video/README.md
ADDED
@@ -0,0 +1,32 @@
+# ShareGPT4Video dataset
+
+For the image-to-video task, we sample 100 video-caption pairs from the ShareGPT4Video dataset and feed them to the diffusion model to generate videos.
+
+## Filtering the dataset
+
+Download the dataset with captions and video paths.
+
+```sh
+wget https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video/resolve/main/sharegpt4video_40k.jsonl
+```
+
+Sample video-caption pairs.
+You can adjust the `NUM_SAMPLES` variable in the script to change the size of the generated dataset. By default, 100 pairs will be sampled and saved as `sharegpt4video_100.json`.
+
+```sh
+python sample.py
+```
+
+Download and unzip the chunk of videos.
+
+```sh
+wget https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video/resolve/main/zip_folder/panda/panda_videos_1.zip
+unzip panda_videos_1.zip -d panda
+```
+
+Extract the first frame of each video and save it under `first_frame/`.
+
+```sh
+pip install opencv-python
+python extract_first_frame.py
+```
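Once the steps in this README have been run, the sampled JSON (a `caption` list and a parallel `video_id` list) and the images under `first_frame/` can be paired up and batched. The sketch below is an assumption about how a consumer might do this, not the benchmark's own loader: `load_pairs` and `make_batches` are hypothetical helpers, and the three-sample dataset is made up.

```python
import json
from pathlib import Path


def load_pairs(dataset_text: str, frame_dir: str = "first_frame") -> list[tuple[str, Path]]:
    """Pair each caption with the path of its extracted first frame."""
    data = json.loads(dataset_text)
    captions, video_ids = data["caption"], data["video_id"]
    if len(captions) != len(video_ids):
        raise ValueError("caption and video_id lists differ in length")
    return [(c, Path(frame_dir) / f"{v}.jpg") for c, v in zip(captions, video_ids)]


def make_batches(pairs: list, batch_size: int) -> list[list]:
    """Split pairs into consecutive batches; the last batch may be smaller."""
    return [pairs[i : i + batch_size] for i in range(0, len(pairs), batch_size)]


# Hypothetical three-sample dataset in the layout that sample.py writes.
example = json.dumps({"caption": ["a cat", "a dog", "a bird"], "video_id": ["v1", "v2", "v3"]})
batches = make_batches(load_pairs(example), batch_size=2)
```

Whether the real benchmark keeps or drops a trailing partial batch is not visible in this part of the diff; the sketch keeps it.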
benchmark/diffusion/image-to-video/sharegpt4video/extract_first_frame.py
ADDED
@@ -0,0 +1,21 @@
+import os
+import json
+
+import cv2
+
+DATASET_PATH = "sharegpt4video_700.json"
+
+
+def main() -> None:
+    os.makedirs("first_frame", exist_ok=True)
+
+    for video_id in json.load(open(DATASET_PATH))["video_id"]:
+        cap = cv2.VideoCapture(f"panda/{video_id}.mp4")
+        ret, frame = cap.read()
+        assert ret, f"failed to read first frame of video {video_id}"
+        cv2.imwrite(f"first_frame/{video_id}.jpg", frame)
+        cap.release()
+
+
+if __name__ == "__main__":
+    main()
benchmark/diffusion/image-to-video/sharegpt4video/sample.py
ADDED
@@ -0,0 +1,29 @@
+import json
+import random
+
+DATASET_PATH = "sharegpt4video_40k.jsonl"
+VIDEO_SHARD_NAME = "panda_videos_1.zip"
+NUM_SAMPLES = 700
+SEED = 1
+
+
+def main() -> None:
+    dataset = [json.loads(line) for line in open(DATASET_PATH) if VIDEO_SHARD_NAME in line]
+    random.seed(SEED)
+    random.shuffle(dataset)
+
+    sampled = dict(caption=[], video_id=[])
+    for sample in dataset[:NUM_SAMPLES]:
+        assert sample["zip_folder"] == VIDEO_SHARD_NAME, f"sample from wrong video shard: {sample}"
+        whole_video_caption = next(
+            (c for c in sample["captions"] if c["idx"] == "-1"), None
+        )
+        assert whole_video_caption is not None, f"whole video caption not found for sample: {sample}"
+        sampled["caption"].append(whole_video_caption["content"])
+        sampled["video_id"].append(sample["video_id"])
+
+    json.dump(sampled, open(f"sharegpt4video_{NUM_SAMPLES}.json", "w"))
+
+
+if __name__ == "__main__":
+    main()
benchmark/diffusion/image-to-video/sharegpt4video/sharegpt4video_100.json
ADDED
The diff for this file is too large to render.
benchmark/diffusion/text-to-image/.dockerignore
ADDED
@@ -0,0 +1 @@
+README.md
benchmark/diffusion/text-to-image/Dockerfile
ADDED
@@ -0,0 +1,20 @@
+FROM nvidia/cuda:12.1.0-base-ubuntu22.04
+
+# Basic installs
+ARG DEBIAN_FRONTEND=noninteractive
+ENV TZ='America/Detroit'
+RUN apt-get update -qq \
+    && apt-get -y --no-install-recommends install python3-pip \
+    && apt-get clean all \
+    && rm -r /var/lib/apt/lists/*
+
+# HuggingFace cache dir
+ENV HF_HOME=/root/.cache/huggingface
+
+# Copy over benchmark suite and install dependencies
+ADD . /workspace/text-to-image
+WORKDIR /workspace/text-to-image
+RUN pip install -r requirements.txt
+
+# Benchmark script to run
+ENTRYPOINT ["python3", "scripts/benchmark_one_datapoint.py"]
benchmark/diffusion/text-to-image/README.md
ADDED
@@ -0,0 +1,48 @@
+# Diffusion model (Text to Image)
+
+This benchmark suite benchmarks diffusion models with the text-to-image task.
+
+## Setup
+
+### Docker images
+
+```sh
+docker build -t mlenergy/leaderboard:diffusion-t2i .
+```
+
+### HuggingFace cache directory
+
+The scripts assume the HuggingFace cache directory will be under `/data/leaderboard/hfcache` on the node that runs this benchmark.
+
+## Benchmarking
+
+### Obtaining one datapoint
+
+The Docker image we've built runs `python3 scripts/benchmark_one_datapoint.py` as its `ENTRYPOINT`.
+
+```sh
+docker run \
+  --gpus '"device=0"' \
+  --cap-add SYS_ADMIN \
+  -v /data/leaderboard/hfcache:/root/.cache/huggingface \
+  -v $(pwd):/workspace/text-to-image \
+  mlenergy/leaderboard:diffusion-t2i \
+  --result-root results \
+  --batch-size 2 \
+  --power-limit 300 \
+  --image-save-every 5 \
+  --num-inference-steps 25 \
+  --model stabilityai/stable-diffusion-2-1 \
+  --huggingface-token $HF_TOKEN
+```
+
+### Obtaining all datapoints for a single model
+
+Export your HuggingFace hub token as the environment variable `$HF_TOKEN`.
+
+Run `scripts/benchmark_one_model.py`.
+
+### Running the entire suite with Pegasus
+
+You can use [`pegasus`](https://github.com/jaywonchung/pegasus) to run the entire benchmark suite.
+Queue and host files are in [`./pegasus`](./pegasus).
benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/kwargs.json
ADDED
@@ -0,0 +1,3 @@
+{
+  "torch_dtype": "torch.float16"
+}
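The `kwargs.json` files store the dtype as a string such as `"torch.float16"`, which has to be turned into the actual object before it can be passed to a pipeline constructor. How the benchmark script does that is not shown in this part of the diff; the sketch below is one plausible approach and purely an assumption. `load_kwargs` is a hypothetical helper, and the demo resolves `math.pi` instead of a torch dtype so that it runs without PyTorch installed.

```python
import importlib
import json
import math


def load_kwargs(text: str, prefix: str = "torch.") -> dict:
    """Parse a kwargs JSON string and resolve values like 'torch.float16' to objects."""
    kwargs = json.loads(text)
    for key, value in kwargs.items():
        # Only strings with the expected module prefix are resolved; others pass through.
        if isinstance(value, str) and value.startswith(prefix):
            module_name, _, attr = value.rpartition(".")
            kwargs[key] = getattr(importlib.import_module(module_name), attr)
    return kwargs


# Stand-in demo: "math.pi" plays the role of "torch.float16".
demo = load_kwargs('{"dtype_like": "math.pi", "variant": "fp16"}', prefix="math.")
```

Restricting resolution to a known prefix keeps ordinary string values such as `"fp16"` untouched and avoids importing arbitrary modules named in a config file.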
benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/revision.txt
ADDED
@@ -0,0 +1 @@
+4721097975058205c4edcdece2cc574b7dd7bc04
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/kwargs.json
ADDED
@@ -0,0 +1,3 @@
+{
+  "torch_dtype": "torch.float16"
+}
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/revision.txt
ADDED
@@ -0,0 +1 @@
+main
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/kwargs.json
ADDED
@@ -0,0 +1,4 @@
+{
+  "torch_dtype": "torch.float16",
+  "variant": "fp16"
+}
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/revision.txt
ADDED
@@ -0,0 +1 @@
+bf79e6c219da8a94abb50235fdc4567eb8fb4632
benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/kwargs.json
ADDED
@@ -0,0 +1,3 @@
+{
+  "torch_dtype": "torch.float16"
+}
benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/revision.txt
ADDED
@@ -0,0 +1 @@
+b195ed2d503f3eb29637050a886d77bd81d35f0e
benchmark/diffusion/text-to-image/models/segmind/SSD-1B/kwargs.json
ADDED
@@ -0,0 +1,4 @@
+{
+  "torch_dtype": "torch.float16",
+  "variant": "fp16"
+}
benchmark/diffusion/text-to-image/models/segmind/SSD-1B/revision.txt
ADDED
@@ -0,0 +1 @@
+60987f37e94cd59c36b1cba832b9f97b57395a10
benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/kwargs.json
ADDED
@@ -0,0 +1,4 @@
+{
+  "torch_dtype": "torch.float16",
+  "variant": "fp16"
+}
benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/revision.txt
ADDED
@@ -0,0 +1 @@
+f4b0486b498f84668e828044de1d0c8ba486e05b
benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/kwargs.json
ADDED
@@ -0,0 +1,4 @@
+{
+  "torch_dtype": "torch.bfloat16",
+  "variant": "bf16"
+}
benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/revision.txt
ADDED
@@ -0,0 +1 @@
+main
benchmark/diffusion/text-to-image/models/stabilityai/stable-diffusion-2-1/kwargs.json
ADDED
@@ -0,0 +1,4 @@
+{
+  "torch_dtype": "torch.float16",
+  "variant": "fp16"
+}