Jae-Won Chung committed
Commit b10121d · 1 Parent(s): 07a9e13

New leaderboard prototype

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .github/workflows/push_spaces.yaml +2 -0
  2. .gitignore +2 -0
  3. README.md +15 -33
  4. _config.yml +9 -3
  5. app.py +735 -253
  6. benchmark/.gitignore +1 -0
  7. benchmark/README.md +14 -0
  8. benchmark/common/download_weights.sh +7 -0
  9. benchmark/common/start_nvml_container.sh +3 -0
  10. benchmark/diffusion/image-to-video/.dockerignore +1 -0
  11. benchmark/diffusion/image-to-video/Dockerfile +20 -0
  12. benchmark/diffusion/image-to-video/README.md +51 -0
  13. benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/kwargs.json +4 -0
  14. benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/revision.txt +1 -0
  15. benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/kwargs.json +4 -0
  16. benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/revision.txt +1 -0
  17. benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/kwargs.json +4 -0
  18. benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/revision.txt +1 -0
  19. benchmark/diffusion/image-to-video/pegasus/A100/hosts_1gpu.yaml +11 -0
  20. benchmark/diffusion/image-to-video/pegasus/A100/queue_1gpu.yaml +6 -0
  21. benchmark/diffusion/image-to-video/pegasus/H100/hosts_1gpu.yaml +11 -0
  22. benchmark/diffusion/image-to-video/pegasus/H100/queue_1gpu.yaml +6 -0
  23. benchmark/diffusion/image-to-video/requirements.txt +7 -0
  24. benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_data.py +38 -0
  25. benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_models.py +36 -0
  26. benchmark/diffusion/image-to-video/scripts/benchmark_one_datapoint.py +300 -0
  27. benchmark/diffusion/image-to-video/scripts/benchmark_one_model.py +84 -0
  28. benchmark/diffusion/image-to-video/sharegpt4video/.gitignore +1 -0
  29. benchmark/diffusion/image-to-video/sharegpt4video/README.md +32 -0
  30. benchmark/diffusion/image-to-video/sharegpt4video/extract_first_frame.py +21 -0
  31. benchmark/diffusion/image-to-video/sharegpt4video/sample.py +29 -0
  32. benchmark/diffusion/image-to-video/sharegpt4video/sharegpt4video_100.json +0 -0
  33. benchmark/diffusion/text-to-image/.dockerignore +1 -0
  34. benchmark/diffusion/text-to-image/Dockerfile +20 -0
  35. benchmark/diffusion/text-to-image/README.md +48 -0
  36. benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/kwargs.json +3 -0
  37. benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/revision.txt +1 -0
  38. benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/kwargs.json +3 -0
  39. benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/revision.txt +1 -0
  40. benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/kwargs.json +4 -0
  41. benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/revision.txt +1 -0
  42. benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/kwargs.json +3 -0
  43. benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/revision.txt +1 -0
  44. benchmark/diffusion/text-to-image/models/segmind/SSD-1B/kwargs.json +4 -0
  45. benchmark/diffusion/text-to-image/models/segmind/SSD-1B/revision.txt +1 -0
  46. benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/kwargs.json +4 -0
  47. benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/revision.txt +1 -0
  48. benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/kwargs.json +4 -0
  49. benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/revision.txt +1 -0
  50. benchmark/diffusion/text-to-image/models/stabilityai/stable-diffusion-2-1/kwargs.json +4 -0
.github/workflows/push_spaces.yaml CHANGED
@@ -1,6 +1,7 @@
 name: Deploy

 on:
+  workflow_dispatch:
   push:
     branches:
       - master
@@ -34,6 +35,7 @@ jobs:
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
         run: |
+          git lfs install
           for i in 1 2 3 4 5; do
             git push -f https://jaywonchung:[email protected]/spaces/ml-energy/leaderboard master:main && break || sleep 5;
           done
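The `run` step retries the push up to five times and sleeps five seconds after each failed attempt. For readers less familiar with the `cmd && break || sleep 5` shell idiom, the same control flow can be sketched in Python. The names `retry`, `action`, `attempts`, and `delay_s` are hypothetical and exist only for this illustration; they are not part of the workflow.

```python
import time
from typing import Callable


def retry(action: Callable[[], bool], attempts: int = 5, delay_s: float = 5.0) -> bool:
    """Run `action` until it reports success or `attempts` runs are used up.

    Mirrors `for i in 1 2 3 4 5; do cmd && break || sleep 5; done`:
    a successful run stops the loop, a failed run sleeps and tries again.
    """
    for _ in range(attempts):
        if action():
            return True  # corresponds to `&& break`
        time.sleep(delay_s)  # corresponds to `|| sleep 5`
    return False
```

One difference worth noting: the sketch returns `False` when every attempt fails, whereas the shell loop as written appears to finish with the exit status of the last `sleep`.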
.gitignore CHANGED
@@ -12,7 +12,9 @@ pyrightconfig.json
 # Python
 *.egg-info
 **/__pycache__
+**/.ipynb_checkpoints
 build/
+**.ipynb

 # Data files
 *.log
README.md CHANGED
@@ -15,49 +15,31 @@ tags: ["energy", "leaderboard"]
 [![Deploy](https://github.com/ml-energy/leaderboard/actions/workflows/push_spaces.yaml/badge.svg?branch=web)](https://github.com/ml-energy/leaderboard/actions/workflows/push_spaces.yaml)
 [![Apache-2.0 License](https://custom-icon-badges.herokuapp.com/github/license/ml-energy/leaderboard?logo=law)](/LICENSE)

-How much energy do LLMs consume?
+How much energy do GenAI models like LLMs and Diffusion models consume?

 This README focuses on explaining how to run the benchmark yourself.
 The actual leaderboard is here: https://ml.energy/leaderboard.

-## Colosseum
-
-We instrumented [Hugging Face TGI](https://github.com/huggingface/text-generation-inference) so that it measures and returns GPU energy consumption.
-Then, our [controller](/spitfight/colosseum/controller) server receives user prompts from the [Gradio app](/app.py), selects two models randomly, and streams model responses back with energy consumption.
-
-## Setup for benchmarking
-
-### Model weights
-
-- For models that are directly accessible in Hugging Face Hub, you don't need to do anything.
-- For other models, convert them to Hugging Face format and put them in `/data/leaderboard/weights/lmsys/vicuna-13B`, for example. The last two path components (e.g., `lmsys/vicuna-13B`) are taken as the name of the model.
-
-### Docker container
+## Repository Organization

-We have our pre-built Docker image published with the tag `mlenergy/leaderboard:latest` ([Dockerfile](/Dockerfile)).
-
-```console
-$ docker run -it \
---name leaderboard0 \
---gpus '"device=0"' \
--v /path/to/your/data/dir:/data/leaderboard \
--v $(pwd):/workspace/leaderboard \
-mlenergy/leaderboard:latest bash
+```
+leaderboard/
+├── benchmark/ # Benchmark scripts & instructions
+├── data/ # Benchmark results
+├── deployment/ # Colosseum deployment files
+├── spitfight/ # Python package for the Colosseum
+├── app.py # Leaderboard Gradio app definition
+└── index.html # Embeds the leaderboard HuggingFace Space
 ```

-The container internally expects weights to be inside `/data/leaderboard/weights` (e.g., `/data/leaderboard/weights/lmsys/vicuna-7B`), and sets the Hugging Face cache directory to `/data/leaderboard/hfcache`.
-If needed, the repository should be mounted to `/workspace/leaderboard` to override the copy of the repository inside the container.
-
-## Running the benchmark
+## Colosseum

-We run benchmarks using multiple nodes and GPUs using [Pegasus](https://github.com/jaywonchung/pegasus). Take a look at [`pegasus/`](/pegasus) for details.
+We instrumented [Hugging Face TGI](https://github.com/huggingface/text-generation-inference) so that it measures and returns GPU energy consumption.
+Then, our [controller](/spitfight/colosseum/controller) server receives user prompts from the [Gradio app](/app.py), selects two models randomly, and streams model responses back with energy consumption.

-You can still run benchmarks without Pegasus like this:
+## Running the Benchmark

-```console
-$ docker exec leaderboard0 python scripts/benchmark.py --model-path /data/leaderboard/weights/lmsys/vicuna-13B --input-file sharegpt/sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json
-$ docker exec leaderboard0 python scripts/benchmark.py --model-path databricks/dolly-v2-12b --input-file sharegpt/sg_90k_part1_html_cleaned_lang_first_sampled_sorted.json
-```
+We open-sourced the entire benchmark with instructions here: [`./benchmark`](./benchmark)

 ## Citation
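The Colosseum paragraph in the README says the controller "selects two models randomly", and `app.py` lets the user pin one of the two through a "One is {model}" preference. The controller's actual selection code is not part of this diff, so the following is only a minimal sketch of how such a choice could work. `pick_two_models` and its parameters are hypothetical names; only the `"Random"` sentinel is taken from `app.py`.

```python
from __future__ import annotations

import random

RANDOM_MODEL_NAME = "Random"  # same sentinel string that app.py defines


def pick_two_models(
    available: list[str],
    preference: str = RANDOM_MODEL_NAME,
    rng: random.Random | None = None,
) -> list[str]:
    """Pick two distinct models, optionally forcing one of them.

    Hypothetical helper: the real controller logic is not shown in the diff.
    """
    rng = rng or random.Random()
    if preference == RANDOM_MODEL_NAME:
        # No preference: draw two distinct models uniformly at random.
        return rng.sample(available, 2)
    # One slot is fixed by the user; fill the other from the remaining models.
    others = [m for m in available if m != preference]
    pair = [preference, rng.choice(others)]
    rng.shuffle(pair)  # hide which side the preferred model lands on
    return pair
```

Shuffling the pair keeps the preferred model from always appearing as "Model A", which matters if votes are meant to be blind.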
 
_config.yml CHANGED
@@ -1,6 +1,12 @@
 exclude:
+- benchmark/
 - deployment/
-- pegasus/
-- scripts/
-- sharegpt/
+- spitfight/
+- docs/
 - tests/
+- .gitignore
+- app.py
+- LICENSE
+- README.md
+- requirements.txt
+- setup.py
app.py CHANGED
@@ -1,5 +1,12 @@
 
 
 
 
 
 
1
  from __future__ import annotations
2
 
 
3
  import copy
4
  import json
5
  import random
@@ -9,16 +16,13 @@ import itertools
9
  import contextlib
10
  import argparse
11
  import os
12
- from typing import Literal
 
13
  from dateutil import parser, tz
14
 
15
  import numpy as np
16
  import gradio as gr
17
  import pandas as pd
18
- import plotly.io as pio
19
- import plotly.express as px
20
- from pandas.api.types import is_numeric_dtype, is_float_dtype
21
- pio.templates.default = "plotly_white"
22
 
23
  from spitfight.colosseum.client import ControllerClient
24
 
@@ -28,8 +32,499 @@ COLOSSUMM_YOUTUBE_DEMO_EMBED_HTML = '<div style="width: 100%; min-width: 400px;"
28
 
29
 
30
  class TableManager:
31
  def __init__(self, data_dir: str) -> None:
32
- """Load leaderboard data from CSV files in data_dir.
33
 
34
  Inside `data_dir`, there should be:
35
  - `models.json`: a JSON file containing information about each model.
@@ -58,6 +553,7 @@ class TableManager:
58
  f'<a style="text-decoration: underline; text-decoration-style: dotted" '
59
  f'target="_blank" href="{url}">{nickname}</a>'
60
  )
 
61
  df["model"] = df["model"].apply(format_model_link)
62
 
63
  # Sort by our 'energy efficiency' score.
@@ -110,63 +606,6 @@ class TableManager:
110
  """Formats into HTML that prints in Monospace font."""
111
  return f"<pre style='font-family: monospace'>{text}</pre>"
112
 
113
- def add_column(self, column_name: str, formula: str):
114
- """Create and add a new column with the given formula."""
115
- # If the user did not provide the name of the new column,
116
- # generate a unique name for them.
117
- if not column_name:
118
- counter = 1
119
- while (column_name := f"custom{counter}") in self.full_df.columns:
120
- counter += 1
121
-
122
- # If the user did not provide a formula, return an error message.
123
- if not formula:
124
- return self.cur_df, self._format_msg("Please enter a formula.")
125
-
126
- # If there is an equal sign in the formula, `df.eval` will
127
- # return an entire DataFrame with the new column, instead of
128
- # just the new column. This is not what we want, so we check
129
- # for this case and return an error message.
130
- if "=" in formula:
131
- return self.cur_df, self._format_msg("Invalid formula: expr cannot contain '='.")
132
-
133
- # The user may want to update an existing column.
134
- verb = "Updated" if column_name in self.full_df.columns else "Added"
135
-
136
- # Evaluate the formula and catch any error.
137
- try:
138
- # Give the users some helper functions that can be used in the formula
139
- # like "@sum(response_length)". Also wipe out some global variables.
140
- col = self.full_df.eval(
141
- formula,
142
- local_dict={"sum": sum, "len": len, "max": max, "min": min},
143
- global_dict={"global_tbm": None},
144
- )
145
- except Exception as exc:
146
- return self.cur_df, self._format_msg(f"Invalid formula: {exc}")
147
-
148
- # If the result is a numeric scalar, make it a Series.
149
- # We may have deleted some models (rows) from the full dataframe when we
150
- # called dropna, so we need to query the maximum index instead of taking len.
151
- if isinstance(col, (int, float)):
152
- col = pd.Series([col] * (self.full_df.index.max() + 1))
153
- # We only accept numeric columns.
154
- if not is_numeric_dtype(col):
155
- return self.cur_df, self._format_msg("Invalid formula: result must be numeric.")
156
- # Round if it's floating point.
157
- if is_float_dtype(col):
158
- col = col.round(2)
159
-
160
- # If the column already exists, update it.
161
- if column_name in self.full_df.columns:
162
- self.full_df[column_name] = col
163
- else:
164
- self.full_df.insert(len(self.schema) + 1, column_name, col)
165
-
166
- # If adding a column succeeded, `self.cur_df` should also be updated.
167
- self.cur_df = self.full_df.loc[self.cur_index]
168
- return self.cur_df, self._format_msg(f"{verb} column '{column_name}'.")
169
-
170
  def get_dropdown(self):
171
  columns = self.full_df.columns.tolist()[1:]
172
  return [
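The removed `add_column` method did two pieces of input handling before evaluating a formula: it generated a unique `custom<N>` name when the user left the column name empty, and it rejected empty formulas and formulas containing `=`. A minimal standalone sketch of that handling follows. `unique_column_name` and `check_formula` are hypothetical helper names; the error strings are copied from the removed code.

```python
from __future__ import annotations


def unique_column_name(existing: set[str], prefix: str = "custom") -> str:
    """Return the first `<prefix><N>` (N = 1, 2, ...) not already in `existing`."""
    counter = 1
    # The walrus operator builds the candidate and tests it in one expression,
    # the same shape as the loop in the removed method.
    while (name := f"{prefix}{counter}") in existing:
        counter += 1
    return name


def check_formula(formula: str) -> str | None:
    """Return an error message for an unusable formula, or None if it may be evaluated."""
    if not formula:
        return "Please enter a formula."
    if "=" in formula:
        # With an assignment, DataFrame.eval returns a whole DataFrame
        # instead of just the new column, which the old code did not want.
        return "Invalid formula: expr cannot contain '='."
    return None
```

Note that the `=` check also rejects comparisons such as `==` and `>=`, which is a side effect of the simple substring test.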
@@ -196,51 +635,40 @@ class TableManager:
196
  self.cur_index = index
197
  return self.cur_df
198
 
199
- def plot_scatter(self, width, height, x, y, z):
200
- # The user did not select either x or y.
201
- if not x or not y:
202
- return None, width, height, self._format_msg("Please select both X and Y.")
203
-
204
- # Width and height may be an empty string. Then we set them to 600.
205
- if not width and not height:
206
- width, height = "600", "600"
207
- elif not width:
208
- width = height
209
- elif not height:
210
- height = width
211
- try:
212
- width, height = int(width), int(height)
213
- except ValueError:
214
- return None, width, height, self._format_msg("Width and height should be positive integers.")
215
-
216
- # Strip the <a> tag from model names.
217
- text = self.cur_df["model"].apply(lambda x: x.split(">")[1].split("<")[0])
218
- # Hide model names since they clutter the plots, and only show them on hover.
219
- if z is None or z == "None" or z == "":
220
- fig = px.scatter(self.cur_df, x=x, y=y, hover_name=text)
221
- else:
222
- fig = px.scatter_3d(self.cur_df, x=x, y=y, z=z, hover_name=text)
223
- fig.update_traces(marker=dict(size=12, line=dict(width=2, color="DarkSlateGrey")))
224
- fig.update_layout(width=width, height=height)
225
 
226
- return fig, width, height, ""
227
 
228
  # The global instance of the TableManager should only be used when
229
  # initializing components in the Gradio interface. If the global instance
230
  # is mutated while handling user sessions, the change will be reflected
231
  # in every user session. Instead, the instance provided by gr.State should
232
  # be used.
233
- global_tbm = TableManager("data")
234
-
235
- # Fetch the latest update date of the leaderboard repository.
236
- resp = requests.get("https://api.github.com/repos/ml-energy/leaderboard/commits/master")
237
- if resp.status_code != 200:
238
- current_date = "[Failed to fetch]"
239
- print("Failed to fetch the latest release date of the leaderboard repository.")
240
- print(resp.json())
241
- else:
242
- current_datetime = parser.parse(resp.json()["commit"]["author"]["date"])
243
- current_date = current_datetime.astimezone(tz.gettz("US/Eastern")).strftime("%Y-%m-%d")
244
 
245
  # Custom JS.
246
  # XXX: This is a hack to make the model names clickable.
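The comment in this hunk warns that mutating the global `TableManager` while handling a user session would leak the change into every session, and that the instance provided by `gr.State` should be used instead. The hazard can be shown without Gradio. In the sketch below, `Table`, `GLOBAL_TABLE`, and `new_session_state` are hypothetical stand-ins, and the per-session deep copy is an assumption about how session state is typically isolated.

```python
import copy


class Table:
    """Tiny stand-in for a table manager that keeps a filtered view."""

    def __init__(self, rows):
        self.rows = list(rows)

    def set_filter(self, keep):
        # Mutates the instance in place, like a session-level filter would.
        self.rows = [row for row in self.rows if row in keep]


# Built once at import time and used only to initialize the UI.
GLOBAL_TABLE = Table(["model-a", "model-b", "model-c"])


def new_session_state() -> Table:
    """Give each session its own copy so filters do not leak across users."""
    return copy.deepcopy(GLOBAL_TABLE)
```

Filtering the object returned by `new_session_state()` leaves `GLOBAL_TABLE` untouched, so the next session still starts from the full table.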
@@ -254,11 +682,14 @@ else:
254
  dataframe_update_js = f"""
255
  function format_model_link() {{
256
  // Iterate over the cells of the first column of the leaderboard table.
257
- for (let index = 1; index <= {len(global_tbm.full_df)}; index++) {{
258
- // Get the cell.
259
- var cell = document.querySelector(
260
- `#tab-leaderboard > div > div > div > table > tbody > tr:nth-child(${{index}}) > td:nth-child(1) > div > span`
261
- );
 
 
 
262
 
263
  // If nothing was found, it likely means that now the visible table has less rows
264
  // than the full table. This happens when the user filters the table. In this case,
@@ -282,6 +713,7 @@ function format_model_link() {{
282
  // Replace the innerHTML of the cell with the interpreted HTML.
283
  cell.replaceChildren(model_anchor);
284
  }}
 
285
 
286
  // Return all arguments as is.
287
  return arguments
@@ -365,25 +797,26 @@ table th:first-child {
365
  }
366
  """
367
 
368
- intro_text = """
369
- <h2>How much energy do modern Large Language Models (LLMs) consume for inference?</h2>
370
-
371
- <p style="font-size: 16px">We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source LLMs in terms of how much time and energy they consume for inference.
372
- Time and energy are of course not the only things we care about -- so we also benchmarked all of the models on a variety of NLP datasets,
373
- including the ARC Challenge (reasoning), HellaSwag (common sense), and TruthfulQA (truthfulness).</p>
374
-
375
- <p style="font-size: 16px">For more detailed information, please take a look at the <b>About</b> tab.
376
- Every benchmark is limited in some sense -- Before you interpret the results, please take a look at the <b>Limitations</b> section there, too.</p>
377
- """
378
-
379
  # The app will not start without a controller address set.
380
  controller_addr = os.environ.get("COLOSSEUM_CONTROLLER_ADDR")
381
  if controller_addr is None:
382
  COLOSSEUM_UP = False
383
- COLOSSEUM_DOWN_MESSAGE = "<br/><h2 style='text-align: center'>Disabled Colosseum for local testing.</h2>"
384
  controller_addr = "localhost"
385
  global_controller_client = ControllerClient(controller_addr=controller_addr, timeout=15)
386
 
 
 
 
 
 
 
 
 
 
 
 
 
387
  # Load the list of models. To reload, the app should be restarted.
388
  RANDOM_MODEL_NAME = "Random"
389
  RANDOM_USER_PREFERENCE = "Two random models"
@@ -392,12 +825,19 @@ model_name_to_user_pref = {model: f"One is {model}" for model in global_availabl
392
  model_name_to_user_pref[RANDOM_MODEL_NAME] = RANDOM_USER_PREFERENCE
393
  user_pref_to_model_name = {v: k for k, v in model_name_to_user_pref.items()}
394
 
 
395
  # Colosseum helper functions.
396
- def enable_interact():
397
- return [gr.update(interactive=True)] * 2
 
 
 
 
 
 
 
 
398
 
399
- def disable_interact():
400
- return [gr.update(interactive=False)] * 2
401
 
402
  def consumed_less_energy_message(energy_a, energy_b):
403
  """Return a message that indicates that the user chose the model that consumed less energy.
@@ -410,6 +850,7 @@ def consumed_less_energy_message(energy_a, energy_b):
410
  how_much = f"{1 / factor:.1f}x" if factor <= 0.5 else f"{100 - factor * 100:.1f}%"
411
  return f"<h2>That response also <span class='green-text'>consumed {how_much} less energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J)!</h2>"
412
 
 
413
  def consumed_more_energy_message(energy_a, energy_b):
414
  """Return a message that indicates that the user chose the model that consumed more energy.
415
 
@@ -421,14 +862,23 @@ def consumed_more_energy_message(energy_a, energy_b):
421
  how_much = f"{factor:.1f}x" if factor >= 2.0 else f"{factor * 100 - 100:.1f}%"
422
  return f"<h2>That response <span class='red-text'>consumed {how_much} more energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J).</h2>"
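The two message helpers switch between a multiplier ("4.0x") and a percentage ("25.0%") depending on how far apart the two energy values are. The lines that compute `factor` fall outside the visible hunks, so the sketch below assumes `factor = energy_a / energy_b`. The function names are hypothetical and only the formatting expressions are taken from the diff.

```python
def less_energy_phrase(energy_a: float, energy_b: float) -> str:
    """Describe how much less energy A used than B (expects energy_a < energy_b)."""
    factor = energy_a / energy_b  # assumed definition; not visible in the diff
    # At half the energy or less, a multiplier reads better than a percentage.
    return f"{1 / factor:.1f}x" if factor <= 0.5 else f"{100 - factor * 100:.1f}%"


def more_energy_phrase(energy_a: float, energy_b: float) -> str:
    """Describe how much more energy A used than B (expects energy_a > energy_b)."""
    factor = energy_a / energy_b  # assumed definition; not visible in the diff
    # At double the energy or more, a multiplier reads better than a percentage.
    return f"{factor:.1f}x" if factor >= 2.0 else f"{factor * 100 - 100:.1f}%"
```

For example, 50 J against 200 J gives a factor of 0.25 and the phrase "4.0x", while 150 J against 200 J gives 0.75 and the phrase "25.0%".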
423
 
 
424
  # Colosseum event handlers
425
  def on_load():
426
  """Initialize the dataframe, shuffle the model preference dropdown choices."""
427
- dataframe = global_tbm.set_filter_get_df()
 
428
  available_models = copy.deepcopy(global_available_models)
429
  random.shuffle(available_models)
430
  available_models.insert(0, RANDOM_MODEL_NAME)
431
- return dataframe, gr.Dropdown.update(choices=[model_name_to_user_pref[model] for model in available_models])
 
 
 
 
 
 
 
432
 
433
  def add_prompt_disable_submit(prompt, history_a, history_b):
434
  """Add the user's prompt to the two model's history and disable further submission."""
@@ -442,12 +892,17 @@ def add_prompt_disable_submit(prompt, history_a, history_b):
442
  client,
443
  ]
444
 
 
445
  def generate_responses(client: ControllerClient, user_preference, history_a, history_b):
446
  """Generate responses for the two models."""
447
  model_preference = user_pref_to_model_name[user_preference]
448
  for resp_a, resp_b in itertools.zip_longest(
449
- client.prompt(prompt=history_a[-1][0], index=0, model_preference=model_preference),
450
- client.prompt(prompt=history_b[-1][0], index=1, model_preference=model_preference),
 
 
 
 
451
  ):
452
  if resp_a is not None:
453
  history_a[-1][1] += resp_a
@@ -455,8 +910,10 @@ def generate_responses(client: ControllerClient, user_preference, history_a, his
455
  history_b[-1][1] += resp_b
456
  yield [history_a, history_b]
457
 
 
458
  def make_resp_vote_func(victory_index: Literal[0, 1]):
459
  """Return a function that will be called when the user clicks on response preference vote buttons."""
 
460
  def resp_vote_func(client: ControllerClient):
461
  vote_response = client.response_vote(victory_index=victory_index)
462
  model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
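`generate_responses` advances both model streams in lockstep with `itertools.zip_longest`, so the shorter response simply stops growing while the longer one continues. A minimal sketch of that pattern with plain string chunks instead of controller responses follows. `stream_side_by_side` is a hypothetical name and the chunk iterables stand in for `client.prompt(...)`.

```python
from __future__ import annotations

import itertools
from typing import Iterable, Iterator


def stream_side_by_side(
    chunks_a: Iterable[str], chunks_b: Iterable[str]
) -> Iterator[tuple[str, str]]:
    """Yield the accumulated text of both streams after every step."""
    text_a, text_b = "", ""
    # zip_longest pads the exhausted stream with None instead of stopping early.
    for part_a, part_b in itertools.zip_longest(chunks_a, chunks_b):
        if part_a is not None:
            text_a += part_a
        if part_b is not None:
            text_b += part_b
        yield text_a, text_b
```

With plain `zip`, the loop would end as soon as the shorter response finished and the rest of the longer one would be dropped.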
@@ -491,10 +948,13 @@ def make_resp_vote_func(victory_index: Literal[0, 1]):
491
  # Keep the reset button disabled
492
  gr.Button.update(visible=False, interactive=False),
493
  ]
 
494
  return resp_vote_func
495
 
 
496
  def make_energy_vote_func(is_worth: bool):
497
  """Return a function that will be called when the user clicks on energy vote buttons."""
 
498
  def energy_vote_func(client: ControllerClient, energy_message: str):
499
  vote_response = client.energy_vote(is_worth=is_worth)
500
  model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
@@ -508,8 +968,10 @@ def make_energy_vote_func(is_worth: bool):
508
  # Append to the energy comparison message
509
  energy_message[:-5] + (" Fair enough.</h2>" if is_worth else " Wasn't worth it.</h2>"),
510
  ]
 
511
  return energy_vote_func
512
 
 
513
  def play_again():
514
  available_models = copy.deepcopy(global_available_models)
515
  random.shuffle(available_models)
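`make_resp_vote_func` and `make_energy_vote_func` are closure factories: each call bakes one argument into the handler it returns, so a single definition can serve both buttons. The energy handler also splices its verdict into an existing `<h2>` message by slicing off the closing tag. A minimal sketch of both ideas, without the controller client, is below. `make_verdict_func` is a hypothetical name; the two suffix strings come from the diff.

```python
from typing import Callable


def make_verdict_func(is_worth: bool) -> Callable[[str], str]:
    """Return a handler that appends the user's verdict to an <h2> message."""
    suffix = " Fair enough." if is_worth else " Wasn't worth it."

    def verdict_func(energy_message: str) -> str:
        # [:-5] drops the trailing "</h2>" (five characters) so the verdict
        # lands inside the heading, and the tag is then added back.
        return energy_message[:-5] + suffix + "</h2>"

    return verdict_func
```

The slice only works if the message really ends in `</h2>`, which holds for the strings built by the two energy message helpers in this file.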
@@ -524,11 +986,16 @@ def play_again():
524
  # Hide energy vote buttons and message
525
  gr.Button.update(visible=False), gr.Button.update(visible=False), gr.Markdown.update(visible=False),
526
  # Enable model preference dropdown and shuffle choices
527
- gr.Dropdown.update(value=RANDOM_USER_PREFERENCE, choices=[model_name_to_user_pref[model] for model in available_models], interactive=True),
 
 
 
 
528
  # Disable reset button
529
  gr.Button.update(interactive=False, visible=False),
530
  ]
531
 
 
532
  focus_prompt_input_js = """
533
  function() {
534
  for (let textarea of document.getElementsByTagName("textarea")) {
@@ -541,13 +1008,17 @@ function() {
541
  """
542
 
543
  with gr.Blocks(css=custom_css) as block:
544
- tbm = gr.State(global_tbm) # type: ignore
 
 
545
  with gr.Box():
546
- gr.HTML("<h1><a href='https://ml.energy' class='text-logo'>ML.ENERGY</a> Leaderboard</h1>")
 
 
547
 
548
  with gr.Tabs():
549
  # Tab: Colosseum.
550
- with gr.TabItem("Colosseum ⚔️️"):
551
  if COLOSSEUM_UP:
552
  gr.Markdown(open("docs/colosseum_top.md").read())
553
  else:
@@ -587,32 +1058,64 @@ with gr.Blocks(css=custom_css) as block:
587
  resp_vote_btn_list: list[gr.component.Component] = []
588
  with gr.Column():
589
  with gr.Row():
590
- masked_model_names.append(gr.Markdown(visible=False, elem_classes=["model-name-text"]))
 
 
591
  with gr.Row():
592
- chatbots.append(gr.Chatbot(label="Model A", elem_id="chatbot", height=400, elem_classes=None if COLOSSEUM_UP else ["greyed-out"]))
 
 
 
 
 
 
 
593
  with gr.Row():
594
- left_resp_vote_btn = gr.Button(value="👈 Model A is better", interactive=False)
 
 
595
  resp_vote_btn_list.append(left_resp_vote_btn)
596
 
597
  with gr.Column():
598
  with gr.Row():
599
- masked_model_names.append(gr.Markdown(visible=False, elem_classes=["model-name-text"]))
 
 
600
  with gr.Row():
601
- chatbots.append(gr.Chatbot(label="Model B", elem_id="chatbot", height=400, elem_classes=None if COLOSSEUM_UP else ["greyed-out"]))
 
 
 
 
 
 
 
602
  with gr.Row():
603
- right_resp_vote_btn = gr.Button(value="👉 Model B is better", interactive=False)
 
 
604
  resp_vote_btn_list.append(right_resp_vote_btn)
605
 
606
  with gr.Row():
607
  energy_comparison_message = gr.HTML(visible=False)
608
 
609
  with gr.Row():
610
- worth_energy_vote_btn = gr.Button(value="The better response was worth 👍 the extra energy.", visible=False)
611
- notworth_energy_vote_btn = gr.Button(value="Not really worth that much more. 👎", visible=False)
612
- energy_vote_btn_list: list[gr.component.Component] = [worth_energy_vote_btn, notworth_energy_vote_btn]
 
 
 
 
 
 
 
 
613
 
614
  with gr.Row():
615
- play_again_btn = gr.Button("Play again!", visible=False, elem_classes=["btn-submit"])
 
 
616
 
617
  gr.Markdown(open("docs/colosseum_bottom.md").read())
618
 
@@ -622,11 +1125,11 @@ with gr.Blocks(css=custom_css) as block:
622
  (prompt_input
623
  .submit(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
624
  .then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
625
- .then(enable_interact, None, resp_vote_btn_list, queue=False))
626
  (prompt_submit_btn
627
  .click(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
628
  .then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
629
- .then(enable_interact, None, resp_vote_btn_list, queue=False))
630
 
631
  left_resp_vote_btn.click(
632
  make_resp_vote_func(victory_index=0),
@@ -663,128 +1166,100 @@ with gr.Blocks(css=custom_css) as block:
663
  )
664
  .then(None, _js=focus_prompt_input_js, queue=False))
665

666
 
667
- # Tab: Leaderboard.
668
- with gr.Tab("Leaderboard"):
669
  with gr.Box():
670
- gr.HTML(intro_text)
671
 
672
  # Block: Checkboxes to select benchmarking parameters.
673
  with gr.Row():
674
  with gr.Box():
675
  gr.Markdown("### Benchmark results to show")
676
  checkboxes: list[gr.CheckboxGroup] = []
677
- for key, choices in global_tbm.schema.items():
678
  # Specifying `value` makes everything checked by default.
679
- checkboxes.append(gr.CheckboxGroup(choices=choices, value=choices[:1], label=key))
 
 
 
 
680
 
681
  # Block: Leaderboard table.
682
  with gr.Row():
683
- dataframe = gr.Dataframe(type="pandas", elem_id="tab-leaderboard", interactive=False)
 
 
684
  # Make sure the models have clickable links.
685
  dataframe.change(None, None, None, _js=dataframe_update_js, queue=False)
686
  # Table automatically updates when users check or uncheck any checkbox.
687
  for checkbox in checkboxes:
688
- checkbox.change(TableManager.set_filter_get_df, inputs=[tbm, *checkboxes], outputs=dataframe, queue=False)
689
-
690
- # Block: Allow users to add new columns.
691
- with gr.Box():
692
- gr.Markdown("### Add custom columns to the table")
693
- with gr.Row():
694
- with gr.Column(scale=3):
695
- with gr.Row():
696
- colname_input = gr.Textbox(lines=1, label="Custom column name")
697
- formula_input = gr.Textbox(lines=1, label="Formula (@sum, @len, @max, and @min are supported)")
698
- with gr.Column(scale=1):
699
- with gr.Row():
700
- add_col_btn = gr.Button("Add to table (⏎)", elem_classes=["btn-submit"])
701
- with gr.Row():
702
- clear_input_btn = gr.Button("Clear")
703
- with gr.Row():
704
- add_col_message = gr.HTML("")
705
- gr.Examples(
706
- examples=[
707
- ["power", "energy / latency"],
708
- ["token_per_joule", "response_length / energy"],
709
- ["verbose", "response_length > @sum(response_length) / @len(response_length)"],
710
- ],
711
- inputs=[colname_input, formula_input],
712
- )
713
- colname_input.submit(
714
- TableManager.add_column,
715
- inputs=[tbm, colname_input, formula_input],
716
- outputs=[dataframe, add_col_message],
717
- queue=False,
718
- )
719
- formula_input.submit(
720
- TableManager.add_column,
721
- inputs=[tbm, colname_input, formula_input],
722
- outputs=[dataframe, add_col_message],
723
- queue=False,
724
- )
725
- add_col_btn.click(
726
- TableManager.add_column,
727
- inputs=[tbm, colname_input, formula_input],
728
- outputs=[dataframe, add_col_message],
729
- queue=False,
730
- )
731
- clear_input_btn.click(
732
- lambda: (None, None, None),
733
- inputs=None,
734
- outputs=[colname_input, formula_input, add_col_message],
735
- queue=False,
736
- )
737
-
738
- # Block: Allow users to plot 2D and 3D scatter plots.
739
- with gr.Box():
740
- gr.Markdown("### Scatter plot (Hover over marker to show model name)")
741
- with gr.Row():
742
- with gr.Column(scale=3):
743
- with gr.Row():
744
- # Initialize the dropdown choices with the global TableManager with just the original columns.
745
- axis_dropdowns = global_tbm.get_dropdown()
746
- with gr.Column(scale=1):
747
- with gr.Row():
748
- plot_btn = gr.Button("Plot", elem_classes=["btn-submit"])
749
- with gr.Row():
750
- clear_plot_btn = gr.Button("Clear")
751
- with gr.Accordion("Plot size (600 x 600 by default)", open=False):
752
- with gr.Row():
753
- plot_width_input = gr.Textbox("600", lines=1, label="Width (px)")
754
- plot_height_input = gr.Textbox("600", lines=1, label="Height (px)")
755
- with gr.Row():
756
- plot = gr.Plot(value=global_tbm.plot_scatter(
757
- plot_width_input.value,
758
- plot_height_input.value,
759
- x=axis_dropdowns[0].value,
760
- y=axis_dropdowns[1].value,
761
- z=axis_dropdowns[2].value,
762
- )[0]) # type: ignore
763
- with gr.Row():
764
- plot_message = gr.HTML("")
765
- add_col_btn.click(TableManager.update_dropdown, inputs=tbm, outputs=axis_dropdowns, queue=False) # type: ignore
766
- plot_width_input.submit(
767
- TableManager.plot_scatter,
768
- inputs=[tbm, plot_width_input, plot_height_input, *axis_dropdowns],
769
- outputs=[plot, plot_width_input, plot_height_input, plot_message],
770
- queue=False,
771
- )
772
- plot_height_input.submit(
773
- TableManager.plot_scatter,
774
- inputs=[tbm, plot_width_input, plot_height_input, *axis_dropdowns],
775
- outputs=[plot, plot_width_input, plot_height_input, plot_message],
776
- queue=False,
777
- )
778
- plot_btn.click(
779
- TableManager.plot_scatter,
780
- inputs=[tbm, plot_width_input, plot_height_input, *axis_dropdowns],
781
- outputs=[plot, plot_width_input, plot_height_input, plot_message],
782
- queue=False,
783
- )
784
- clear_plot_btn.click(
785
- lambda: (None,) * 7,
786
- None,
787
- outputs=[*axis_dropdowns, plot, plot_width_input, plot_height_input, plot_message],
788
  queue=False,
789
  )
790
 
@@ -794,8 +1269,7 @@ with gr.Blocks(css=custom_css) as block:
794
 
795
  # Tab: About page.
796
  with gr.Tab("About"):
797
- # Read in LEADERBOARD.md
798
- gr.Markdown(open("docs/leaderboard.md").read())
799
 
800
  # Citation
801
  with gr.Accordion("📚 Citation", open=False, elem_id="citation-header"):
@@ -809,13 +1283,21 @@ with gr.Blocks(css=custom_css) as block:
809
  )
810
 
811
  # Load the table on page load.
812
- block.load(on_load, outputs=[dataframe, model_preference_dropdown], queue=False)
 
 
 
 
813
 
814
 
815
  if __name__ == "__main__":
816
  parser = argparse.ArgumentParser()
817
- parser.add_argument("--share", action="store_true", help="Specify if sharing is enabled")
 
 
818
  parser.add_argument("--concurrency", type=int, default=50)
819
 
820
  args = parser.parse_args()
821
- block.queue(concurrency_count=args.concurrency, api_open=False).launch(share=args.share, show_error=True)
 
 
 
1
+ """Gradio app for the ML.ENERGY leaderboard.
2
+
3
+ Everything is in a single file. Search for `gr.Blocks` to find the place
4
+ where UI elements are actually defined.
5
+ """
6
+
7
  from __future__ import annotations
8
 
9
+ from abc import abstractmethod
10
  import copy
11
  import json
12
  import random
 
16
  import contextlib
17
  import argparse
18
  import os
19
+ from pathlib import Path
20
+ from typing import Literal, Any
21
  from dateutil import parser, tz
22
 
23
  import numpy as np
24
  import gradio as gr
25
  import pandas as pd
 
 
 
 
26
 
27
  from spitfight.colosseum.client import ControllerClient
28
 
 
32
 
33
 
34
  class TableManager:
35
+ """Manages the data for the leaderboard tables for tasks."""
36
+
37
+ def __init__(self, data_dir: str) -> None:
38
+ """Load leaderboard data from files in `data_dir`.
39
+
40
+ Expected directory structure: `data_dir/gpu_model`.
41
+ Inside the innermost (GPU) directory, there should be:
42
+ - `models.json`: JSON file that maps huggingface model IDs to model info.
43
+ Some models listed in this file may not have benchmark results.
44
+ - `model_org/model_name/*.json`: JSON files containing the benchmark results.
45
+ """
46
+ self.data_dir = Path(data_dir)
47
+
48
+ def __str__(self) -> str:
49
+ return f"{self.__class__.__name__}(data_dir={self.data_dir})"
50
+
51
+ def _wrap_model_name(self, url: str, model_name: str) -> str:
52
+ """Wrap the model name in an HTML anchor."""
53
+ return f'<a style="text-decoration: underline; text-decoration-style: dotted" target="_blank" href="{url}">{model_name}</a>'
54
+
55
+ def _unwrap_model_name(self, model_name: str) -> str:
56
+ """Unwrap the model name from an HTML anchor."""
57
+ return model_name.split(">")[1].split("<")[0]
58
+
59
+ @abstractmethod
60
+ def get_tab_name(self) -> str:
61
+ """Return the name of the leaderboard."""
62
+
63
+ @abstractmethod
64
+ def get_intro_text(self) -> tuple[str, str]:
65
+ """Return the type of the introduction text and the introduction text."""
66
+
67
+ @abstractmethod
68
+ def get_detail_text(self) -> tuple[str, str]:
69
+ """Return the type of the detail text and the detail text."""
70
+
71
+ def get_benchmark_checkboxes(self) -> dict[str, list[str]]:
72
+ """Return data for the benchmark selection checkboxes."""
73
+ return {}
74
+
75
+ def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
76
+ """Return data for the benchmark selection sliders.
77
+
78
+ Dictionary values are tuples of the form (min, max, step, default).
79
+ """
80
+ return {}
81
+
82
+ @abstractmethod
83
+ def get_all_models(self) -> list[str]:
84
+ """Return all available models."""
85
+
86
+ @abstractmethod
87
+ def set_filter_get_df(self, *filters) -> pd.DataFrame:
88
+ """Set the current set of filters and return the filtered DataFrame."""
89
+
90
+
91
+ class LLMTableManager(TableManager):
92
+ def __init__(self, data_dir: str, task_name: str) -> None:
93
+ """Load leaderboard data from files in `data_dir`.
94
+
95
+ Under `data_dir`, there should be:
96
+ - `models.json`: JSON file that maps huggingface model IDs to model info.
97
+ Some models listed in this file may not have benchmark results.
98
+ - `schema.yaml`: YAML file containing the schema of the benchmark.
99
+
100
+ Then, benchmark data files are nested under `data_dir` according to the schema.
101
+ One directory hierarchy for each choice in the schema and then two more -- the
102
+ model's HuggingFace hub organization and the model name.
103
+ """
104
+ super().__init__(data_dir)
105
+
106
+ self.task_name = task_name
107
+
108
+ # Read in the data into a Pandas DataFrame.
109
+ # Important: The ordering of `self.schema` determines the directory structure.
110
+ self.schema = yaml.safe_load(open(self.data_dir / "schema.yaml"))
111
+ models: dict[str, dict[str, Any]] = json.load(
112
+ open(self.data_dir / "models.json")
113
+ )
114
+ res_df = pd.DataFrame()
115
+ for choice in itertools.product(*self.schema.values()):
116
+ result_dir = self.data_dir / "/".join(choice)
117
+ with contextlib.suppress(FileNotFoundError):
118
+ for model_id, model_info in models.items():
119
+ for file in (result_dir / model_id).glob("*.json"):
120
+ model_df = pd.DataFrame([json.load(open(file))])
121
+ # Sanity checks and standardization of schema values.
122
+ assert model_df["Model"].iloc[0] == model_id
123
+ for key, val in zip(self.schema.keys(), choice):
124
+ assert (
125
+ str(val).lower() in str(model_df[key].iloc[0]).lower()
126
+ )
127
+ model_df[key] = val
128
+ # Format the model name as an HTML anchor.
129
+ model_df["Model"] = self._wrap_model_name(model_info["url"], model_info["nickname"])
130
+ model_df["Params"] = model_info["params"]
131
+ res_df = pd.concat([res_df, model_df])
132
+
133
+ if res_df.empty:
134
+ raise ValueError(
135
+ f"No benchmark JSON files were read from {self.data_dir=}."
136
+ )
137
+
138
+ # Order columns
139
+ columns = res_df.columns.to_list()
140
+ cols_to_order = ["Model", "Params"]
141
+ cols_to_order.extend(self.schema.keys())
142
+ columns = cols_to_order + [col for col in columns if col not in cols_to_order]
143
+ res_df = res_df[columns]
144
+
145
+ # Order rows
146
+ res_df = res_df.sort_values(by=["Model", *self.schema.keys(), "Energy/req (J)"])
147
+
148
+ self.cur_df = self.full_df = res_df.round(2)
149
+
150
+ # We need to set the default view separately when `gr.State` is forked.
151
+ self.set_filter_get_df()
152
+
153
+ def get_benchmark_checkboxes(self) -> dict[str, list[str]]:
154
+ return self.schema
155
+
156
+ def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
157
+ return {"Target Time Per Output Token (TPOT) (s)": (0.0, 0.5, 0.01, 0.2)}
158
+
159
+ def get_all_models(self) -> list[str]:
160
+ return self.full_df["Model"].apply(self._unwrap_model_name).unique().tolist()
161
+
162
+ def set_filter_get_df(self, *filters) -> pd.DataFrame:
163
+ """Set the current set of filters and return the filtered DataFrame.
164
+
165
+ Filters can either be completely empty, or be a concatenated list of
166
+ choices from all checkboxes and all sliders.
167
+ """
168
+ # If the filter is empty, we default to the first choice for each checkbox.
169
+ if not filters:
170
+ checkboxes = [choices[:1] for choices in self.schema.values()]
171
+ sliders = [slider[3] for slider in self.get_benchmark_sliders().values()]
172
+ filters = checkboxes + sliders
173
+
174
+ index = np.full(len(self.full_df), True)
175
+ # Checkboxes
176
+ for setup, choice in zip(self.schema, filters):
177
+ index = index & self.full_df[setup].isin(choice)
178
+ self.cur_df = self.full_df.loc[index]
179
+
180
+ # Sliders (We just have TPOT for now.)
181
+ # For each `Model`, we want to first filter out rows whose `Avg TPOT (s)` is greater than the slider value.
182
+ # Finally, keep only the row whose `Energy/req (J)` is the smallest.
183
+ tpot_slo = filters[-1]
184
+ self.cur_df = (
185
+ self.cur_df
186
+ .groupby("Model")[self.cur_df.columns]
187
+ .apply(lambda x: x[x["Avg TPOT (s)"] <= tpot_slo], include_groups=True)
188
+ .sort_values(by="Energy/req (J)")
189
+ .reset_index(drop=True)
190
+ .groupby("Model")
191
+ .head(1)
192
+ )
193
+
194
+ return self.cur_df
195
+
196
+ def get_detail_text(self) -> tuple[str, str]:
197
+ text = """
198
+ Columns
199
+ - **Model**: The name of the model.
200
+ - **GPU**: Name of the GPU model used for benchmarking.
201
+ - **Params**: Number of parameters in the model.
202
+ - **TP**: Tensor parallelism degree.
203
+ - **PP**: Pipeline parallelism degree. (TP * PP is the total number of GPUs used.)
204
+ - **Energy/req (J)**: Energy consumed per request in Joules.
205
+ - **Avg TPOT (s)**: Average time per output token in seconds.
206
+ - **Token tput (toks/s)**: Average number of tokens generated by the engine per second.
207
+ - **Avg Output Tokens**: Average number of output tokens in the LLM's response.
208
+ - **Avg BS**: Average batch size of the serving engine over time.
209
+ - **Max BS**: Maximum batch size configuration of the serving engine.
210
+
211
+ For more detailed information, please take a look at the **About** tab.
212
+ """
213
+ return "markdown", text
214
+
215
+
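Reviewer note: the slider handling in `set_filter_get_df` above amounts to "per model, drop rows that miss the latency target, then keep the lowest-energy row". The sketch below restates that rule in plain Python, without pandas, so the selection logic can be checked in isolation. The function name `pick_most_efficient` and the sample rows are hypothetical and are not part of this commit.

```python
from __future__ import annotations


def pick_most_efficient(
    rows: list[dict], latency_key: str, energy_key: str, slo: float
) -> list[dict]:
    """Per model, keep rows within the latency target and return the lowest-energy one."""
    best: dict[str, dict] = {}
    for row in rows:
        # Rows that miss the latency target are never candidates.
        if row[latency_key] > slo:
            continue
        current = best.get(row["Model"])
        if current is None or row[energy_key] < current[energy_key]:
            best[row["Model"]] = row
    # Mirror the diff's ordering: ascending energy.
    return sorted(best.values(), key=lambda r: r[energy_key])


# Hypothetical sample data, for illustration only.
rows = [
    {"Model": "model-a", "Avg TPOT (s)": 0.05, "Energy/req (J)": 900.0},
    {"Model": "model-a", "Avg TPOT (s)": 0.15, "Energy/req (J)": 400.0},
    {"Model": "model-a", "Avg TPOT (s)": 0.30, "Energy/req (J)": 250.0},
    {"Model": "model-b", "Avg TPOT (s)": 0.25, "Energy/req (J)": 300.0},
]
selected = pick_most_efficient(rows, "Avg TPOT (s)", "Energy/req (J)", slo=0.2)
```

With a 0.2 s target, `model-a` keeps its 400 J row (the 250 J row is too slow) and `model-b` disappears from the table entirely, because none of its rows meets the target.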
216
+ class LLMChatTableManager(LLMTableManager):
217
+ """LLM table manager for chat tasks."""
218
+
219
+ def get_tab_name(self) -> str:
220
+ return "LLM Chat"
221
+
222
+ def get_intro_text(self) -> tuple[str, str]:
223
+ text = """
224
+ <h2>How much energy do GenAI models consume?</h2>
225
+
226
+ <h3>LLM chatbot response generation</h3>
227
+
228
+ <p style="font-size: 16px">
229
+ We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various instruction-tuned LLMs in terms of how much time and energy they consume for inference.
230
+ </p>
231
+
232
+ <p style="font-size: 16px">
233
+ An average Time Per Output Token (TPOT) of 0.20 seconds roughly corresponds to a person reading at 240 words per minute and 1.3 tokens per word.
234
+ </p>
235
+ """
236
+ return "html", text
237
+
238
+
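Reviewer note: the intro text above ties the default 0.20 s TPOT to a reading speed of 240 words per minute at 1.3 tokens per word. The arithmetic behind that claim can be checked with a few lines; the constant names here are illustrative only.

```python
WORDS_PER_MINUTE = 240   # reading speed quoted in the intro text
TOKENS_PER_WORD = 1.3    # tokens-per-word ratio quoted in the intro text

# Tokens a reader consumes per second at that pace (about 5.2).
tokens_per_second = WORDS_PER_MINUTE * TOKENS_PER_WORD / 60

# Time budget per token for generation to keep up with the reader (about 0.19 s).
seconds_per_token = 1 / tokens_per_second
```

The result is roughly 0.19 s per token, which the intro text rounds to 0.20 s.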
239
+ class LLMCodeTableManager(LLMTableManager):
240
+ """LLM table manager for coding tasks."""
241
+
242
+ def get_tab_name(self) -> str:
243
+ return "LLM Code"
244
+
245
+ def get_intro_text(self) -> tuple[str, str]:
246
+ text = """
247
+ <h2>How much energy do GenAI models consume?</h2>
248
+
249
+ <h3>LLM code generation</h3>
250
+
251
+ <p style="font-size: 16px">
252
+ We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various LLMs specialized for coding in terms of how much time and energy they consume for inference.
253
+ </p>
254
+
255
+ <p style="font-size: 16px">
256
+ An average Time Per Output Token (TPOT) of 0.20 seconds roughly corresponds to a person reading at 240 words per minute and 1.3 tokens per word.
257
+ </p>
258
+ """
259
+ return "html", text
260
+
261
+
262
+ class VLMChatTableManager(LLMTableManager):
263
+ """VLM table manager for chat tasks."""
264
+
265
+ def get_tab_name(self) -> str:
266
+ return "VLM Visual Chat"
267
+
268
+ def get_intro_text(self) -> tuple[str, str]:
269
+ text = """
270
+ <h2>How much energy do GenAI models consume?</h2>
271
+
272
+ <h3>VLM visual chatbot response generation</h3>
273
+
274
+ <p style="font-size: 16px">
275
+ We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various Vision Language Models (VLMs) in terms of how much time and energy they consume for inference.
276
+ </p>
277
+
278
+ <p style="font-size: 16px">
279
+ A Time Per Output Token (TPOT) of 0.2 seconds roughly corresponds to a person reading at 240 words per minute and 1.3 tokens per word.
280
+ </p>
281
+ """
282
+ return "html", text
283
+
284
+
285
+ class DiffusionTableManager(TableManager):
286
+ def __init__(self, data_dir: str, task_name: str) -> None:
287
+ """Load leaderboard data from files in `data_dir`.
288
+
289
+ Under `data_dir`, there should be:
290
+ - `models.json`: JSON file that maps huggingface model IDs to model info.
291
+ Some models listed in this file may not have benchmark results.
292
+ - `schema.yaml`: YAML file containing the schema of the benchmark.
293
+
294
+ Then, benchmark data files are nested under `data_dir` according to the schema.
295
+ One directory hierarchy for each choice in the schema and then two more -- the
296
+ model's HuggingFace hub organization and the model name.
297
+ """
298
+ super().__init__(data_dir)
299
+
300
+ self.task_name = task_name
301
+
302
+ if "to video" in task_name.lower():
303
+ self.energy_col = "Energy/video (J)"
304
+ elif "to image" in task_name.lower():
305
+ self.energy_col = "Energy/image (J)"
306
+ else:
307
+ raise ValueError(f"Unknown task name: {task_name=}")
308
+
309
+ # Read in the data into a Pandas DataFrame.
310
+ # Important: The ordering of `self.schema` determines the directory structure.
311
+ self.schema = yaml.safe_load(open(self.data_dir / "schema.yaml"))
312
+ models: dict[str, dict[str, Any]] = json.load(
313
+ open(self.data_dir / "models.json")
314
+ )
315
+ res_df = pd.DataFrame()
316
+ for choice in itertools.product(*self.schema.values()):
317
+ result_dir = self.data_dir / "/".join(choice)
318
+ with contextlib.suppress(FileNotFoundError):
319
+ for model_id, model_info in models.items():
320
+ for file in (result_dir / model_id).glob("*.json"):
321
+ model_df = pd.DataFrame([json.load(open(file))])
322
+ # Sanity checks and standardization of schema values.
323
+ assert model_df["Model"].iloc[0] == model_id
324
+ for key, val in zip(self.schema.keys(), choice):
325
+ assert (
326
+ str(val).lower() in str(model_df[key].iloc[0]).lower()
327
+ )
328
+ model_df[key] = val
329
+ # Format the model name as an HTML anchor.
330
+ model_df["Model"] = self._wrap_model_name(model_info["url"], model_info["nickname"])
331
+ model_df["Total params"] = model_info["total_params"]
332
+ model_df["Denoising params"] = model_info["denoising_params"]
333
+ model_df["Resolution"] = model_info["resolution"]
334
+ res_df = pd.concat([res_df, model_df])
335
+
336
+ if res_df.empty:
337
+ raise ValueError(
338
+ f"No benchmark JSON files were read from {self.data_dir=}."
339
+ )
340
+
341
+ # Order columns
342
+ columns = res_df.columns.to_list()
343
+ cols_to_order = ["Model", "Denoising params", "Total params"]
344
+ cols_to_order.extend(self.schema.keys())
345
+ columns = cols_to_order + [col for col in columns if col not in cols_to_order]
346
+ res_df = res_df[columns]
347
+
348
+ # Order rows
349
+ res_df = res_df.sort_values(by=["Model", *self.schema.keys(), self.energy_col])
350
+
351
+ self.cur_df = self.full_df = res_df.round(2)
352
+
353
+ # We need to set the default view separately when `gr.State` is forked.
354
+ self.set_filter_get_df()
355
+
356
+ def get_benchmark_checkboxes(self) -> dict[str, list[str]]:
357
+ return self.schema
358
+
359
+ def get_all_models(self) -> list[str]:
360
+ return self.full_df["Model"].apply(self._unwrap_model_name).unique().tolist()
361
+
362
+ def set_filter_get_df(self, *filters) -> pd.DataFrame:
363
+ """Set the current set of filters and return the filtered DataFrame.
364
+
365
+ Filters can either be completely empty, or be a concatenated list of
366
+ choices from all checkboxes and all sliders.
367
+ """
368
+ # If the filter is empty, we default to the first choice for each key.
369
+ if not filters:
370
+ checkboxes = [choices[:1] for choices in self.schema.values()]
371
+ sliders = [slider[3] for slider in self.get_benchmark_sliders().values()]
372
+ filters = checkboxes + sliders
373
+
374
+ index = np.full(len(self.full_df), True)
375
+ # Checkboxes
376
+ for setup, choice in zip(self.schema, filters):
377
+ index = index & self.full_df[setup].isin(choice)
378
+ self.cur_df = self.full_df.loc[index]
379
+
380
+ # Sliders (We just have Batch latency for now.)
381
+ # For each `Model`, we want to first filter out rows whose `Batch latency (s)` is greater than the slider value.
382
+ # Finally, keep only the row whose `Energy/image (J)` or `Energy/video (J)` is the smallest.
383
+ batch_latency = filters[-1]
384
+ self.cur_df = (
385
+ self.cur_df
386
+ .groupby("Model")[self.cur_df.columns]
387
+ .apply(
388
+ lambda x: x[x["Batch latency (s)"] <= batch_latency],
389
+ include_groups=True,
390
+ )
391
+ .sort_values(by=self.energy_col)
392
+ .reset_index(drop=True)
393
+ .groupby("Model")
394
+ .head(1)
395
+ )
396
+
397
+ return self.cur_df
398
+
399
+
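Reviewer note: both `__init__` methods above derive the result directories from the Cartesian product of the schema choices, followed by the model's organization and name. The sketch below enumerates those candidate directories for a made-up schema. The helper name `candidate_result_dirs`, the schema keys and values, and the paths are all hypothetical. Unlike the diff, which joins choices with `"/".join(choice)` and so requires string values, this sketch converts each choice with `str` first.

```python
import itertools
from pathlib import Path


def candidate_result_dirs(
    data_dir: str, schema: dict[str, list], model_ids: list[str]
) -> list[Path]:
    """List every directory where benchmark JSON files would be looked up."""
    root = Path(data_dir)
    dirs = []
    # One directory level per schema key, in the schema's key order.
    for choice in itertools.product(*schema.values()):
        for model_id in model_ids:
            # `model_id` looks like "org/name", which adds two more levels.
            dirs.append(root.joinpath(*map(str, choice)) / model_id)
    return dirs


# Hypothetical schema and model ID, for illustration only.
schema = {"GPU": ["A100", "H100"], "TP": ["1", "2"]}
dirs = candidate_result_dirs("data/example", schema, ["org/model"])
```

Because the schema's key order fixes the nesting order, reordering keys in `schema.yaml` would change which directories are searched.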
400
+ class DiffusionT2ITableManager(DiffusionTableManager):
401
+ """Diffusion table manager for text-to-image tasks."""
402
+
403
+ def get_tab_name(self) -> str:
404
+ return "Diffusion Text to image"
405
+
406
+ def get_intro_text(self) -> tuple[str, str]:
407
+ text = """
408
+ <h2>Diffusion text-to-image generation</h2></br>
409
+
410
+ <p style="font-size: 16px">
411
+ We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source diffusion models in terms of how much time and energy they consume for inference.
412
+ </p>
413
+
414
+ <p style="font-size: 16px">
415
+ The time and energy consumption of Diffusion models are affected by not only the size of the model, but also the number of denoising steps and the resolution of the generated images.
416
+ </p>
417
+ """
418
+ return "html", text
419
+
420
+ def get_detail_text(self) -> tuple[str, str]:
421
+ text = """
422
+ Columns
423
+ - **Model**: The name of the model.
424
+ - **Denoising params**: Number of parameters in the denoising module (e.g., UNet, Transformer).
425
+ - **Total params**: Total number of parameters in the model, including encoders and decoders.
426
+ - **GPU**: Name of the GPU model used for benchmarking.
427
+ - **Energy/image (J)**: Energy consumed per generated image in Joules.
428
+ - **Batch latency (s)**: Time taken to generate a batch of images in seconds.
429
+ - **Batch size**: Number of prompts/images in a batch.
430
+ - **Denoising steps**: Number of denoising steps used for the diffusion model.
431
+ - **Resolution**: Resolution of the generated image.
432
+
433
+ For more detailed information, please take a look at the **About** tab.
434
+ """
435
+ return "markdown", text
436
+
437
+ def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
438
+ return {"Batch latency (s)": (0.0, 60.0, 1.0, 10.0)}
439
+
440
+
441
+ class DiffusionT2VTableManager(DiffusionTableManager):
442
+ """Diffusion table manager for text-to-video tasks."""
443
+
444
+ def get_tab_name(self) -> str:
445
+ return "Diffusion Text to video"
446
+
447
+ def get_intro_text(self) -> tuple[str, str]:
448
+ text = """
449
+ <h2>Diffusion text-to-video generation</h2></br>
450
+
451
+ <p style="font-size: 16px">
452
+ We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source diffusion models in terms of how much time and energy they consume for inference.
453
+ </p>
454
+
455
+ <p style="font-size: 16px">
456
+ The time and energy consumption of Diffusion models are affected by not only the size of the model, but also the number of denoising steps, the resolution of the generated video, and the total number of frames in the video.
457
+ </p>
458
+ """
459
+ return "html", text
460
+
461
+ def get_detail_text(self) -> tuple[str, str]:
462
+ text = """
463
+ Columns
464
+ - **Model**: The name of the model.
465
+ - **Denoising params**: Number of parameters in the denoising module (e.g., UNet, Transformer).
466
+ - **Total params**: Total number of parameters in the model, including encoders and decoders.
467
+ - **GPU**: Name of the GPU model used for benchmarking.
468
+ - **Energy/video (J)**: Energy consumed per generated video in Joules.
469
+ - **Batch latency (s)**: Time taken to generate a batch of videos in seconds.
470
+ - **Batch size**: Number of prompts/videos in a batch.
471
+ - **Denoising steps**: Number of denoising steps used for the diffusion model.
472
+ - **Frames**: Number of frames in the generated video.
473
+ - **Resolution**: Resolution of the generated video.
474
+
475
+ For more detailed information, please take a look at the **About** tab.
476
+ """
477
+ return "markdown", text
478
+
479
+ def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
480
+ return {"Batch latency (s)": (0.0, 60.0, 1.0, 10.0)}
481
+
482
+
483
+ class DiffusionI2VTableManager(DiffusionTableManager):
484
+ """Diffusion table manager for image-to-video tasks."""
485
+
486
+ def get_tab_name(self) -> str:
487
+ return "Diffusion Image to video"
488
+
489
+ def get_intro_text(self) -> tuple[str, str]:
490
+ text = """
491
+ <h2>Diffusion image-to-video generation</h2></br>
492
+
493
+ <p style="font-size: 16px">
494
+ We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source diffusion models in terms of how much time and energy they consume for inference.
495
+ </p>
496
+
497
+ <p style="font-size: 16px">
498
+ The time and energy consumption of Diffusion models are affected by not only the size of the model, but also the number of denoising steps, the resolution of the generated video, and the total number of frames in the video.
499
+ </p>
500
+ """
501
+ return "html", text
502
+
503
+ def get_detail_text(self) -> tuple[str, str]:
504
+ text = """
505
+ Columns
506
+ - **Model**: The name of the model.
507
+ - **Denoising params**: Number of parameters in the denoising module (e.g., UNet, Transformer).
508
+ - **Total params**: Total number of parameters in the model, including encoders and decoders.
509
+ - **GPU**: Name of the GPU model used for benchmarking.
510
+ - **Energy/video (J)**: Energy consumed per generated video in Joules.
511
+ - **Batch latency (s)**: Time taken to generate a batch of videos in seconds.
512
+ - **Batch size**: Number of prompts/videos in a batch.
513
+ - **Denoising steps**: Number of denoising steps used for the diffusion model.
514
+ - **Frames**: Number of frames in the generated video.
515
+ - **Resolution**: Resolution of the generated video.
516
+
517
+ For more detailed information, please take a look at the **About** tab.
518
+ """
519
+ return "markdown", text
520
+
521
+ def get_benchmark_sliders(self) -> dict[str, tuple[float, float, float, float]]:
522
+ return {"Batch latency (s)": (0.0, 120.0, 1.0, 45.0)}
523
+
524
+
525
+ class LegacyTableManager:
526
  def __init__(self, data_dir: str) -> None:
527
+ """Load the legacy LLM leaderboard data from CSV files in data_dir.
528
 
529
  Inside `data_dir`, there should be:
530
  - `models.json`: a JSON file containing information about each model.
 
553
  f'<a style="text-decoration: underline; text-decoration-style: dotted" '
554
  f'target="_blank" href="{url}">{nickname}</a>'
555
  )
556
+
557
  df["model"] = df["model"].apply(format_model_link)
558
 
559
  # Sort by our 'energy efficiency' score.
 
606
  """Formats into HTML that prints in Monospace font."""
607
  return f"<pre style='font-family: monospace'>{text}</pre>"
608
 
609
  def get_dropdown(self):
610
  columns = self.full_df.columns.tolist()[1:]
611
  return [
 
635
  self.cur_index = index
636
  return self.cur_df
637
 
638
+ def get_intro_text(self) -> str:
639
+ """Return the leaderboard's introduction text in HTML."""
640
+ return """
641
+ <div align="center">
642
+ <h2 style="color: #23d175">This is the legacy ML.ENERGY LLM leaderboard. This will be removed by the end of the year.</h2>
643
+ </div>
644
+
645
+ <h3>How much energy do modern Large Language Models (LLMs) consume for inference?</h3>
646
+
647
+ <p style="font-size: 16px">
648
+ We used <a href="https://ml.energy/zeus">Zeus</a> to benchmark various open source LLMs in terms of how much time and energy they consume for inference.
649
+ </p>
650
+
651
+ <p style="font-size: 16px">
652
+ For more detailed information, please take a look at the <b>About</b> tab.
653
+ Every benchmark is limited in some sense -- Before you interpret the results, please take a look at the <b>Limitations</b> section there, too.
654
+ </p>
655
+ """
 
 
 
 
 
 
 
 
656
 
 
657
 
658
  # The global instance of the TableManager should only be used when
659
  # initializing components in the Gradio interface. If the global instance
660
  # is mutated while handling user sessions, the change will be reflected
661
  # in every user session. Instead, the instance provided by gr.State should
662
  # be used.
663
+ global_ltbm = LegacyTableManager("data/legacy")
664
+ global_tbms = [
665
+ LLMChatTableManager("data/llm_text_generation/chat", "Chat"),
666
+ LLMCodeTableManager("data/llm_text_generation/code", "Code"),
667
+ VLMChatTableManager("data/mllm_text_generation/chat", "Visual chat"),
668
+ DiffusionT2ITableManager("data/diffusion/text-to-image", "Text to image"),
669
+ DiffusionT2VTableManager("data/diffusion/text-to-video", "Text to video"),
670
+ DiffusionI2VTableManager("data/diffusion/image-to-video", "Image to video"),
671
+ ]
 
 
672
 
673
  # Custom JS.
674
  # XXX: This is a hack to make the model names clickable.
 
682
  dataframe_update_js = f"""
683
  function format_model_link() {{
684
  // Iterate over the cells of the first column of the leaderboard table.
685
+ var table_element = document.querySelectorAll(".tab-leaderboard");
686
+ for (var table of table_element) {{
687
+ for (let index = 1; index <= {len(global_ltbm.full_df) + sum(len(tbm.full_df) for tbm in global_tbms)}; index++) {{
688
+ // Get the cell from `table`.
689
+ var cell = table.querySelector(`div > div > div > table > tbody > tr:nth-child(${{index}}) > td:nth-child(1) > div > span`);
690
+ // var cell = document.querySelector(
691
+ // `.tab-leaderboard > div > div > div > table > tbody > tr:nth-child(${{index}}) > td:nth-child(1) > div > span`
692
+ // );
693
 
694
  // If nothing was found, it likely means that now the visible table has fewer rows
695
  // than the full table. This happens when the user filters the table. In this case,
 
713
  // Replace the innerHTML of the cell with the interpreted HTML.
714
  cell.replaceChildren(model_anchor);
715
  }}
716
+ }}
717
 
718
  // Return all arguments as is.
719
  return arguments
 
797
  }
798
  """
799
 
 
 
 
 
 
 
 
 
 
 
 
800
  # The app will not start without a controller address set.
801
  controller_addr = os.environ.get("COLOSSEUM_CONTROLLER_ADDR")
802
  if controller_addr is None:
803
  COLOSSEUM_UP = False
804
+ COLOSSEUM_DOWN_MESSAGE = "<br/><h2 style='text-align: center'>Local testing mode. Colosseum disabled.</h2>"
805
  controller_addr = "localhost"
806
  global_controller_client = ControllerClient(controller_addr=controller_addr, timeout=15)
807
 
808
+ # Fetch the latest update date of the leaderboard repository.
809
+ resp = requests.get("https://api.github.com/repos/ml-energy/leaderboard/commits/master")
810
+ if resp.status_code != 200:
811
+ current_date = "[Failed to fetch]"
812
+ print("Failed to fetch the latest release date of the leaderboard repository.")
813
+ print(resp.json())
814
+ else:
815
+ current_datetime = parser.parse(resp.json()["commit"]["author"]["date"])
816
+ current_date = current_datetime.astimezone(tz.gettz("US/Eastern")).strftime(
817
+ "%Y-%m-%d"
818
+ )
819
+
820
  # Load the list of models. To reload, the app should be restarted.
821
  RANDOM_MODEL_NAME = "Random"
822
  RANDOM_USER_PREFERENCE = "Two random models"
 
825
  model_name_to_user_pref[RANDOM_MODEL_NAME] = RANDOM_USER_PREFERENCE
826
  user_pref_to_model_name = {v: k for k, v in model_name_to_user_pref.items()}
827
 
828
+
829
  # Colosseum helper functions.
830
+ def enable_interact(num: int):
831
+ def inner():
832
+ return [gr.update(interactive=True)] * num
833
+ return inner
834
+
835
+
836
+ def disable_interact(num: int):
837
+ def inner():
838
+ return [gr.update(interactive=False)] * num
839
+ return inner
840
 
 
 
841
 
842
  def consumed_less_energy_message(energy_a, energy_b):
843
  """Return a message that indicates that the user chose the model that consumed less energy.
 
850
  how_much = f"{1 / factor:.1f}x" if factor <= 0.5 else f"{100 - factor * 100:.1f}%"
851
  return f"<h2>That response also <span class='green-text'>consumed {how_much} less energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J)!</h2>"
852
 
853
+
854
  def consumed_more_energy_message(energy_a, energy_b):
855
  """Return a message that indicates that the user chose the model that consumed more energy.
856
 
 
862
  how_much = f"{factor:.1f}x" if factor >= 2.0 else f"{factor * 100 - 100:.1f}%"
863
  return f"<h2>That response <span class='red-text'>consumed {how_much} more energy</span> ({energy_a:,.0f} J vs. {energy_b:,.0f} J).</h2>"
864
 
865
+
866
  # Colosseum event handlers
867
  def on_load():
868
  """Initialize the dataframe, shuffle the model preference dropdown choices."""
869
+ dataframe = global_ltbm.set_filter_get_df()
870
+ dataframes = [global_tbm.set_filter_get_df() for global_tbm in global_tbms]
871
  available_models = copy.deepcopy(global_available_models)
872
  random.shuffle(available_models)
873
  available_models.insert(0, RANDOM_MODEL_NAME)
874
+ return (
875
+ dataframe,
876
+ *dataframes,
877
+ gr.Dropdown.update(
878
+ choices=[model_name_to_user_pref[model] for model in available_models]
879
+ ),
880
+ )
881
+
882
 
883
  def add_prompt_disable_submit(prompt, history_a, history_b):
884
  """Add the user's prompt to the two model's history and disable further submission."""
 
892
  client,
893
  ]
894
 
895
+
896
  def generate_responses(client: ControllerClient, user_preference, history_a, history_b):
897
  """Generate responses for the two models."""
898
  model_preference = user_pref_to_model_name[user_preference]
899
  for resp_a, resp_b in itertools.zip_longest(
900
+ client.prompt(
901
+ prompt=history_a[-1][0], index=0, model_preference=model_preference
902
+ ),
903
+ client.prompt(
904
+ prompt=history_b[-1][0], index=1, model_preference=model_preference
905
+ ),
906
  ):
907
  if resp_a is not None:
908
  history_a[-1][1] += resp_a
 
910
  history_b[-1][1] += resp_b
911
  yield [history_a, history_b]
912
 
913
+
914
  def make_resp_vote_func(victory_index: Literal[0, 1]):
915
  """Return a function that will be called when the user clicks on response preference vote buttons."""
916
+
917
  def resp_vote_func(client: ControllerClient):
918
  vote_response = client.response_vote(victory_index=victory_index)
919
  model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
 
948
  # Keep the reset button disabled
949
  gr.Button.update(visible=False, interactive=False),
950
  ]
951
+
952
  return resp_vote_func
953
 
954
+
955
  def make_energy_vote_func(is_worth: bool):
956
  """Return a function that will be called when the user clicks on energy vote buttons."""
957
+
958
  def energy_vote_func(client: ControllerClient, energy_message: str):
959
  vote_response = client.energy_vote(is_worth=is_worth)
960
  model_name_a, model_name_b = map(lambda n: f"## {n}", vote_response.model_names)
 
968
  # Append to the energy comparison message
969
  energy_message[:-5] + (" Fair enough.</h2>" if is_worth else " Wasn't worth it.</h2>"),
970
  ]
971
+
972
  return energy_vote_func
973
 
974
+
975
  def play_again():
976
  available_models = copy.deepcopy(global_available_models)
977
  random.shuffle(available_models)
 
986
  # Hide energy vote buttons and message
987
  gr.Button.update(visible=False), gr.Button.update(visible=False), gr.Markdown.update(visible=False),
988
  # Enable model preference dropdown and shuffle choices
989
+ gr.Dropdown.update(
990
+ value=RANDOM_USER_PREFERENCE,
991
+ choices=[model_name_to_user_pref[model] for model in available_models],
992
+ interactive=True,
993
+ ),
994
  # Disable reset button
995
  gr.Button.update(interactive=False, visible=False),
996
  ]
997
 
998
+
999
  focus_prompt_input_js = """
1000
  function() {
1001
  for (let textarea of document.getElementsByTagName("textarea")) {
 
1008
  """
1009
 
1010
  with gr.Blocks(css=custom_css) as block:
1011
+ tbm = gr.State(global_ltbm) # type: ignore
1012
+ local_tbms: list[TableManager] = [gr.State(global_tbm) for global_tbm in global_tbms] # type: ignore
1013
+
1014
  with gr.Box():
1015
+ gr.HTML(
1016
+ "<h1><a href='https://ml.energy' class='text-logo'>ML.ENERGY</a> Leaderboard</h1>"
1017
+ )
1018
 
1019
  with gr.Tabs():
1020
  # Tab: Colosseum.
1021
+ with gr.Tab("Colosseum ⚔️️"):
1022
  if COLOSSEUM_UP:
1023
  gr.Markdown(open("docs/colosseum_top.md").read())
1024
  else:
 
1058
  resp_vote_btn_list: list[gr.component.Component] = []
1059
  with gr.Column():
1060
  with gr.Row():
1061
+ masked_model_names.append(
1062
+ gr.Markdown(visible=False, elem_classes=["model-name-text"])
1063
+ )
1064
  with gr.Row():
1065
+ chatbots.append(
1066
+ gr.Chatbot(
1067
+ label="Model A",
1068
+ elem_id="chatbot",
1069
+ height=400,
1070
+ elem_classes=None if COLOSSEUM_UP else ["greyed-out"],
1071
+ )
1072
+ )
1073
  with gr.Row():
1074
+ left_resp_vote_btn = gr.Button(
1075
+ value="👈 Model A is better", interactive=False
1076
+ )
1077
  resp_vote_btn_list.append(left_resp_vote_btn)
1078
 
1079
  with gr.Column():
1080
  with gr.Row():
1081
+ masked_model_names.append(
1082
+ gr.Markdown(visible=False, elem_classes=["model-name-text"])
1083
+ )
1084
  with gr.Row():
1085
+ chatbots.append(
1086
+ gr.Chatbot(
1087
+ label="Model B",
1088
+ elem_id="chatbot",
1089
+ height=400,
1090
+ elem_classes=None if COLOSSEUM_UP else ["greyed-out"],
1091
+ )
1092
+ )
1093
  with gr.Row():
1094
+ right_resp_vote_btn = gr.Button(
1095
+ value="👉 Model B is better", interactive=False
1096
+ )
1097
  resp_vote_btn_list.append(right_resp_vote_btn)
1098
 
1099
  with gr.Row():
1100
  energy_comparison_message = gr.HTML(visible=False)
1101
 
1102
  with gr.Row():
1103
+ worth_energy_vote_btn = gr.Button(
1104
+ value="The better response was worth 👍 the extra energy.",
1105
+ visible=False,
1106
+ )
1107
+ notworth_energy_vote_btn = gr.Button(
1108
+ value="Not really worth that much more. 👎", visible=False
1109
+ )
1110
+ energy_vote_btn_list: list[gr.component.Component] = [
1111
+ worth_energy_vote_btn,
1112
+ notworth_energy_vote_btn,
1113
+ ]
1114
 
1115
  with gr.Row():
1116
+ play_again_btn = gr.Button(
1117
+ "Play again!", visible=False, elem_classes=["btn-submit"]
1118
+ )
1119
 
1120
  gr.Markdown(open("docs/colosseum_bottom.md").read())
1121
 
 
1125
  (prompt_input
1126
  .submit(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
1127
  .then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
1128
+ .then(enable_interact(2), None, resp_vote_btn_list, queue=False))
1129
  (prompt_submit_btn
1130
  .click(add_prompt_disable_submit, [prompt_input, *chatbots], [prompt_input, prompt_submit_btn, model_preference_dropdown, *chatbots, controller_client], queue=False)
1131
  .then(generate_responses, [controller_client, model_preference_dropdown, *chatbots], [*chatbots], queue=True, show_progress="hidden")
1132
+ .then(enable_interact(2), None, resp_vote_btn_list, queue=False))
1133
 
1134
  left_resp_vote_btn.click(
1135
  make_resp_vote_func(victory_index=0),
 
1166
  )
1167
  .then(None, _js=focus_prompt_input_js, queue=False))
1168
 
1169
+ # Tab: Leaderboards.
1170
+ dataframes = []
1171
+ for global_tbm, local_tbm in zip(global_tbms, local_tbms):
1172
+ with gr.Tab(global_tbm.get_tab_name()):
1173
+ # Box: Introduction text.
1174
+ with gr.Box():
1175
+ intro_text_type, intro_text = global_tbm.get_intro_text()
1176
+ if intro_text_type not in ["markdown", "html"]:
1177
+ raise ValueError(f"Invalid text type '{intro_text_type}' from {local_tbm}")
1178
+ if intro_text_type == "markdown":
1179
+ gr.Markdown(intro_text)
1180
+ else:
1181
+ gr.HTML(intro_text)
1182
+
1183
+ # Block: Checkboxes and sliders to select benchmarking parameters.
1184
+ with gr.Row():
1185
+ checkboxes: list[gr.CheckboxGroup] = []
1186
+ for key, choices in global_tbm.get_benchmark_checkboxes().items():
1187
+ # Check the first element by default.
1188
+ checkboxes.append(gr.CheckboxGroup(choices=choices, value=choices[:1], label=key))
1189
+
1190
+ sliders: list[gr.Slider] = []
1191
+ for key, (min_val, max_val, step, default) in global_tbm.get_benchmark_sliders().items():
1192
+ sliders.append(gr.Slider(minimum=min_val, maximum=max_val, value=default, step=step, label=key))
1193
+
1194
+ # Block: Leaderboard table.
1195
+ with gr.Row():
1196
+ dataframe = gr.Dataframe(
1197
+ type="pandas",
1198
+ elem_classes=["tab-leaderboard"],
1199
+ interactive=False,
1200
+ )
1201
+ dataframes.append(dataframe)
1202
 
1203
+ # Make sure the models have clickable links.
1204
+ dataframe.change(
1205
+ None, None, None, _js=dataframe_update_js, queue=False
1206
+ )
1207
+ # Table automatically updates when users check or uncheck any checkbox or move any slider.
1208
+ for element in [*checkboxes, *sliders]:
1209
+ element.change(
1210
+ global_tbm.__class__.set_filter_get_df,
1211
+ inputs=[local_tbm, *checkboxes, *sliders],
1212
+ outputs=dataframe,
1213
+ queue=False,
1214
+ )
1215
+
1216
+ # Block: More details about the leaderboard.
1217
+ with gr.Box():
1218
+ detail_text_type, detail_text = global_tbm.get_detail_text()
1219
+ if detail_text_type not in ["markdown", "html"]:
1220
+ raise ValueError(f"Invalid text type '{detail_text_type}' from {local_tbm}")
1221
+ if detail_text_type == "markdown":
1222
+ gr.Markdown(detail_text)
1223
+ else:
1224
+ gr.HTML(detail_text)
1225
+
1226
+ # Block: Leaderboard date.
1227
+ with gr.Row():
1228
+ gr.HTML(
1229
+ f"<h3 style='color: gray'>Last updated: {current_date}</h3>"
1230
+ )
1231
+
1232
+ # Tab: Legacy leaderboard.
1233
+ with gr.Tab("LLM Leaderboard (legacy)"):
1234
  with gr.Box():
1235
+ gr.HTML(global_ltbm.get_intro_text())
1236
 
1237
  # Block: Checkboxes to select benchmarking parameters.
1238
  with gr.Row():
1239
  with gr.Box():
1240
  gr.Markdown("### Benchmark results to show")
1241
  checkboxes: list[gr.CheckboxGroup] = []
1242
+ for key, choices in global_ltbm.schema.items():
1243
  # Specifying `value` makes everything checked by default.
1244
+ checkboxes.append(
1245
+ gr.CheckboxGroup(
1246
+ choices=choices, value=choices[:1], label=key
1247
+ )
1248
+ )
1249
 
1250
  # Block: Leaderboard table.
1251
  with gr.Row():
1252
+ dataframe = gr.Dataframe(
1253
+ type="pandas", elem_classes=["tab-leaderboard"], interactive=False
1254
+ )
1255
  # Make sure the models have clickable links.
1256
  dataframe.change(None, None, None, _js=dataframe_update_js, queue=False)
1257
  # Table automatically updates when users check or uncheck any checkbox.
1258
  for checkbox in checkboxes:
1259
+ checkbox.change(
1260
+ LegacyTableManager.set_filter_get_df,
1261
+ inputs=[tbm, *checkboxes],
1262
+ outputs=dataframe,
 
1263
  queue=False,
1264
  )
1265
 
 
1269
 
1270
  # Tab: About page.
1271
  with gr.Tab("About"):
1272
+ gr.Markdown(open("docs/about.md").read())
 
1273
 
1274
  # Citation
1275
  with gr.Accordion("📚 Citation", open=False, elem_id="citation-header"):
 
1283
  )
1284
 
1285
  # Load the table on page load.
1286
+ block.load(
1287
+ on_load,
1288
+ outputs=[dataframe, *dataframes, model_preference_dropdown],
1289
+ queue=False,
1290
+ )
1291
 
1292
 
1293
  if __name__ == "__main__":
1294
  parser = argparse.ArgumentParser()
1295
+ parser.add_argument(
1296
+ "--share", action="store_true", help="Specify if sharing is enabled"
1297
+ )
1298
  parser.add_argument("--concurrency", type=int, default=50)
1299
 
1300
  args = parser.parse_args()
1301
+ block.queue(concurrency_count=args.concurrency, api_open=False).launch(
1302
+ share=args.share, show_error=True
1303
+ )
benchmark/.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ **/results/
benchmark/README.md ADDED
@@ -0,0 +1,14 @@
1
+ # ML.ENERGY Leaderboard Benchmark Suite
2
+
3
+ ```
4
+  benchmark/
5
+ ├──  common/
6
+ ├──  diffusion/
7
+ │ └──  text-to-image/
8
+ └──  llm_text_generation/
9
+ ├──  chat/
10
+ └──  code/
11
+ ```
12
+
13
+ The `common` directory is for utilities that are common to all benchmarking tasks.
14
+ Other than that, there is one directory for each type of model and subdirectories for more specific tasks.
benchmark/common/download_weights.sh ADDED
@@ -0,0 +1,7 @@
1
+ #!/usr/bin/env bash
2
+
3
+ QUEUE_FILE="$1"
4
+
5
+ for model in $(tail -n +4 "$QUEUE_FILE" | awk '{print $2}'); do
6
+ HF_HOME=/data/leaderboard/hfcache huggingface-cli download "$model" --revision "$(cat "models/$model/revision.txt")"
7
+ done
benchmark/common/start_nvml_container.sh ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ #!/usr/bin/env bash
2
+
3
+ docker run -dit --gpus all --cap-add SYS_ADMIN --name nvml nvidia/cuda:12.3.1-base-ubuntu22.04 bash
benchmark/diffusion/image-to-video/.dockerignore ADDED
@@ -0,0 +1 @@
 
 
1
+ README.md
benchmark/diffusion/image-to-video/Dockerfile ADDED
@@ -0,0 +1,20 @@
1
+ FROM nvidia/cuda:12.1.0-base-ubuntu22.04
2
+
3
+ # Basic installs
4
+ ARG DEBIAN_FRONTEND=noninteractive
5
+ ENV TZ='America/Detroit'
6
+ RUN apt-get update -qq \
7
+ && apt-get -y --no-install-recommends install python3-pip \
8
+ && apt-get clean all \
9
+ && rm -r /var/lib/apt/lists/*
10
+
11
+ # HuggingFace cache dir
12
+ ENV HF_HOME=/root/.cache/huggingface
13
+
14
+ # Copy over benchmark suite and install dependencies
15
+ ADD . /workspace/image-to-video
16
+ WORKDIR /workspace/image-to-video
17
+ RUN pip install -r requirements.txt
18
+
19
+ # Benchmark script to run
20
+ ENTRYPOINT ["python3", "scripts/benchmark_one_datapoint.py"]
benchmark/diffusion/image-to-video/README.md ADDED
@@ -0,0 +1,51 @@
1
+ # Diffusion model (Image to Video)
2
+
3
+ This benchmark suite benchmarks diffusion models with the image-to-video task.
4
+
5
+ ## Setup
6
+
7
+ ### Docker images
8
+
9
+ ```sh
10
+ docker build -t mlenergy/leaderboard:diffusion-i2v .
11
+ ```
12
+
13
+ ### HuggingFace cache directory
14
+
15
+ The scripts assume the HuggingFace cache directory will be under `/data/leaderboard/hfcache` on the node that runs this benchmark.
16
+
17
+ ## Benchmarking
18
+
19
+ ### Obtaining one datapoint
20
+
21
+ The Docker image we've built runs `python3 scripts/benchmark_one_datapoint.py` as its `ENTRYPOINT`.
22
+
23
+ ```sh
24
+ docker run \
25
+ --gpus '"device=0"' \
26
+ --cap-add SYS_ADMIN \
27
+ -v /data/leaderboard/hfcache:/root/.cache/huggingface \
28
+ -v $(pwd):/workspace/image-to-video \
29
+ mlenergy/leaderboard:diffusion-i2v \
30
+ --result-root results \
31
+ --batch-size 2 \
32
+ --power-limit 300 \
33
+ --save-every 5 \
34
+ --model ali-vilab/i2vgen-xl \
35
+ --dataset-path sharegpt4video/sharegpt4video_100.json \
36
+ --add-text-prompt \
37
+ --num-frames 16 \
38
+ --fps 16 \
39
+ --huggingface-token $HF_TOKEN
40
+ ```
41
+
42
+ ### Obtaining all datapoints for a single model
43
+
44
+ Export your HuggingFace hub token as environment variable `$HF_TOKEN`.
45
+
46
+ Run `scripts/benchmark_one_model.py`.
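
As a rough sketch of what that script does (based on `benchmark_one_model.py` in this commit): it walks through the batch sizes in the order given and, for each one, through every power limit, launching one container per combination. A non-zero exit code is treated as a probable out-of-memory failure, and the sweep moves on to the next batch size. The `run_one` callback and `fake_run` helper below are hypothetical stand-ins for the actual `docker run` invocation.

```python
from typing import Callable, Iterable


def sweep(
    batch_sizes: Iterable[int],
    power_limits: Iterable[int],
    run_one: Callable[[int, int], int],
) -> list[tuple[int, int, int]]:
    """Run every (batch size, power limit) pair and record exit codes.

    `run_one` is a hypothetical stand-in for launching the benchmark
    container; it returns the process exit code.
    """
    power_limits = list(power_limits)  # Reused for every batch size.
    log = []
    for batch_size in batch_sizes:
        for power_limit in power_limits:
            returncode = run_one(batch_size, power_limit)
            log.append((batch_size, power_limit, returncode))
            # A non-zero exit is assumed to be an OOM: skip the remaining
            # power limits and move on to the next batch size.
            if returncode != 0:
                break
    return log


def fake_run(batch_size: int, power_limit: int) -> int:
    """Pretend that batch size 8 does not fit in GPU memory."""
    return 1 if batch_size == 8 else 0


print(sweep([8, 4], [400, 300], fake_run))
```

Listing batch sizes from largest to smallest, as the queue files do, means a failure at a large batch size costs only one run before the sweep falls back to a smaller one.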
47
+
48
+ ### Running the entire suite with Pegasus
49
+
50
+ You can use [`pegasus`](https://github.com/jaywonchung/pegasus) to run the entire benchmark suite.
51
+ Queue and host files are in [`./pegasus`](./pegasus).
benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/kwargs.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "torch_dtype": "torch.float16",
3
+ "variant": "fp16"
4
+ }
benchmark/diffusion/image-to-video/models/ali-vilab/i2vgen-xl/revision.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ 39e1979ea27be737b0278c06755e321f2b4360d5
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/kwargs.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "torch_dtype": "torch.float16",
3
+ "variant": "fp16"
4
+ }
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid-xt/revision.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ 9e43909513c6714f1bc78bcb44d96e733cd242aa
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/kwargs.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "torch_dtype": "torch.float16",
3
+ "variant": "fp16"
4
+ }
benchmark/diffusion/image-to-video/models/stabilityai/stable-video-diffusion-img2vid/revision.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ 9cf024d5bfa8f56622af86c884f26a52f6676f2e
benchmark/diffusion/image-to-video/pegasus/A100/hosts_1gpu.yaml ADDED
@@ -0,0 +1,11 @@
1
+ - hostname:
2
+ - localhost
3
+ gpu:
4
+ - 0
5
+ - 1
6
+ - 2
7
+ - 3
8
+ - 4
9
+ - 5
10
+ - 6
11
+ - 7
benchmark/diffusion/image-to-video/pegasus/A100/queue_1gpu.yaml ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ - command:
2
+ - "python scripts/benchmark_one_model.py {{ model }} --result-root results/joule --dataset-path sharegpt4video/sharegpt4video_100.json --gpu-ids {{ gpu }} --batch-sizes 8 4 2 1 --power-limits 400 --num-inference-steps 25"
3
+ model:
4
+ - '--model ali-vilab/i2vgen-xl --num-frames 16 --add-text-prompt'
5
+ - '--model stabilityai/stable-video-diffusion-img2vid --num-frames 14'
6
+ - '--model stabilityai/stable-video-diffusion-img2vid-xt --num-frames 25'
benchmark/diffusion/image-to-video/pegasus/H100/hosts_1gpu.yaml ADDED
@@ -0,0 +1,11 @@
1
+ - hostname:
2
+ - localhost
3
+ gpu:
4
+ - 0
5
+ - 1
6
+ - 2
7
+ - 3
8
+ - 4
9
+ - 5
10
+ - 6
11
+ - 7
benchmark/diffusion/image-to-video/pegasus/H100/queue_1gpu.yaml ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ - command:
2
+ - "python scripts/benchmark_one_model.py {{ model }} --result-root results/joule --dataset-path sharegpt4video/sharegpt4video_700.json --gpu-ids {{ gpu }} --batch-sizes 64 32 16 8 4 2 1 --power-limits 700 --num-inference-steps 25"
3
+ model:
4
+ - '--model ali-vilab/i2vgen-xl --num-frames 16 --add-text-prompt'
5
+ - '--model stabilityai/stable-video-diffusion-img2vid --num-frames 14'
6
+ - '--model stabilityai/stable-video-diffusion-img2vid-xt --num-frames 25'
benchmark/diffusion/image-to-video/requirements.txt ADDED
@@ -0,0 +1,7 @@
1
+ torch
2
+ diffusers==0.29.2
3
+ accelerate
4
+ transformers
5
+ pillow
6
+ nvidia-ml-py
7
+ zeus-ml
benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_data.py ADDED
@@ -0,0 +1,38 @@
1
+ import json
2
+ from glob import glob
3
+ from pathlib import Path
4
+
5
+ import tyro
6
+
7
+
8
+ FIELDS = {
9
+ "model": "Model",
10
+ "gpu_model": "GPU",
11
+ "energy_per_video": "Energy/video (J)",
12
+ "average_batch_latency": "Batch latency (s)",
13
+ "batch_size": "Batch size",
14
+ "num_inference_steps": "Denoising steps",
15
+ "num_frames": "Frames",
16
+ }
17
+
18
+ def main(results_dir: Path, output_dir: Path) -> None:
19
+ print(f"{results_dir} -> {output_dir}")
20
+
21
+ for model_dir in sorted(glob(f"{results_dir}/*/*")):
22
+ model_name = "/".join(model_dir.split("/")[-2:])
23
+ print(f" {model_name}")
24
+ (output_dir / model_name).mkdir(parents=True, exist_ok=True)
25
+ for file in sorted(glob(f"{model_dir}/bs*+results.json")):
26
+ raw_data = json.load(open(file))
27
+ raw_data["energy_per_video"] = raw_data["average_batch_energy"] / raw_data["batch_size"]
28
+
29
+ data = {}
30
+ for field1, field2 in FIELDS.items():
31
+ data[field2] = raw_data.pop(field1)
32
+
33
+ filename = f"bs{data['Batch size']}+steps{data['Denoising steps']}+frames{data['Frames']}.json"
34
+ json.dump(data, open(output_dir / model_name / filename, "w"), indent=2)
35
+
36
+
37
+ if __name__ == "__main__":
38
+ tyro.cli(main)
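
The per-file transformation in the script above can be sketched in isolation as follows. This is a minimal illustration, not the script itself: the helper name `to_leaderboard_row` and all numbers in `example` are made up. Note that `Results` in `benchmark_one_datapoint.py` spells its field `num_infernece_steps`, so the sketch accepts either spelling; that fallback is an assumption of this sketch and is not present in the original script.

```python
FIELDS = {
    "model": "Model",
    "gpu_model": "GPU",
    "energy_per_video": "Energy/video (J)",
    "average_batch_latency": "Batch latency (s)",
    "batch_size": "Batch size",
    "num_inference_steps": "Denoising steps",
    "num_frames": "Frames",
}


def to_leaderboard_row(raw: dict) -> dict:
    """Convert one raw results dict into a leaderboard row (hypothetical helper)."""
    raw = dict(raw)  # Work on a copy so the caller's dict is untouched.
    # Assumption of this sketch: tolerate the misspelled key.
    if "num_inference_steps" not in raw and "num_infernece_steps" in raw:
        raw["num_inference_steps"] = raw.pop("num_infernece_steps")
    # Energy is measured per batch, so divide by the batch size.
    raw["energy_per_video"] = raw["average_batch_energy"] / raw["batch_size"]
    return {label: raw[key] for key, label in FIELDS.items()}


# Made-up values for illustration only.
example = {
    "model": "ali-vilab/i2vgen-xl",
    "gpu_model": "example-gpu",
    "average_batch_energy": 1200.0,
    "average_batch_latency": 12.5,
    "batch_size": 4,
    "num_infernece_steps": 25,
    "num_frames": 16,
}

print(to_leaderboard_row(example))
```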
benchmark/diffusion/image-to-video/scripts/aggregate_leaderboard_models.py ADDED
@@ -0,0 +1,36 @@
1
+ import json
2
+ from glob import glob
3
+ from pathlib import Path
4
+
5
+ import tyro
6
+
7
+ def raw_params_to_readable(params: int) -> str:
8
+ return f"{params/1e9:.1f}B"
9
+
10
+ def main(results_dir: Path, output_file: Path) -> None:
11
+ output_file.parent.mkdir(parents=True, exist_ok=True)
12
+ print(f"{results_dir} -> {output_file}")
13
+
14
+ models = {}
15
+ for model_dir in sorted(glob(f"{results_dir}/*/*")):
16
+ model_name = "/".join(model_dir.split("/")[-2:])
17
+ print(f" {model_name}")
18
+ result_file_cand = glob(f"{model_dir}/bs1+*+results.json")
19
+ assert len(result_file_cand) == 1, model_name
20
+ results_data = json.load(open(result_file_cand[0]))
21
+ denosing_module_name = "unet" if "unet" in results_data["num_parameters"] else "transformer"
22
+ model_info = dict(
23
+ url=f"https://huggingface.co/{model_name}",
24
+ nickname=model_name.split("/")[-1].replace("-", " ").title(),
25
+ total_params=raw_params_to_readable(sum(results_data["num_parameters"].values())),
26
+ denoising_params=raw_params_to_readable(results_data["num_parameters"][denosing_module_name]),
27
+ resolution="NA",
28
+ )
29
+ assert model_name not in models
30
+ models[model_name] = model_info
31
+
32
+ json.dump(models, open(output_file, "w"), indent=2)
33
+
34
+
35
+ if __name__ == "__main__":
36
+ tyro.cli(main)
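
To make the two string conversions in the script above concrete, here is a small standalone sketch. The `nickname` helper is a hypothetical wrapper around the expression used for the `nickname` field, and the parameter count passed in is an arbitrary number chosen for illustration, not a measured value.

```python
def raw_params_to_readable(params: int) -> str:
    """Format a raw parameter count in billions with one decimal place."""
    return f"{params / 1e9:.1f}B"


def nickname(model_name: str) -> str:
    """Derive a display name from the last component of a Hub ID."""
    return model_name.split("/")[-1].replace("-", " ").title()


print(nickname("stabilityai/stable-video-diffusion-img2vid-xt"))
print(nickname("ali-vilab/i2vgen-xl"))
print(raw_params_to_readable(1_524_000_000))  # arbitrary example count
```

Because `str.title()` starts a new word after any non-letter character, digits inside a name also trigger capitalization: `img2vid` becomes `Img2Vid` and `i2vgen` becomes `I2Vgen`.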
benchmark/diffusion/image-to-video/scripts/benchmark_one_datapoint.py ADDED
@@ -0,0 +1,300 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import inspect
5
+ import argparse
6
+ from pprint import pprint
7
+ from pathlib import Path
8
+ from contextlib import suppress
9
+ from dataclasses import dataclass, field, asdict
10
+ from typing import Any
11
+
12
+ import torch
13
+ import pynvml
14
+ import numpy as np
15
+ from PIL import Image
16
+ from transformers.trainer_utils import set_seed
17
+ from diffusers import ModelMixin, DiffusionPipeline # type: ignore
18
+ from diffusers.utils import load_image, export_to_gif # pyright: reportPrivateImportUsage=false
19
+ from zeus.monitor import ZeusMonitor
20
+
21
+ # Disable torch gradients globally
22
+ torch.set_grad_enabled(False)
23
+
24
+
25
+ @dataclass
26
+ class Results:
27
+ model: str
28
+ num_parameters: dict[str, int]
29
+ gpu_model: str
30
+ num_infernece_steps: int
31
+ num_frames: int
32
+ power_limit: int
33
+ batch_size: int
34
+ num_prompts: int
35
+ total_runtime: float = 0.0
36
+ total_energy: float = 0.0
37
+ average_batch_latency: float = 0.0
38
+ average_images_per_second: float = 0.0
39
+ average_batch_energy: float = 0.0
40
+ average_power_consumption: float = 0.0
41
+ peak_memory: float = 0.0
42
+ results: list[Result] = field(default_factory=list, repr=False)
43
+
44
+
45
+ @dataclass
46
+ class ResultIntermediateBatched:
47
+ prompts: list[str]
48
+ images: list[Image.Image]
49
+ batch_latency: float = 0.0
50
+ batch_energy: float = 0.0
51
+ frames: np.ndarray | list[list[Image.Image]] = np.empty(0)
52
+
53
+
54
+ @dataclass
55
+ class Result:
56
+ batch_latency: float
57
+ sample_energy: float
58
+ prompt: str
59
+ video_path: str | None
60
+
61
+
62
+ def get_pipeline(model_id: str):
63
+ """Instantiate a Diffusers pipeline from a model's HuggingFace Hub ID."""
64
+ # Load args to give to `from_pretrained` from the model's kwargs.json file
65
+ kwargs = json.load(open(f"models/{model_id}/kwargs.json"))
66
+ with suppress(KeyError):
67
+ kwargs["torch_dtype"] = eval(kwargs["torch_dtype"])
68
+
69
+ # Add additional args
70
+ kwargs["safety_checker"] = None
71
+ kwargs["revision"] = open(f"models/{model_id}/revision.txt").read().strip()
72
+
73
+ pipeline = DiffusionPipeline.from_pretrained(model_id, **kwargs).to("cuda:0")
74
+ print("\nInstantiated pipeline via DiffusionPipeline:\n", pipeline)
75
+
76
+ return pipeline
77
+
78
+
79
+ def load_text_image_prompts(
80
+ path: str,
81
+ batch_size: int,
82
+ num_batches: int | None = None,
83
+ ) -> tuple[int, list[tuple[list[str], list[Image.Image]]]]:
84
+ """Load the dataset to feed the model and return it as a list of batches of prompts.
85
+
86
+ Depending on the batch size, the final batch may not be full. The final batch
87
+ is dropped in that case. If `num_batches` is not None, only that many batches
88
+ are returned. If `num_batches` is None, all batches are returned.
89
+
90
+ Returns:
91
+ Total number of prompts and a list of batches of prompts.
92
+ """
93
+ dataset = json.load(open(path))
94
+ assert len(dataset["caption"]) == len(dataset["video_id"])
95
+
96
+ if num_batches is not None:
97
+ if len(dataset["caption"]) < num_batches * batch_size:
98
+ raise ValueError("Not enough data for the requested number of batches.")
99
+ dataset["caption"] = dataset["caption"][: num_batches * batch_size]
100
+ dataset["video_id"] = dataset["video_id"][: num_batches * batch_size]
101
+
102
+ image_path = Path(path).parent / "first_frame"
103
+ dataset["first_frame"] = [
104
+ load_image(str(image_path / f"{video_id}.jpg")) for video_id in dataset["video_id"]
105
+ ]
106
+
107
+ batched = [
108
+ (dataset["caption"][i : i + batch_size], dataset["first_frame"][i : i + batch_size])
109
+ for i in range(0, len(dataset["caption"]), batch_size)
110
+ ]
111
+ if len(batched[-1][0]) < batch_size:
112
+ batched.pop()
113
+
114
+ return len(batched) * batch_size, batched
115
+
116
+
117
+ def count_parameters(pipeline) -> dict[str, int]:
118
+ """Count the number of parameters in the given pipeline."""
119
+ num_params = {}
120
+ for name, attr in vars(pipeline).items():
121
+ if isinstance(attr, ModelMixin):
122
+ num_params[name] = attr.num_parameters(only_trainable=False, exclude_embeddings=True)
123
+ elif isinstance(attr, torch.nn.Module):
124
+ num_params[name] = sum(p.numel() for p in attr.parameters())
125
+ return num_params
126
+
127
+
128
+ def benchmark(args: argparse.Namespace) -> None:
129
+ if args.model.startswith("models/"):
130
+ args.model = args.model[len("models/") :]
131
+ if args.model.endswith("/"):
132
+ args.model = args.model[:-1]
133
+
134
+ set_seed(args.seed)
135
+
136
+ results_dir = Path(args.result_root) / args.model
137
+ results_dir.mkdir(parents=True, exist_ok=True)
138
+ benchmark_name = str(results_dir / f"bs{args.batch_size}+pl{args.power_limit}")
139
+ video_dir = results_dir / f"bs{args.batch_size}+pl{args.power_limit}+generated"
140
+ video_dir.mkdir(exist_ok=True)
141
+
142
+ arg_out_filename = f"{benchmark_name}+args.json"
143
+ with open(arg_out_filename, "w") as f:
144
+ f.write(json.dumps(vars(args), indent=2))
145
+ print(args)
146
+ print("Benchmark args written to", arg_out_filename)
147
+
148
+ zeus_monitor = ZeusMonitor()
149
+
150
+ pynvml.nvmlInit()
151
+ handle = pynvml.nvmlDeviceGetHandleByIndex(0)
152
+ gpu_model = pynvml.nvmlDeviceGetName(handle)
153
+ pynvml.nvmlDeviceSetPersistenceMode(handle, pynvml.NVML_FEATURE_ENABLED)
154
+ pynvml.nvmlDeviceSetPowerManagementLimit(handle, args.power_limit * 1000)
155
+ pynvml.nvmlShutdown()
156
+
157
+ num_prompts, batched_prompts = load_text_image_prompts(args.dataset_path, args.batch_size, args.num_batches)
158
+
159
+ pipeline = get_pipeline(args.model)
160
+
161
+ # Warmup
162
+ print("Warming up with two batches...")
163
+ for i in range(2):
164
+ params: dict[str, Any] = dict(
165
+ image=batched_prompts[i][1],
166
+ num_frames=args.num_frames,
167
+ num_inference_steps=args.num_inference_steps,
168
+ )
169
+ if args.add_text_prompt:
170
+ params["prompt"] = batched_prompts[i][0]
171
+
172
+ _ = pipeline(**params)
173
+
174
+ rng = torch.manual_seed(args.seed)
175
+
176
+ # Some models require a text prompt alongside the image (e.g., I2VGen-XL)
177
+ # Otherwise, `prompts` will not be passed to the model.
178
+ intermediates: list[ResultIntermediateBatched] = [
179
+ ResultIntermediateBatched(prompts=text, images=image) for text, image in batched_prompts
180
+ ]
181
+
182
+ # Different pipelines use different names for the FPS parameter
183
+ gen_signature = inspect.signature(pipeline.__call__)
184
+ fps_param_name_candidates = list(filter(lambda x: "fps" in x, gen_signature.parameters))
185
+ if not fps_param_name_candidates:
186
+ raise ValueError("No parameter with 'fps' in its name found in the pipeline's signature.")
187
+ if len(fps_param_name_candidates) > 1:
188
+ raise ValueError("Multiple parameters with 'fps' in their name found in the pipeline's signature.")
189
+ fps_param_name = fps_param_name_candidates[0]
190
+
191
+ torch.cuda.reset_peak_memory_stats(device="cuda:0")
192
+ zeus_monitor.begin_window("benchmark", sync_cuda=False)
193
+
194
+ # Build common parameter dict for all batches
195
+ params: dict[str, Any] = dict(
196
+ num_frames=args.num_frames,
197
+ num_inference_steps=args.num_inference_steps,
198
+ generator=rng,
199
+ )
200
+ params[fps_param_name] = args.fps
201
+ if args.height is not None:
202
+ params["height"] = args.height
203
+ if args.width is not None:
204
+ params["width"] = args.width
205
+
206
+ for ind, intermediate in enumerate(intermediates):
207
+ print(f"Batch {ind + 1}/{len(intermediates)}")
208
+
209
+ params["image"] = intermediate.images
210
+ if args.add_text_prompt:
211
+ params["prompt"] = intermediate.prompts
212
+
213
+ zeus_monitor.begin_window("batch", sync_cuda=False)
214
+ frames = pipeline(**params).frames
215
+ batch_measurements = zeus_monitor.end_window("batch", sync_cuda=False)
216
+
217
+ intermediate.frames = frames
218
+ intermediate.batch_latency = batch_measurements.time
219
+ intermediate.batch_energy = batch_measurements.total_energy
220
+
221
+ measurements = zeus_monitor.end_window("benchmark", sync_cuda=False)
222
+ peak_memory = torch.cuda.max_memory_allocated(device="cuda:0")
223
+
224
+ results: list[Result] = []
225
+ ind = 0
226
+ for intermediate in intermediates:
227
+ # Some pipelines just return a giant numpy array for all frames.
228
+ # In that case, scale frames to uint8 [0, 255] and convert to PIL.Image
229
+ if isinstance(intermediate.frames, np.ndarray):
230
+ frames = []
231
+ for video in intermediate.frames:
232
+ frames.append(
233
+ [Image.fromarray((frame * 255).astype(np.uint8)) for frame in video]
234
+ )
235
+ intermediate.frames = frames
236
+
237
+ for frames, prompt in zip(intermediate.frames, intermediate.prompts, strict=True):
238
+ if ind % args.save_every == 0:
239
+ video_path = str(video_dir / f"{prompt[:200]}.gif")
240
+ export_to_gif(frames, video_path, fps=args.fps)
241
+ else:
242
+ video_path = None
243
+
244
+ results.append(
245
+ Result(
246
+ batch_latency=intermediate.batch_latency,
247
+ sample_energy=intermediate.batch_energy / len(intermediate.prompts),
248
+ prompt=prompt,
249
+ video_path=video_path,
250
+ )
251
+ )
252
+ ind += 1
253
+
254
+ final_results = Results(
255
+ model=args.model,
256
+ num_parameters=count_parameters(pipeline),
257
+ gpu_model=gpu_model,
258
+ num_infernece_steps=args.num_inference_steps,
259
+ num_frames=args.num_frames,
260
+ power_limit=args.power_limit,
261
+ batch_size=args.batch_size,
262
+ num_prompts=num_prompts,
263
+ total_runtime=measurements.time,
264
+ total_energy=measurements.total_energy,
265
+ average_batch_latency=measurements.time / len(batched_prompts),
266
+ average_images_per_second=num_prompts / measurements.time,
267
+ average_batch_energy=measurements.total_energy / len(batched_prompts),
268
+ average_power_consumption=measurements.total_energy / measurements.time,
269
+ peak_memory=peak_memory,
270
+ results=results,
271
+ )
272
+
273
+ with open(f"{benchmark_name}+results.json", "w") as f:
274
+ f.write(json.dumps(asdict(final_results), indent=2))
275
+ print("Benchmark results written to", f"{benchmark_name}+results.json")
276
+
277
+ print("Benchmark results:")
278
+ pprint(final_results)
279
+
280
+
281
+ if __name__ == "__main__":
282
+ parser = argparse.ArgumentParser()
283
+ parser.add_argument("--model", type=str, required=True, help="The model to benchmark.")
284
+ parser.add_argument("--dataset-path", type=str, required=True, help="Path to the dataset to use.")
285
+ parser.add_argument("--add-text-prompt", action="store_true", help="Input text prompt alongside image.")
286
+ parser.add_argument("--result-root", type=str, help="The root directory to save results to.")
287
+ parser.add_argument("--batch-size", type=int, default=1, help="The size of each batch of prompts.")
288
+ parser.add_argument("--power-limit", type=int, default=300, help="The power limit to set for the GPU in Watts.")
289
+ parser.add_argument("--num-inference-steps", type=int, default=50, help="The number of denoising steps.")
290
+ parser.add_argument("--num-frames", type=int, default=1, help="The number of frames to generate.")
291
+ parser.add_argument("--fps", type=int, default=16, help="Frames per second for micro-conditioning.")
292
+ parser.add_argument("--height", type=int, help="Height of the generated video.")
293
+ parser.add_argument("--width", type=int, help="Width of the generated video.")
294
+ parser.add_argument("--num-batches", type=int, default=None, help="The number of batches to use from the dataset.")
295
+ parser.add_argument("--save-every", type=int, default=10, help="Save generations to file every N prompts.")
296
+ parser.add_argument("--seed", type=int, default=0, help="The seed to use for the RNG.")
297
+ parser.add_argument("--huggingface-token", type=str, help="The HuggingFace token to use.")
298
+ args = parser.parse_args()
299
+
300
+ benchmark(args)
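
The aggregate numbers stored in `Results` all derive from two measurements of the whole benchmark window (elapsed time and total energy) plus the batch count and batch size. The sketch below restates that arithmetic on its own; the `summarize` helper is hypothetical, the input numbers are invented, and `energy_per_video` is the quantity that `aggregate_leaderboard_data.py` computes later rather than a field of `Results`.

```python
def summarize(
    total_time: float,
    total_energy: float,
    num_batches: int,
    batch_size: int,
) -> dict[str, float]:
    """Derive per-batch and per-video averages from whole-run totals.

    Time is in seconds and energy in joules, so power comes out in watts.
    """
    num_videos = num_batches * batch_size
    return {
        "average_batch_latency": total_time / num_batches,
        # The original field keeps the name "images" although each
        # generation here is a video.
        "average_images_per_second": num_videos / total_time,
        "average_batch_energy": total_energy / num_batches,
        "average_power_consumption": total_energy / total_time,
        "energy_per_video": total_energy / num_videos,
    }


# Invented numbers: 10 batches of 2 videos, 20 s and 6000 J in total.
print(summarize(total_time=20.0, total_energy=6000.0, num_batches=10, batch_size=2))
```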
benchmark/diffusion/image-to-video/scripts/benchmark_one_model.py ADDED
@@ -0,0 +1,84 @@
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ import argparse
5
+ import subprocess
6
+
7
+
8
+ def print_and_write(outfile, line: str, flush: bool = False):
9
+ print(line, end="", flush=flush)
10
+ outfile.write(line)
11
+ if flush:
12
+ outfile.flush()
13
+
14
+
15
+ def main(args: argparse.Namespace) -> None:
16
+ assert len(args.gpu_ids) == 1
17
+
18
+ hf_token = os.environ["HF_TOKEN"]
19
+
20
+ if args.model.startswith("models/"):
21
+ outdir = f"{args.result_root}/{args.model[len('models/'):]}"
22
+ else:
23
+ outdir = f"{args.result_root}/{args.model}"
24
+ os.makedirs(outdir, exist_ok=True)
25
+
26
+ outfile = open(f"{outdir}/gpus{''.join(args.gpu_ids)}.out.txt", "w")
27
+
28
+ print_and_write(outfile, f"Benchmarking {args.model}\n")
29
+ print_and_write(outfile, f"Batch sizes: {args.batch_sizes}\n")
30
+ print_and_write(outfile, f"Power limits: {args.power_limits}\n")
31
+
32
+ for batch_size in args.batch_sizes:
33
+ for power_limit in args.power_limits:
34
+ print_and_write(outfile, f"{batch_size=}, {power_limit=}\n", flush=True)
35
+ with subprocess.Popen(
36
+ args=[
37
+ "docker", "run",
38
+ "--gpus", '"device=' + ','.join(args.gpu_ids) + '"',
39
+ "--cap-add", "SYS_ADMIN",
40
+ "--name", f"leaderboard-i2v-{''.join(args.gpu_ids)}",
41
+ "--rm",
42
+ "-v", "/data/leaderboard/hfcache:/root/.cache/huggingface",
43
+ "-v", f"{os.getcwd()}:/workspace/image-to-video",
44
+ "mlenergy/leaderboard:diffusion-i2v",
45
+ "--dataset-path", args.dataset_path,
46
+ "--result-root", args.result_root,
47
+ "--batch-size", batch_size,
48
+ "--num-batches", "10",
49
+ "--power-limit", power_limit,
50
+ "--model", args.model,
51
+ "--huggingface-token", hf_token,
52
+ "--num-frames", args.num_frames,
53
+ "--num-inference-steps", args.num_inference_steps,
54
+ ] + (["--add-text-prompt"] if args.add_text_prompt else []),
55
+ stdout=subprocess.PIPE,
56
+ stderr=subprocess.STDOUT,
57
+ text=True,
58
+ ) as proc:
59
+ if proc.stdout:
60
+ i = 0
61
+ for line in proc.stdout:
62
+ print_and_write(outfile, line, flush=i % 50 == 0)
63
+ i += 1
64
+
65
+ # If proc exited with non-zero status, it's probably an OOM.
66
+ # Move on to the next batch size.
67
+ if proc.returncode != 0:
68
+ break
69
+
70
+
71
+
72
+ if __name__ == "__main__":
73
+ parser = argparse.ArgumentParser()
74
+ parser.add_argument("--model", type=str, help="ID of the model to benchmark")
75
+ parser.add_argument("--result-root", type=str, help="Root directory to store the results")
76
+ parser.add_argument("--gpu-ids", type=str, nargs="+", help="GPU IDs to use")
77
+ parser.add_argument("--batch-sizes", type=str, nargs="+", default=["8", "4", "2", "1"], help="Batch sizes to benchmark")
78
+ parser.add_argument("--power-limits", type=str, nargs="+", default=["400", "300", "200"], help="Power limits to benchmark")
79
+ parser.add_argument("--num-frames", type=str, help="Number of frames to generate")
80
+ parser.add_argument("--num-inference-steps", type=str, help="Number of denoising steps")
81
+ parser.add_argument("--add-text-prompt", action="store_true", help="Input text prompt alongside image.")
82
+ parser.add_argument("--dataset-path", type=str, help="Path to the dataset JSON file.")
83
+ args = parser.parse_args()
84
+ main(args)
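Note on the sweep in `benchmark_one_model.py`: the nested loop tries batch sizes in the order given and, for each, every power limit, abandoning a batch size as soon as one run exits with a non-zero status. The sketch below isolates only that control flow. `sweep`, `run_one`, and `fake_run` are hypothetical names introduced for illustration; `fake_run` stands in for the `docker run` call and is not part of the commit.

```python
from __future__ import annotations

from typing import Callable


def sweep(
    batch_sizes: list[str],
    power_limits: list[str],
    run_one: Callable[[str, str], int],
) -> list[tuple[str, str]]:
    """Try (batch size, power limit) pairs and return the pairs that were attempted."""
    attempted = []
    for batch_size in batch_sizes:
        for power_limit in power_limits:
            attempted.append((batch_size, power_limit))
            # A non-zero return code is treated as a likely OOM: skip the
            # remaining power limits and move on to the next batch size.
            if run_one(batch_size, power_limit) != 0:
                break
    return attempted


def fake_run(batch_size: str, power_limit: str) -> int:
    # Hypothetical stand-in for `docker run`: pretend batch size 8 always fails.
    return 1 if batch_size == "8" else 0


attempted = sweep(["8", "4"], ["400", "300"], fake_run)
print(attempted)
```

With the default descending batch sizes, a failure at the largest batch size costs one run rather than one run per power limit.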
benchmark/diffusion/image-to-video/sharegpt4video/.gitignore ADDED
@@ -0,0 +1 @@
+ first_frame/
benchmark/diffusion/image-to-video/sharegpt4video/README.md ADDED
@@ -0,0 +1,32 @@
+ # ShareGPT4Video dataset
+
+ For the image-to-video task, we sample 100 video-caption pairs from the ShareGPT4Video dataset to feed to the diffusion model to generate videos.
+
+ ## Filtering the dataset
+
+ Download the dataset with captions and video paths.
+
+ ```sh
+ wget https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video/resolve/main/sharegpt4video_40k.jsonl
+ ```
+
+ Sample video-caption pairs.
+ You can adjust the `NUM_SAMPLES` variable in the script to change the size of the generated dataset. By default, 100 pairs will be sampled and saved as `sharegpt4video_100.json`.
+
+ ```sh
+ python sample.py
+ ```
+
+ Download and unzip the chunk of videos.
+
+ ```sh
+ wget https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video/resolve/main/zip_folder/panda/panda_videos_1.zip
+ unzip panda_videos_1.zip -d panda
+ ```
+
+ Extract the first frame of the video and save under `first_frame/`.
+
+ ```sh
+ pip install opencv-python
+ python extract_first_frame.py
+ ```
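Note on consuming the prepared dataset: going by `sample.py` and `extract_first_frame.py` in this commit, the sampled JSON holds two parallel lists (`caption`, `video_id`) and each first frame is saved as `first_frame/<video_id>.jpg`. A minimal sketch of pairing them back up follows. `load_pairs` is a hypothetical helper, and the toy IDs and captions are made up for the example.

```python
from __future__ import annotations

import json
import os
import tempfile


def load_pairs(dataset_path: str, frame_dir: str = "first_frame") -> list[tuple[str, str]]:
    """Return (first-frame path, caption) pairs from a file shaped like sample.py's output."""
    with open(dataset_path) as f:
        data = json.load(f)
    # The file stores two parallel lists, so zip them back into pairs.
    return [
        (os.path.join(frame_dir, f"{video_id}.jpg"), caption)
        for video_id, caption in zip(data["video_id"], data["caption"])
    ]


# Tiny stand-in dataset with made-up IDs and captions, just to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json")
    with open(path, "w") as f:
        json.dump({"caption": ["a cat", "a dog"], "video_id": ["vid0", "vid1"]}, f)
    pairs = load_pairs(path)

print(pairs)
```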
benchmark/diffusion/image-to-video/sharegpt4video/extract_first_frame.py ADDED
@@ -0,0 +1,21 @@
+ import os
+ import json
+
+ import cv2
+
+ DATASET_PATH = "sharegpt4video_700.json"
+
+
+ def main() -> None:
+     os.makedirs("first_frame", exist_ok=True)
+
+     for video_id in json.load(open(DATASET_PATH))["video_id"]:
+         cap = cv2.VideoCapture(f"panda/{video_id}.mp4")
+         ret, frame = cap.read()
+         assert ret, f"failed to read first frame of video {video_id}"
+         cv2.imwrite(f"first_frame/{video_id}.jpg", frame)
+         cap.release()
+
+
+ if __name__ == "__main__":
+     main()
benchmark/diffusion/image-to-video/sharegpt4video/sample.py ADDED
@@ -0,0 +1,29 @@
+ import json
+ import random
+
+ DATASET_PATH = "sharegpt4video_40k.jsonl"
+ VIDEO_SHARD_NAME = "panda_videos_1.zip"
+ NUM_SAMPLES = 700
+ SEED = 1
+
+
+ def main() -> None:
+     dataset = [json.loads(line) for line in open(DATASET_PATH) if VIDEO_SHARD_NAME in line]
+     random.seed(SEED)
+     random.shuffle(dataset)
+
+     sampled = dict(caption=[], video_id=[])
+     for sample in dataset[:NUM_SAMPLES]:
+         assert sample["zip_folder"] == VIDEO_SHARD_NAME, f"sample from wrong video shard: {sample}"
+         whole_video_caption = next(
+             (c for c in sample["captions"] if c["idx"] == "-1"), None
+         )
+         assert whole_video_caption is not None, f"whole video caption not found for sample: {sample}"
+         sampled["caption"].append(whole_video_caption["content"])
+         sampled["video_id"].append(sample["video_id"])
+
+     json.dump(sampled, open(f"sharegpt4video_{NUM_SAMPLES}.json", "w"))
+
+
+ if __name__ == "__main__":
+     main()
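Note on `sample.py`: two details carry the logic, namely picking the caption whose `idx` is `"-1"` (treated by the script as the whole-video caption) and shuffling with a fixed seed so the sample is reproducible. The sketch below illustrates both on a made-up record. The record contents are hypothetical, and the sketch uses a local `random.Random(seed)` instance instead of the script's global `random.seed`, which is a deliberate swap to avoid touching global state.

```python
import random

# Hypothetical record with the fields sample.py reads; the values are made up.
record = {
    "video_id": "example",
    "captions": [
        {"idx": "0", "content": "clip-level caption"},
        {"idx": "-1", "content": "whole-video caption"},
    ],
}


def whole_video_caption(sample):
    """Return the content of the caption with idx "-1", or None if there is none."""
    return next((c["content"] for c in sample["captions"] if c["idx"] == "-1"), None)


def seeded_shuffle(items, seed):
    """Return a shuffled copy; the same seed gives the same order on every call."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items


print(whole_video_caption(record))
print(seeded_shuffle(range(10), 1) == seeded_shuffle(range(10), 1))
```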
benchmark/diffusion/image-to-video/sharegpt4video/sharegpt4video_100.json ADDED
The diff for this file is too large to render. See raw diff
 
benchmark/diffusion/text-to-image/.dockerignore ADDED
@@ -0,0 +1 @@
+ README.md
benchmark/diffusion/text-to-image/Dockerfile ADDED
@@ -0,0 +1,20 @@
+ FROM nvidia/cuda:12.1.0-base-ubuntu22.04
+
+ # Basic installs
+ ARG DEBIAN_FRONTEND=noninteractive
+ ENV TZ='America/Detroit'
+ RUN apt-get update -qq \
+     && apt-get -y --no-install-recommends install python3-pip \
+     && apt-get clean all \
+     && rm -r /var/lib/apt/lists/*
+
+ # HuggingFace cache dir
+ ENV HF_HOME=/root/.cache/huggingface
+
+ # Copy over benchmark suite and install dependencies
+ ADD . /workspace/text-to-image
+ WORKDIR /workspace/text-to-image
+ RUN pip install -r requirements.txt
+
+ # Benchmark script to run
+ ENTRYPOINT ["python3", "scripts/benchmark_one_datapoint.py"]
benchmark/diffusion/text-to-image/README.md ADDED
@@ -0,0 +1,48 @@
+ # Diffusion model (Text to Image)
+
+ This benchmark suite benchmarks diffusion models with the text-to-image task.
+
+ ## Setup
+
+ ### Docker images
+
+ ```sh
+ docker build -t mlenergy/leaderboard:diffusion-t2i .
+ ```
+
+ ### HuggingFace cache directory
+
+ The scripts assume the HuggingFace cache directory will be under `/data/leaderboard/hfcache` on the node that runs this benchmark.
+
+ ## Benchmarking
+
+ ### Obtaining one datapoint
+
+ The Docker image we've built runs `python3 scripts/benchmark_one_datapoint.py` as its `ENTRYPOINT`.
+
+ ```sh
+ docker run \
+   --gpus '"device=0"' \
+   --cap-add SYS_ADMIN \
+   -v /data/leaderboard/hfcache:/root/.cache/huggingface \
+   -v $(pwd):/workspace/text-to-image \
+   mlenergy/leaderboard:diffusion-t2i \
+   --result-root results \
+   --batch-size 2 \
+   --power-limit 300 \
+   --image-save-every 5 \
+   --num-inference-steps 25 \
+   --model stabilityai/stable-diffusion-2-1 \
+   --huggingface-token $HF_TOKEN
+ ```
+
+ ### Obtaining all datapoints for a single model
+
+ Export your HuggingFace hub token as environment variable `$HF_TOKEN`.
+
+ Run `scripts/benchmark_one_model.py`.
+
+ ### Running the entire suite with Pegasus
+
+ You can use [`pegasus`](https://github.com/jaywonchung/pegasus) to run the entire benchmark suite.
+ Queue and host files are in [`./pegasus`](./pegasus).
benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/kwargs.json ADDED
@@ -0,0 +1,3 @@
+ {
+     "torch_dtype": "torch.float16"
+ }
benchmark/diffusion/text-to-image/models/SimianLuo/LCM_Dreamshaper_v7/revision.txt ADDED
@@ -0,0 +1 @@
+ 4721097975058205c4edcdece2cc574b7dd7bc04
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/kwargs.json ADDED
@@ -0,0 +1,3 @@
+ {
+     "torch_dtype": "torch.float16"
+ }
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-2-2-decoder/revision.txt ADDED
@@ -0,0 +1 @@
+ main
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/kwargs.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "torch_dtype": "torch.float16",
+     "variant": "fp16"
+ }
benchmark/diffusion/text-to-image/models/kandinsky-community/kandinsky-3/revision.txt ADDED
@@ -0,0 +1 @@
+ bf79e6c219da8a94abb50235fdc4567eb8fb4632
benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/kwargs.json ADDED
@@ -0,0 +1,3 @@
+ {
+     "torch_dtype": "torch.float16"
+ }
benchmark/diffusion/text-to-image/models/prompthero/openjourney-v4/revision.txt ADDED
@@ -0,0 +1 @@
+ b195ed2d503f3eb29637050a886d77bd81d35f0e
benchmark/diffusion/text-to-image/models/segmind/SSD-1B/kwargs.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "torch_dtype": "torch.float16",
+     "variant": "fp16"
+ }
benchmark/diffusion/text-to-image/models/segmind/SSD-1B/revision.txt ADDED
@@ -0,0 +1 @@
+ 60987f37e94cd59c36b1cba832b9f97b57395a10
benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/kwargs.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "torch_dtype": "torch.float16",
+     "variant": "fp16"
+ }
benchmark/diffusion/text-to-image/models/stabilityai/sdxl-turbo/revision.txt ADDED
@@ -0,0 +1 @@
+ f4b0486b498f84668e828044de1d0c8ba486e05b
benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/kwargs.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "torch_dtype": "torch.bfloat16",
+     "variant": "bf16"
+ }
benchmark/diffusion/text-to-image/models/stabilityai/stable-cascade/revision.txt ADDED
@@ -0,0 +1 @@
+ main
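Note on the per-model files: each model directory in this commit pairs a `kwargs.json` (loader keyword arguments, with the dtype written as a string such as `"torch.float16"`) with a `revision.txt` (a commit hash or `main`). How `benchmark_one_datapoint.py` consumes them is not shown in this part of the diff, so the following is only a sketch of one way to read such a pair. `load_model_spec` is a hypothetical helper, and the dtype is deliberately left as a string here.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def load_model_spec(model_dir: str | Path) -> tuple[dict, str]:
    """Read kwargs.json and revision.txt from one model directory."""
    model_dir = Path(model_dir)
    kwargs = json.loads((model_dir / "kwargs.json").read_text())
    # strip() drops the trailing newline that a one-line text file usually has.
    revision = (model_dir / "revision.txt").read_text().strip()
    return kwargs, revision


# Recreate one directory from the diff in a temp folder to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "stabilityai" / "stable-cascade"
    model_dir.mkdir(parents=True)
    (model_dir / "kwargs.json").write_text('{"torch_dtype": "torch.bfloat16", "variant": "bf16"}')
    (model_dir / "revision.txt").write_text("main\n")
    kwargs, revision = load_model_spec(model_dir)

print(kwargs, revision)
```

A caller that needs a real dtype object would still have to map the string to the corresponding `torch` attribute; that step is an assumption about the benchmark script and is omitted here.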
benchmark/diffusion/text-to-image/models/stabilityai/stable-diffusion-2-1/kwargs.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "torch_dtype": "torch.float16",
+     "variant": "fp16"
+ }