import streamlit as st
import pandas as pd
from PIL import Image
import base64
from io import BytesIO
import numpy as np
# ─── Page config ──────────────────────────────────────────────────────────────
st.set_page_config(page_title="ManyICLBench Leaderboard", layout="wide")
logo_image = Image.open("src/manyicl_logo.png")
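# Encode a PIL image as a base64 PNG string so the logo can be embedded inline
# (as a data URI) in the HTML header rendered below.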
def encode_image(image):
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
img_data = encode_image(logo_image)
st.markdown(
f"""
<div class="logo-container" style="display:flex; justify-content: center; align-items: center; gap: 20px;">
<img src="data:image/png;base64,{img_data}" style="width:50%; max-width:700px;"/>
</div>
""",
unsafe_allow_html=True
)
st.markdown(
'''
<div class="header">
<br/>
<p style="font-size:22px;">
ManyICLBench: Benchmarking Large Language Models' Long Context Capabilities with Many-Shot In-Context Learning
</p>
<p style="font-size:20px;">
📄 <a href="https://arxiv.org/abs/2411.07130">Paper</a> | 💻 <a href="https://github.com/launchnlp/ManyICLBench">GitHub</a> | 🤗 <a href="https://huggingface.co/datasets/launch/ManyICLBench/">Dataset</a> |
⚙️ <strong>Version</strong>: <strong>V1</strong> | <strong># Models</strong>: 12 | Updated: <strong>June 2025</strong>
</p>
</div>
''',
unsafe_allow_html=True
)
# ─── Load data ────────────────────────────────────────────────────────────────
@st.cache_data
def load_data(path):
    df = pd.read_csv(path)
    if 'Task' in df.columns:  # Rename Task to Models for consistency
        df = df.rename(columns={'Task': 'Models'})
    score_cols = ['1000', '2000', '4000', '8000', '16000', '32000', '64000', '128000']
    # Keep existing avg and avg.L columns
    # Compute rank per column (1 = best)
    for col in score_cols + ['avg', 'avg.L']:
        df[f"{col}_rank"] = df[col].rank(ascending=False, method="min").astype(int)
    return df
# Add evaluation metrics explanation
st.markdown("## π Evaluation Metrics")
st.markdown("""
- **Per-length Performance**: Performance at different context lengths (1K to 128K tokens)
- **avg**: Average performance across all context lengths
- **avg.L**: Average performance on longer contexts (>32K tokens)
Higher scores indicate better performance, with all metrics reported as percentages (0-100).
Red indicates performance improvement compared to 1k. Blue indicates performance downgrade compared to 1k. A darker color means higher improvement or downgrade.
""")
def display_table(df, cols):
    # Render the leaderboard as a raw HTML table so each score cell's background color
    # can encode the change relative to the 1K baseline. The 1K, avg, and avg.L columns
    # are left uncolored; the best avg and avg.L values are bolded.
    # Precompute max values for avg and avg.L
    max_avg = df['avg'].max()
    max_avg_l = df['avg.L'].max()
    # Build raw HTML table
    html = "<table style='border-collapse:collapse; width:100%; font-size:14px;'>"
    # Format header labels
    html += "<tr>"
    for col in cols:
        style = "padding:6px;"
        label = ""
        if col in ['1000', '2000', '4000', '8000', '16000', '32000', '64000', '128000']:
            # Convert to K format
            val = int(col) // 1000
            label = f"{val}K"
        else:
            label = col.title()  # Capitalize first letter
        if col in ["Model", "Models"]:
            style += " width: 15%;"
        html += f"<th style='{style}'>{label}</th>"
    html += "</tr>"
    # Rows
    for _, row in df.iterrows():
        html += "<tr>"
        for col in cols:
            val = row[col]
            if col in ["Model", "Models"]:
                html += f"<td style='padding:6px; text-align:left; width: 15%;'>{val}</td>"
            else:
                # Format value
                val_str = f"{val:.1f}" if isinstance(val, (float, np.float64)) else val
                # Determine if this column should be colored
                if col in ['1000', 'avg', 'avg.L']:
                    # No coloring for these columns, but bold the max values
                    bold = ""
                    if (col == 'avg' and val == max_avg) or \
                       (col == 'avg.L' and val == max_avg_l):
                        bold = "font-weight:bold;"
                    style = f"padding:6px; border: 1px solid #444; {bold}"
                else:
                    # Calculate relative improvement from the 1K baseline
                    baseline = float(row['1000'])
                    if baseline != 0:
                        relative_change = float(val) / baseline - 1  # -1 to center at 0
                        # Clamp the change to a reasonable range for color scaling
                        clamped_change = max(min(relative_change, 1.5), -0.5)
                        # Normalize to the 0-1 range where 0.5 is the neutral point (no change)
                        if clamped_change < 0:
                            # Map [-0.5, 0) to [0, 0.5)
                            norm = clamped_change + 0.5
                        else:
                            # Map [0, 1.5] to [0.5, 1.0]
                            norm = 0.5 + (clamped_change / 3.0)
                        # Color interpolation:
                        #   norm = 0   -> blue  (100, 149, 237)
                        #   norm = 0.5 -> white (255, 255, 255)
                        #   norm = 1   -> red   (220, 20, 60)
                        if norm < 0.5:
                            # Interpolate from blue to white
                            factor = norm * 2  # 0 to 1
                            r = int(100 + (255 - 100) * factor)
                            g = int(149 + (255 - 149) * factor)
                            b = int(237 + (255 - 237) * factor)
                        else:
                            # Interpolate from white to red
                            factor = (norm - 0.5) * 2  # 0 to 1
                            r = int(255 - (255 - 220) * factor)
                            g = int(255 - (255 - 20) * factor)
                            b = int(255 - (255 - 60) * factor)
                        style = f"background-color:rgba({r},{g},{b},0.8); padding:6px; border: 1px solid #444;"
                    else:
                        style = "padding:6px; border: 1px solid #444;"
                html += f"<td style='{style}'>{val_str}</td>"
        html += "</tr>"
    html += "</table>"
    st.markdown(html, unsafe_allow_html=True)
# Display Retrieval table
st.markdown("## SSL Tasks")
st.markdown("Similar-sample Learning tasks require models to learn from a small set of similar demostration, therefore evaluating models' ability to retrieve similar samples.")
df = load_data("src/Retrieval_full_200.csv")
cols = ["Models", "1000", "2000", "4000", "8000", "16000", "32000", "64000", "128000", "avg", "avg.L"]
display_table(df, cols)
# Display Global Context Understanding table
st.markdown("## ASL Tasks")
st.markdown("All-Sample Learning tasks require models to learn from all the demostrations, therefore evaluating models' ability to understand the global context.")
df = load_data("src/Global Context Understanding_full_200.csv")
cols = ["Models", "1000", "2000", "4000", "8000", "16000", "32000", "64000", "128000", "avg", "avg.L"]
display_table(df, cols)
st.markdown("## π Abstract")
st.write(
"""
Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs?
We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends.
Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding.
We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (**SSL**): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (**ASL**): tasks that necessitate a deeper comprehension of all examples in the prompt.
Lastly, we introduce a new many-shot ICL benchmark built on existing ICL tasks, ManyICLBench, to characterize models' abilities on both fronts, and we benchmark 12 LCLMs using it. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
"""
)
st.markdown("## Dataset Details")
st.markdown("""
| **Dataset** | **Task Category** | **Avg. Tokens / Shot** | **Max # of Shots** | **# of Tasks** |
| :--- | :--- | :--- | :--- | :--- |
| BANKING77 | Intent Classification | 13.13 | 5386 | 1 |
| GoEmotions | Emotion Classification | 15.85 | 5480 | 1 |
| DialogRE | Relation Classification | 233.27 | 395 | 1 |
| TREC | Question Classification | 11.25 | 6272 | 1 |
| CLINC150 | Intent Classification | 8.95 | 7252 | 1 |
| MATH | Math reasoning | [185.52, 407.90] | [286, 653] | 4 |
| GSM8K | Math reasoning | 55.78 | 784 | 1 |
| BBH | Reasoning | [48.27, 243.01] | [406, 2660] | 4 |
| GPQA | MQ - Science | [183.55, 367.02] | [314, 580] | 1 |
| ARC | MQ - Science | [61.54, 61.54] | [1997, 2301] | 2 |
| XLSUM | News Summarization | 621.32 | 220 | 1 |

The GPT-4o tokenizer is used to calculate the # of tokens. Max # of shots is the maximum number of shots that can fit into the 128K context window. For datasets with multiple subtasks, we list the range for each value.

**ASL Tasks**: BANKING77, DialogRE, TREC, CLINC150, and BBH-geometric_shapes

**SSL Tasks**: MATH tasks, GSM8K, XLSUM, GPQA_cot, ARC_challenge, BBH-dyck_languages, BBH-salient_translation_error_detection, and BBH-word_sorting.
""")
st.markdown('## 🤗 Submit Your Model')
st.write(
"""
👉 You can submit your model through the following link: [https://forms.gle/eWjzPDusDJSbXCCT7](https://forms.gle/eWjzPDusDJSbXCCT7)
"""
)
st.markdown("## π Citation")
st.write("""
```bibtex
@article{zou2025manyshotincontextlearninglongcontext,
title={On Many-Shot In-Context Learning for Long-Context Evaluation},
author={Kaijian Zou and Muhammad Khalifa and Lu Wang},
journal={arXiv preprint arXiv:2411.07130},
year={2025}
}
```
""")