import streamlit as st
import pandas as pd
from PIL import Image
import base64
from io import BytesIO  
import numpy as np

# ─── Page config ──────────────────────────────────────────────────────────────
st.set_page_config(page_title="ManyICLBench Leaderboard", layout="wide")
logo_image = Image.open("src/manyicl_logo.png")


def encode_image(image):
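    """Encode a PIL image as a base64-encoded PNG string for inline HTML embedding."""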
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

img_data = encode_image(logo_image)

st.markdown(
    f"""
    <div class="logo-container" style="display:flex; justify-content: center; align-items: center; gap: 20px;">
        <img src="data:image/png;base64,{img_data}" style="width:50%; max-width:700px;"/>
    </div>
    """,
    unsafe_allow_html=True
)

st.markdown(
    '''
    <div class="header">
        <br/>
        <p style="font-size:22px;">
        ManyICLBench: Benchmarking Large Language Models' Long Context Capabilities with Many-Shot In-Context Learning
        </p>
        <p style="font-size:20px;">
              📑 <a href="https://arxiv.org/abs/2411.07130">Paper</a> | 💻 <a href="https://github.com/launchnlp/ManyICLBench">GitHub</a> | 🤗 <a href="https://huggingface.co/datasets/launch/ManyICLBench/">Dataset</a> |
               ⚙️ <strong>Version</strong>: <strong>V1</strong> | <strong># Models</strong>: 12 | Updated: <strong>June 2025</strong>
        </p>
    </div>
    ''',
    unsafe_allow_html=True
)

# ─── Load data ────────────────────────────────────────────────────────────────
@st.cache_data
def load_data(path):
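    """Load a leaderboard CSV, rename the Task column to Models, and add a rank column (1 = best) for each score column."""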
    df = pd.read_csv(path)
    if 'Task' in df.columns:  # Rename Task to Models for consistency
        df = df.rename(columns={'Task': 'Models'})
    score_cols = ['1000', '2000', '4000', '8000', '16000', '32000', '64000', '128000']
    # Keep existing avg and avg.L columns
    # Compute rank per column (1 = best)
    for col in score_cols + ['avg', 'avg.L']:
        df[f"{col}_rank"] = df[col].rank(ascending=False, method="min").astype(int)
    return df

# Add evaluation metrics explanation
st.markdown("## πŸ“Š Evaluation Metrics")
st.markdown("""
- **Per-length Performance**: Performance at different context lengths (1K to 128K tokens)
- **avg**: Average performance across all context lengths
- **avg.L**: Average performance on longer contexts (>32K tokens)

Higher scores indicate better performance, with all metrics reported as percentages (0-100).

Red indicates a performance improvement relative to the 1K baseline; blue indicates a performance drop relative to 1K. Darker colors correspond to larger changes.
""")

def display_table(df, cols):
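    """Render the leaderboard as a raw HTML table: score cells are colored by their
    change relative to the 1K baseline (blue = drop, red = gain), and the best
    avg / avg.L values are shown in bold."""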
    # Precompute max values for Avg and Avg.L
    max_avg = df['avg'].max()
    max_avg_l = df['avg.L'].max()

    # Build raw HTML table
    html = "<table style='border-collapse:collapse; width:100%; font-size:14px;'>"
    
    # Format header labels
    html += "<tr>"
    for col in cols:
        style = "padding:6px;"
        label = ""
        if col in ['1000', '2000', '4000', '8000', '16000', '32000', '64000', '128000']:
            # Convert to K format
            val = int(col) // 1000
            label = f"{val}K"
        else:
            label = col.title()  # Capitalize first letter
        
        if col in ["Model", "Models"]:
            style += " width: 15%;"

        html += f"<th style='{style}'>{label}</th>"
    html += "</tr>"
    
    # rows
    for _, row in df.iterrows():
        html += "<tr>"
        for col in cols:
            val = row[col]
            if col in ["Model", "Models"]:
                html += f"<td style='padding:6px; text-align:left; width: 15%;'>{val}</td>"
            else:
                # Format value
                val_str = f"{val:.1f}" if isinstance(val, (float, np.float64)) else val
                
                # Determine if this column should be colored
                if col in ['1000', 'avg', 'avg.L']:
                    # No coloring for these columns, but add bolding for max values
                    bold = ""
                    if (col == 'avg' and val == max_avg) or \
                       (col == 'avg.L' and val == max_avg_l):
                        bold = "font-weight:bold;"
                    style = f"padding:6px; border: 1px solid #444; {bold}"
                else:
                    # Calculate relative improvement from 1k baseline
                    baseline = float(row['1000'])
                    if baseline != 0:
                        relative_change = float(val) / baseline - 1  # -1 to center at 0
                        # Clamp the change to a reasonable range for color scaling
                        clamped_change = max(min(relative_change, 1.5), -0.5)
                        
                        # Normalize to 0-1 range where 0.5 is the neutral point (no change)
                        if clamped_change < 0:
                            # Map [-0.5, 0) to [0, 0.5)
                            norm = clamped_change + 0.5
                        else:
                            # Map [0, 1.5] to [0.5, 1.0]
                            norm = 0.5 + (clamped_change / 3.0)

                        # Color interpolation:
                        # norm = 0   -> blue  (100, 149, 237)
                        # norm = 0.5 -> white (255, 255, 255)
                        # norm = 1   -> red   (220, 20, 60)
                        
                        if norm < 0.5:  # Blue to White
                            # Interpolate from blue to white
                            factor = norm * 2  # 0 to 1
                            r = int(100 + (255 - 100) * factor)
                            g = int(149 + (255 - 149) * factor)
                            b = int(237 + (255 - 237) * factor)
                        else:  # White to Red
                            # Interpolate from white to red
                            factor = (norm - 0.5) * 2  # 0 to 1
                            r = int(255 - (255 - 220) * factor)
                            g = int(255 - (255 - 20) * factor)
                            b = int(255 - (255 - 60) * factor)
                        
                        style = f"background-color:rgba({r},{g},{b},0.8); padding:6px; border: 1px solid #444;"
                    else:
                        style = "padding:6px; border: 1px solid #444;"
                
                html += f"<td style='{style}'>{val_str}</td>"
        html += "</tr>"
    html += "</table>"
    st.markdown(html, unsafe_allow_html=True)

# Display Retrieval table
st.markdown("## SSL Tasks")
st.markdown("Similar-sample Learning tasks require models to learn from a small set of similar demostration, therefore evaluating models' ability to retrieve similar samples.")
df = load_data("src/Retrieval_full_200.csv")
cols = ["Models", "1000", "2000", "4000", "8000", "16000", "32000", "64000", "128000", "avg", "avg.L"]
display_table(df, cols)

# Display Global Context Understanding table
st.markdown("## ASL Tasks")
st.markdown("All-Sample Learning tasks require models to learn from all the demostrations, therefore evaluating models' ability to understand the global context.")
df = load_data("src/Global Context Understanding_full_200.csv")
cols = ["Models", "1000", "2000", "4000", "8000", "16000", "32000", "64000", "128000", "avg", "avg.L"]
display_table(df, cols)

st.markdown("## πŸ“š Abstract")
st.write(
"""
Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs?
We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. 
Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding.
We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (**SSL**): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (**ASL**): tasks that necessitate a deeper comprehension of all examples in the prompt. 
Lastly, we introduce ManyICLBench, a new many-shot ICL benchmark built on existing ICL tasks, to characterize models' abilities on both fronts, and we benchmark 12 LCLMs on it. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
"""
)
st.markdown("## Dataset Details")
st.markdown("""
| **Dataset** | **Task Category** | **Avg. Tokens / Shot** | **Max # of Shots** | **# of Tasks** |
| :--- | :--- | :--- | :--- | :--- |
| BANKING77 | Intent Classification | 13.13 | 5386 | 1 |
| GoEmotions | Emotion Classification | 15.85 | 5480 | 1 |
| DialogRE | Relation Classification | 233.27 | 395 | 1 |
| TREC | Question Classification | 11.25 | 6272 | 1 |
| CLINC150 | Intent Classification | 8.95 | 7252 | 1 |
| MATH | Math reasoning | [185.52, 407.90] | [286, 653] | 4 |
| GSM8K | Math reasoning | 55.78 | 784 | 1 |
| BBH | Reasoning | [48.27, 243.01] | [406, 2660] | 4 |
| GPQA | MQ - Science | [183.55, 367.02] | [314, 580] | 1 |
| ARC | MQ - Science | [61.54, 61.54] | [1997, 2301] | 2 |
| XLSUM | News Summarization | 621.32 | 220 | 1 |

The GPT-4o tokenizer is used to count tokens. The max # of shots is the number of shots that can fit into the 128k context window. For datasets with multiple subtasks, we list the range for each value.

**ASL Tasks**: banking77, dialogRE, TREC, CLINC150, and BBH_geometric_shapes

**SSL Tasks**: GSM8K, MATH tasks, XLSUM, GPQA_cot, ARC_challenge, BBH-dyck_languages, BBH-salient_translation_error_detection, and BBH-word_sorting.

""")
st.markdown('## 🤖 Submit Your Model')
st.write(
"""
👉 You can submit your model through the following link: [https://forms.gle/eWjzPDusDJSbXCCT7](https://forms.gle/eWjzPDusDJSbXCCT7)
"""
)

st.markdown("## πŸ“š Citation")
st.write("""
```bibtex
@article{zou2025manyshotincontextlearninglongcontext,
  title={On Many-Shot In-Context Learning for Long-Context Evaluation}, 
  author={Kaijian Zou and Muhammad Khalifa and Lu Wang},
  journal={arXiv preprint arXiv:2411.07130},
  year={2025}
}
```
""")