import streamlit as st
import pandas as pd
from PIL import Image
import base64
from io import BytesIO

# ─── Page config ──────────────────────────────────────────────────────────────
st.set_page_config(page_title="ExpertLongBench Leaderboard", layout="wide")

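# ─── Logo ─────────────────────────────────────────────────────────────────────
logo_image = Image.open("src/ExpertLongBench.png")

# Re-encode the logo as base64 PNG so it can be inlined in an <img> tag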
buffered = BytesIO()
logo_image.save(buffered, format="PNG")
img_data = base64.b64encode(buffered.getvalue()).decode("utf-8")

st.markdown(
    f"""
    <div class="logo-container" style="display:flex; justify-content: center;">
        <img src="data:image/png;base64,{img_data}" style="width:50%; max-width:700px;"/>
    </div>
    """,
    unsafe_allow_html=True
)
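
# ─── Illustrative sketch (not used by the app) ────────────────────────────────
# The logo above is inlined by base64-encoding PNG bytes into a data URI.
# The helper below sketches that one step in isolation. Assumptions: the name
# `png_bytes_to_data_uri` is hypothetical, and it takes raw PNG bytes directly
# rather than re-saving a PIL image through BytesIO as the code above does.
def png_bytes_to_data_uri(png_bytes: bytes) -> str:
    """Return a data URI suitable for an <img src="..."> attribute."""
    import base64  # local import keeps the sketch self-contained

    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

# Possible usage, assuming the file is already a PNG:
#   from pathlib import Path
#   uri = png_bytes_to_data_uri(Path("src/ExpertLongBench.png").read_bytes())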

st.markdown(
    '''
    <div class="header">
        <br/>
        <p style="font-size:22px;">
        ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation with Structured Checklists
        </p>
        <p style="font-size:20px;">
              💻 <a href="https://github.com/launchnlp/ExpertLongBench">GitHub</a> | 🤗 <a href="https://huggingface.co/datasets/launch/ExpertLongBench">Public Dataset</a> | 📑 <a href="https://arxiv.org/abs/2506.01241">Paper</a> |    
               ⚙️ <strong>Version</strong>: <strong>V1</strong> | <strong># Models</strong>: 12 | Updated: <strong>May 2025</strong>
        </p>
    </div>
    ''',
    unsafe_allow_html=True
)

# ─── Load data ────────────────────────────────────────────────────────────────
@st.cache_data
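def load_data(path="src/models.json"):
    """Load per-model scores (JSON Lines) and add the average and per-column ranks."""
    df = pd.read_json(path, lines=True)
    score_cols = [f"T{i}" for i in range(1, 12)]  # tasks T1..T11
    df["Avg"] = df[score_cols].mean(axis=1).round(1)
    # Compute rank per column (1 = best; tied scores share the best rank)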
    for col in score_cols + ["Avg"]:
        df[f"{col}_rank"] = df[col].rank(ascending=False, method="min").astype(int)
    return df
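
# ─── Illustrative sketch (not used by the app) ────────────────────────────────
# What the ranking in load_data works out to: a pure-Python equivalent of
# `Series.rank(ascending=False, method="min")` for a list with no missing
# values. The name `min_rank_desc` is hypothetical and exists only to show the
# tie-handling rule.
def min_rank_desc(scores):
    """Rank scores so the highest gets 1; tied scores share the best rank."""
    # A score's rank is one more than the number of strictly higher scores.
    return [1 + sum(other > s for other in scores) for s in scores]

# Example: min_rank_desc([30.0, 25.5, 30.0, 10.0]) gives [1, 3, 1, 4];
# the two 30.0 entries tie for rank 1 and rank 2 is skipped.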

df = load_data()

# Precompute max ranks for color scaling
score_cols = [f"T{i}" for i in range(1, 12)] + ["Avg"]
max_ranks = {col: df[f"{col}_rank"].max() for col in score_cols}

# ─── Tabs ──────────────────────────────────────────────────────────────────────
tab1, tab2 = st.tabs(["Leaderboard", "Benchmark Details"])

with tab1:
    # st.markdown("**Leaderboard:** higher scores shaded green; best models bolded.")
    # Build raw HTML table
    cols = ["Model"] + score_cols
    html = "<table style='border-collapse:collapse; width:100%; font-size:14px;'>"
    # header
    html += "<tr>" + "".join(f"<th style='padding:6px;'>{col}</th>" for col in cols) + "</tr>"
    # rows
    for _, row in df.iterrows():
        html += "<tr>"
        for col in cols:
            val = row[col]
            if col == "Model":
                html += f"<td style='padding:6px; text-align:left;'>{val}</td>"
            else:
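                rank = int(row[f"{col}_rank"])
                # norm is 1.0 for the best rank and 0.0 for the worst;
                # `or 1` guards against division by zero when every model ties
                norm = 1 - (rank - 1) / ((max_ranks[col] - 1) or 1)
                # interpolate white (255,255,255) → green (182,243,182) as norm goes 0 → 1
                r = int(255 - norm*(255-182))
                g = int(255 - norm*(255-243))
                b = int(255 - norm*(255-182))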
                bold = "font-weight:bold;" if rank == 1 else ""
                style = f"background-color:rgb({r},{g},{b}); padding:6px; {bold}"
                html += f"<td style='{style}'>{val}</td>"
        html += "</tr>"
    html += "</table>"
    st.markdown(html, unsafe_allow_html=True)
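
# ─── Illustrative sketch (not used by the app) ────────────────────────────────
# The cell shading in the leaderboard loop maps a rank to a background color.
# The helper below sketches that mapping as a standalone function. Assumptions:
# the name `rank_to_rgb` and its `best`/`worst` parameters are hypothetical;
# the default colors are the green (182,243,182) and white (255,255,255)
# named in the loop's comment.
def rank_to_rgb(rank, max_rank, best=(182, 243, 182), worst=(255, 255, 255)):
    """Linearly interpolate from `worst` (last rank) to `best` (rank 1)."""
    # norm is 1.0 for rank 1 and 0.0 for max_rank; `or 1` avoids dividing by zero.
    norm = 1 - (rank - 1) / ((max_rank - 1) or 1)
    return tuple(int(w - norm * (w - b)) for b, w in zip(best, worst))

# Possible usage inside the loop: r, g, b = rank_to_rgb(rank, max_ranks[col])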

with tab2:
    pipeline_image = Image.open("src/pipeline.png")
    buffered2 = BytesIO()
    pipeline_image.save(buffered2, format="PNG")
    img_data_pipeline = base64.b64encode(buffered2.getvalue()).decode("utf-8")
    st.markdown("## Abstract")
    st.write(
    """
    The paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications.
    Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task includes rubrics, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR to support accurate evaluation of long-form model outputs on our benchmark.

    For fine-grained, expert-aligned evaluation, CLEAR derives checklists from model outputs and reference outputs by extracting information corresponding to items on the task-specific rubrics.
    Checklist items for model outputs are then compared with corresponding items for reference outputs to assess their correctness, enabling grounded evaluation.  

    We benchmark 11 large language models (LLMs) and analyze components in CLEAR, showing that:
    (1) existing LLMs, with the top performer achieving only a 26.8% F1 score, require significant improvement for expert-level tasks;
    (2) models can generate content corresponding to the required aspects, though often not accurately; and
    (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable and low-cost usage.
    """
    )

    
    st.markdown("## Pipeline")
    st.markdown(
    f"""
    <div class="logo-container" style="display:flex; justify-content: center;">
        <img src="data:image/png;base64,{img_data_pipeline}" style="width:90%; max-width:900px;"/>
    </div>
    """,
    unsafe_allow_html=True
    )