# QANTA 2025 Leaderboard Metrics Manual
This document explains the metrics displayed on the QANTA 2025 Human-AI Cooperative QA competition leaderboard.
## Tossup Round Metrics
Tossup rounds measure an AI system's ability to answer questions as they are being read, scored in direct competition against recorded human buzz points:
| Metric | Description |
| --- | --- |
| Submission | The username and model name of the submission (format: `username/model_name`). |
| Expected Score ⬆️ | Average points scored per tossup question, using the point scale: +1 for a correct answer, -0.5 for an incorrect buzz, 0 for no buzz. Scores are computed by simulating real competition against human buzz point data: the model only scores if it buzzes before the human, and is penalized if it buzzes incorrectly before the human (a scoring sketch follows this table). |
| Buzz Precision | Percentage of correct answers among the questions on which the model decides to buzz in. Displayed as a percentage (e.g., 65.0%). |
| Buzz Frequency | Percentage of questions on which the model buzzes in. Displayed as a percentage (e.g., 65.0%). |
| Buzz Position | Average token position in the question at which the model decides to answer. Lower values indicate earlier buzzing. |
| Win Rate w/ Humans | Percentage of questions the model answers correctly before the human opponent buzzes correctly, in simulated head-to-head play. |
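
For concreteness, here is a minimal sketch of how the tossup metrics could be computed from per-question predictions and human buzz points. It is illustrative only, not the official evaluation code: the record fields (`model_buzz_pos`, `model_correct`, `human_buzz_pos`) and the tie-breaking rule (ties go to the human) are assumptions.

```python
# Illustrative sketch of the tossup scoring simulation (not the official
# evaluation code). Field names and tie-breaking rules are assumptions.
from statistics import mean

def tossup_metrics(results):
    """results: list of dicts with keys
       'model_buzz_pos' - token index where the model buzzes (None = no buzz)
       'model_correct'  - whether the model's answer at that point is correct
       'human_buzz_pos' - token index where the human opponent buzzes correctly
    """
    scores, buzz_correct, buzz_positions, wins = [], [], [], []
    for r in results:
        buzzed = r["model_buzz_pos"] is not None
        if buzzed:
            buzz_correct.append(r["model_correct"])
            buzz_positions.append(r["model_buzz_pos"])
        # The model only scores (or is penalized) if it buzzes before the human.
        beats_human = buzzed and r["model_buzz_pos"] < r["human_buzz_pos"]
        if beats_human and r["model_correct"]:
            scores.append(1.0)    # correct answer before the human: +1
            wins.append(True)
        elif beats_human:
            scores.append(-0.5)   # incorrect buzz before the human: -0.5
            wins.append(False)
        else:
            scores.append(0.0)    # no buzz, or the human buzzes first: 0
            wins.append(False)
    return {
        "expected_score": mean(scores),
        "buzz_precision": mean(buzz_correct) if buzz_correct else 0.0,
        "buzz_frequency": len(buzz_positions) / len(results),
        "buzz_position": mean(buzz_positions) if buzz_positions else None,
        "win_rate_w_humans": mean(wins),
    }
```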
## Bonus Round Metrics
Bonus rounds test an AI system's ability to answer multi-part questions and to provide explanations that help a collaborating player. The leaderboard measures the model's effect on a simulated Quizbowl player (here, gpt-4o-mini):
| Metric | Description |
| --- | --- |
| Submission | The username and model name of the submission (format: `username/model_name`). |
| Effect | The overall effect of the model's responses on the target Quizbowl player's accuracy: the difference between the net accuracy of the gpt-4o-mini + model team and the accuracy of the gpt-4o-mini player alone, measured on the bonus set. In the team setting, the submitted model produces a guess, a confidence, and an explanation, and the gpt-4o-mini player uses these (or falls back on its own answer) to provide the final guess (a sketch follows this table). |
| Question Acc | Percentage of bonus questions where all parts were answered correctly. |
| Part Acc | Percentage of individual bonus question parts answered correctly across all questions. |
| Calibration | How well the model's confidence matches its correctness: the average absolute difference between the confidence score (between 0 and 1) and the binary correctness score (1 for correct, 0 for incorrect), computed over the bonus set. Lower is better. |
| Adoption | The percentage of times the target gpt-4o-mini player adopts the model's guess, confidence, and explanation for its final guess, as opposed to using its own. |
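
Similarly, here is a minimal sketch of the bonus-round metrics under assumed per-part records. The field names, the part-level computation of Effect, and the aggregation details are illustrative assumptions, not the official pipeline.

```python
# Illustrative sketch of the bonus-round metrics (not the official pipeline).
# Record fields and aggregation details are assumptions.
from statistics import mean

def bonus_metrics(questions):
    """questions: list of bonus questions; each question is a list of parts,
       and each part is a dict with keys
       'model_correct' - the submitted model's part-level correctness
       'confidence'    - the model's confidence in [0, 1]
       'adopted'       - whether gpt-4o-mini adopted the model's guess
       'team_correct'  - correctness of the gpt-4o-mini + model team's answer
       'solo_correct'  - correctness of gpt-4o-mini answering alone
    """
    parts = [p for q in questions for p in q]
    return {
        # Effect: team accuracy minus the solo gpt-4o-mini accuracy.
        "effect": mean(p["team_correct"] for p in parts)
                  - mean(p["solo_correct"] for p in parts),
        # Question Acc: all parts of a question answered correctly.
        "question_acc": mean(all(p["model_correct"] for p in q) for q in questions),
        # Part Acc: per-part accuracy over all questions.
        "part_acc": mean(p["model_correct"] for p in parts),
        # Calibration: mean absolute gap between confidence and correctness.
        "calibration": mean(abs(p["confidence"] - p["model_correct"]) for p in parts),
        # Adoption: how often gpt-4o-mini goes with the model's guess.
        "adoption": mean(p["adopted"] for p in parts),
    }
```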
## Understanding the Competition
QANTA (Question Answering is Not a Trivial Activity) is a competition for building AI systems that can answer quiz bowl questions. Quiz bowl is a trivia competition format with:
- Tossup questions: Paragraph-length clues read in sequence where players can buzz in at any point to answer. The leaderboard simulates real competition by using human buzz point data for scoring.
- Bonus questions: Multi-part questions that test depth of knowledge in related areas. The leaderboard measures the effect of models in a team setting with a simulated human (gpt-4o-mini).
The leaderboard tracks how well AI models perform on both question types across different evaluation datasets, using these updated, competition-realistic metrics.