# QANTA 2025 Leaderboard Metrics Manual
This document explains the metrics displayed on the QANTA 2025 Human-AI Cooperative QA competition leaderboard.
## Tossup Round Metrics
Tossup rounds measure an AI system's ability to answer questions as they are being read, in direct competition with recorded human buzz points:
| Metric | Description |
|--------|-------------|
| **Submission** | The username and model name of the submission (format: `username/model_name`) |
| **Expected Score ⬆️** | Average points scored per tossup question, using the point scale: **+1 for a correct answer, -0.5 for an incorrect buzz, 0 for no buzz**. Scores are computed by simulating a real match against recorded human buzz points: the model scores only if it buzzes before the human, and is penalized if that early buzz is incorrect (see the sketch after this table). |
| **Buzz Precision** | Percentage of correct answers when the model decides to buzz in. Displayed as a percentage (e.g., 65.0%). |
| **Buzz Frequency** | Percentage of questions where the model buzzes in. Displayed as a percentage (e.g., 65.0%). |
| **Buzz Position** | Average (token) position in the question at which the model decides to answer. Lower values indicate earlier buzzing. |
| **Win Rate w/ Humans** | Percentage of questions where the model beats the human opponent: it buzzes with the correct answer before the opponent buzzes in correctly. |
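
As a rough illustration of the simulated match, the sketch below computes Expected Score and Buzz Precision from per-question records. The record fields (`model_buzz`, `model_correct`, `human_buzz`) are hypothetical names used only for this example, not the leaderboard's actual data format or evaluation code.

```python
# Illustrative sketch only, not the official evaluation code.
# Each record describes one tossup question:
#   model_buzz:    token index where the model buzzes (None if it never buzzes)
#   model_correct: whether the model's answer at its buzz point is correct
#   human_buzz:    token index where the recorded human opponent buzzes correctly
#                  (None if the human never buzzed)

def expected_score(records):
    """Average points per tossup: +1 correct buzz, -0.5 incorrect buzz, 0 no buzz.
    The model can only score if it buzzes before the human opponent."""
    total = 0.0
    for r in records:
        if r["model_buzz"] is None:
            continue  # no buzz -> 0 points for this question
        if r["human_buzz"] is not None and r["model_buzz"] >= r["human_buzz"]:
            continue  # human buzzed first -> the model never gets to answer
        total += 1.0 if r["model_correct"] else -0.5
    return total / len(records)

def buzz_precision(records):
    """Fraction of the model's buzzes that are correct (independent of the human)."""
    buzzes = [r for r in records if r["model_buzz"] is not None]
    return sum(r["model_correct"] for r in buzzes) / len(buzzes) if buzzes else 0.0
```

In the real evaluation the buzz decision is made incrementally as question tokens are revealed; this sketch assumes those decisions have already been collapsed into a single buzz position per question.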
## Bonus Round Metrics
Bonus rounds test an AI system's ability to answer multi-part questions and to explain its answers well enough to help a teammate. The leaderboard measures the model's effect on a simulated Quizbowl player (here, `gpt-4o-mini`):
| Metric | Description |
|--------|-------------|
| **Submission** | The username and model name of the submission (format: `username/model_name`) |
| **Effect** | The overall effect of the model's responses on the target Quizbowl player's accuracy. Specifically, this is the difference between the accuracy of the gpt-4o-mini + model team and that of the gpt-4o-mini player alone, measured on the bonus set. In the team setting, the submitted model provides its guess, confidence, and explanation, and the gpt-4o-mini player uses them to decide its final guess (see the sketch after this table). |
| **Question Acc** | Percentage of bonus questions where all parts were answered correctly. |
| **Part Acc** | Percentage of individual bonus question parts answered correctly across all questions. |
| **Calibration** | The calibration of the model's confidence in its answers. Specifically, this is calculated as the average absolute difference between the confidence score (between 0 and 1) and the binary correctness score (1 for correct, 0 for incorrect) over the bonus set. |
| **Adoption** | The percentage of times the gpt-4o-mini player adopts the model's guess, confidence, and explanation as its final guess, as opposed to using its own. |
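
A minimal sketch of the Effect and Calibration calculations, assuming hypothetical per-part records with the solo gpt-4o-mini correctness, the team correctness, the model's stated confidence, and the model's own correctness. Field names are illustrative, and the official metric may aggregate at the question level rather than per part.

```python
# Illustrative sketch only, not the official evaluation code.
# Each record describes one bonus question part:
#   solo_correct:  gpt-4o-mini answering alone got this part right (True/False)
#   team_correct:  the gpt-4o-mini + model team got this part right (True/False)
#   confidence:    the model's stated confidence in its own guess, in [0, 1]
#   model_correct: whether the model's own guess for this part was correct

def effect(records):
    """Team accuracy minus solo accuracy; positive means the model helps."""
    team_acc = sum(r["team_correct"] for r in records) / len(records)
    solo_acc = sum(r["solo_correct"] for r in records) / len(records)
    return team_acc - solo_acc

def calibration_error(records):
    """Average |confidence - correctness| over the bonus set; lower is better."""
    return sum(abs(r["confidence"] - float(r["model_correct"])) for r in records) / len(records)
```

Lower calibration error means the model's stated confidence more closely tracks whether its guesses are actually correct.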
## Understanding the Competition
QANTA (Question Answering is Not a Trivial Activity) is a competition for building AI systems that can answer quiz bowl questions. Quiz bowl is a trivia competition format with:
1. **Tossup questions**: Paragraph-length clues read in sequence where players can buzz in at any point to answer. The leaderboard simulates real competition by using human buzz point data for scoring.
2. **Bonus questions**: Multi-part questions that test depth of knowledge in related areas. The leaderboard measures the effect of models in a team setting with a simulated human (gpt-4o-mini).
The leaderboard tracks how well AI models perform on both question types across different evaluation datasets, using these updated, competition-realistic metrics.