# QANTA 2025 Leaderboard Metrics Manual
This document explains the metrics displayed on the QANTA 2025 Human-AI Cooperative QA competition leaderboard.
## Tossup Round Metrics
Tossup rounds measure an AI system's ability to answer questions as they are being read, in direct competition with recorded human buzz points:
| Metric | Description |
|--------|-------------|
| **Submission** | The username and model name of the submission (format: `username/model_name`) |
| **Expected Score ⬆️** | Average points scored per tossup question, using the point scale: **+1 for a correct answer, -0.5 for an incorrect buzz, 0 for no buzz**. Scores are computed by simulating a real match against recorded human buzz points: the model scores only if it buzzes before the human, and is penalized if that early buzz is incorrect (see the sketch after this table). |
| **Buzz Precision** | Percentage of correct answers when the model decides to buzz in. Displayed as a percentage (e.g., 65.0%). |
| **Buzz Frequency** | Percentage of questions where the model buzzes in. Displayed as a percentage (e.g., 65.0%). |
| **Buzz Position** | Average (token) position in the question at which the model decides to answer. Lower values indicate earlier buzzing. |
| **Win Rate w/ Humans** | Percentage of questions where the model beats the human opponent: it buzzes with the correct answer before the opponent buzzes in correctly. |
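
As a rough illustration of the simulated match, the sketch below computes Expected Score and Buzz Precision from per-question records. The record fields (`model_buzz`, `model_correct`, `human_buzz`) are hypothetical names used only for this example, not the leaderboard's actual data format or evaluation code.

```python
# Illustrative sketch only, not the official evaluation code.
# Each record describes one tossup question:
#   model_buzz:    token index where the model buzzes (None if it never buzzes)
#   model_correct: whether the model's answer at its buzz point is correct
#   human_buzz:    token index where the recorded human opponent buzzes correctly
#                  (None if the human never buzzed)

def expected_score(records):
    """Average points per tossup: +1 correct buzz, -0.5 incorrect buzz, 0 no buzz.
    The model can only score if it buzzes before the human opponent."""
    total = 0.0
    for r in records:
        if r["model_buzz"] is None:
            continue  # no buzz -> 0 points for this question
        if r["human_buzz"] is not None and r["model_buzz"] >= r["human_buzz"]:
            continue  # human buzzed first -> the model never gets to answer
        total += 1.0 if r["model_correct"] else -0.5
    return total / len(records)

def buzz_precision(records):
    """Fraction of the model's buzzes that are correct (independent of the human)."""
    buzzes = [r for r in records if r["model_buzz"] is not None]
    return sum(r["model_correct"] for r in buzzes) / len(buzzes) if buzzes else 0.0
```

In the real evaluation the buzz decision is made incrementally as question tokens are revealed; this sketch assumes those decisions have already been collapsed into a single buzz position per question.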
## Bonus Round Metrics
Bonus rounds test an AI system's ability to answer multi-part questions and to explain its answers well enough to help a teammate. The leaderboard measures the model's effect on a simulated Quizbowl player (here, `gpt-4o-mini`):
| Metric | Description |
|--------|-------------|
| **Submission** | The username and model name of the submission (format: `username/model_name`) |
| **Effect** | The overall effect of the model's responses on the target Quizbowl player's accuracy. Specifically, this is the difference between the accuracy of the gpt-4o-mini + model team and that of the gpt-4o-mini player alone, measured on the bonus set. In the team setting, the submitted model provides its guess, confidence, and explanation, and the gpt-4o-mini player uses them to decide its final guess (see the sketch after this table). |
| **Question Acc** | Percentage of bonus questions where all parts were answered correctly. |
| **Part Acc** | Percentage of individual bonus question parts answered correctly across all questions. |
| **Calibration** | The calibration of the model's confidence in its answers. Specifically, this is calculated as the average absolute difference between the confidence score (between 0 and 1) and the binary correctness score (1 for correct, 0 for incorrect) over the bonus set. |
| **Adoption** | The percentage of times the gpt-4o-mini player adopts the model's guess, confidence, and explanation as its final guess, as opposed to using its own. |
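
A minimal sketch of the Effect and Calibration calculations, assuming hypothetical per-part records with the solo gpt-4o-mini correctness, the team correctness, the model's stated confidence, and the model's own correctness. Field names are illustrative, and the official metric may aggregate at the question level rather than per part.

```python
# Illustrative sketch only, not the official evaluation code.
# Each record describes one bonus question part:
#   solo_correct:  gpt-4o-mini answering alone got this part right (True/False)
#   team_correct:  the gpt-4o-mini + model team got this part right (True/False)
#   confidence:    the model's stated confidence in its own guess, in [0, 1]
#   model_correct: whether the model's own guess for this part was correct

def effect(records):
    """Team accuracy minus solo accuracy; positive means the model helps."""
    team_acc = sum(r["team_correct"] for r in records) / len(records)
    solo_acc = sum(r["solo_correct"] for r in records) / len(records)
    return team_acc - solo_acc

def calibration_error(records):
    """Average |confidence - correctness| over the bonus set; lower is better."""
    return sum(abs(r["confidence"] - float(r["model_correct"])) for r in records) / len(records)
```

Lower calibration error means the model's stated confidence more closely tracks whether its guesses are actually correct.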
## Understanding the Competition
QANTA (Question Answering is Not a Trivial Activity) is a competition for building AI systems that can answer quiz bowl questions. Quiz bowl is a trivia competition format with:
1. **Tossup questions**: Paragraph-length clues read in sequence where players can buzz in at any point to answer. The leaderboard simulates real competition by using human buzz point data for scoring.
2. **Bonus questions**: Multi-part questions that test depth of knowledge in related areas. The leaderboard measures the effect of models in a team setting with a simulated human (gpt-4o-mini).
The leaderboard tracks how well AI models perform on both question types across different evaluation datasets, using these updated, competition-realistic metrics.