	# Modified ELO Rating System for AI Model Arena

	This document outlines the mathematical foundations and implementation details of our modified ELO rating system, designed specifically for ranking AI models in competitive scenarios.

	## 1. Introduction to ELO

	The ELO rating system, originally developed for chess rankings, has been adapted for our AI model arena. The core principle remains: after each match, the winning model gains rating points while the losing model loses points. The magnitude of this exchange depends on the relative ratings of the two models and the outcome of the match.

	## 2. Initial Ratings

	Unlike traditional ELO systems that start all participants at the same rating, our system assigns initial ratings based on model size:

	Initial Rating = 1000 + (model_size * 100)

	This approach acknowledges that larger models often have inherent advantages due to their increased capacity and training data.
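The initialization rule can be sketched in a few lines. The source does not state the unit of `model_size`; the example below assumes billions of parameters, and the model names are hypothetical.

```python
def initial_rating(model_size: float) -> float:
    """Starting rating: 1000 plus 100 points per unit of model size."""
    return 1000 + model_size * 100


# Hypothetical models; sizes are assumed to be in billions of parameters.
example_sizes = {"tiny-1b": 1.0, "small-3b": 3.0, "medium-7b": 7.0}

for name, size in example_sizes.items():
    print(f"{name}: {initial_rating(size):.0f}")
```

Under that assumption, a 7B model would start 400 points above a 3B model, which is a large head start on the ELO scale.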

	## 3. Expected Score Calculation

	The expected score for each model in a match is calculated using the standard ELO formula:

	E(A) = 1 / (1 + 10^((R_B - R_A) / 400))

	Where:

	- E(A) is the expected score for model A
	- R_A is the current rating of model A
	- R_B is the current rating of model B
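As a minimal sketch, the expected-score formula translates directly into Python. The function name `expected_score` is chosen here for illustration and is not taken from the arena's code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B (standard ELO formula)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


# Equal ratings give an even match.
print(expected_score(1500, 1500))

# A 400-point deficit gives the lower-rated model an expected score of 1/11.
print(expected_score(1300, 1700))
```

The two expected scores in a match always sum to 1, so E(B) = 1 - E(A).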

	## 4. K-Factor Modification

	In standard ELO, the K-factor is a constant that determines the maximum change in rating after a single game. Our system modifies the K-factor based on the size difference between the competing models:

	k_factor = min(64, 32 * (1 + (loser_size - winner_size) / max_size))

	Where:

	- loser_size is the size of the losing model
	- winner_size is the size of the winning model
	- max_size is the size of the largest model in the arena

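	This modification produces larger rating changes when a smaller model defeats a larger one (K above 32) and smaller changes when a larger model defeats a smaller one (K below 32), reflecting how surprising each outcome is. When both models are the same size, K is exactly 32.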
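The K-factor rule can be sketched as follows. The sizes in the example (3, 7, and a `max_size` of 14) are hypothetical values picked to show the three cases.

```python
def k_factor(winner_size: float, loser_size: float, max_size: float) -> float:
    """Size-aware K-factor: grows when the winner is smaller than the loser."""
    return min(64, 32 * (1 + (loser_size - winner_size) / max_size))


MAX_SIZE = 14  # hypothetical largest model in the arena

print(k_factor(winner_size=3, loser_size=7, max_size=MAX_SIZE))  # upset: K > 32
print(k_factor(winner_size=7, loser_size=7, max_size=MAX_SIZE))  # same size: K = 32
print(k_factor(winner_size=7, loser_size=3, max_size=MAX_SIZE))  # expected result: K < 32
```

As long as every size is positive and no larger than `max_size`, the size term stays strictly below 1, so K stays below 64 and the `min(64, ...)` cap acts only as a safeguard.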

	## 5. Rating Update Formula

	After each match, the ratings are updated using the following formula:

	R'_A = R_A + k_factor * (S_A - E(A))

	Where:

	- R'_A is the new rating for model A
	- R_A is the current rating for model A
	- S_A is the actual score (1 for win, 0 for loss)
	- E(A) is the expected score for model A
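Putting the pieces together, a single match update might look like the sketch below. It assumes the same `k_factor` is applied to both models, which the text does not state explicitly; the function names and the example sizes are illustrative.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def k_factor(winner_size: float, loser_size: float, max_size: float) -> float:
    """Size-aware K-factor from the previous section."""
    return min(64, 32 * (1 + (loser_size - winner_size) / max_size))


def update_ratings(winner_rating, loser_rating, winner_size, loser_size, max_size):
    """Return (new_winner_rating, new_loser_rating) after one match.

    Assumption: both models are updated with the same K-factor.
    """
    k = k_factor(winner_size, loser_size, max_size)
    new_winner = winner_rating + k * (1 - expected_score(winner_rating, loser_rating))
    new_loser = loser_rating + k * (0 - expected_score(loser_rating, winner_rating))
    return new_winner, new_loser


# Hypothetical upset: a size-3 model (rated 1300) beats a size-7 model (rated 1700).
new_w, new_l = update_ratings(1300, 1700, winner_size=3, loser_size=7, max_size=14)
print(round(new_w, 1), round(new_l, 1))  # roughly 1337.4 and 1662.6
```

Because both sides share one K-factor and the expected scores sum to 1, the points gained by the winner equal the points lost by the loser in this sketch.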

	## 6. Impact Scores

	To provide additional insight into model performance, we calculate two impact scores:

	### Positive Impact

	positive_impact += wins * (1 + max(0, (opponent_size - model_size) / max_size))

	Each win contributes at least 1 to this score, and more when the opponent is larger, so the score rewards victories over bigger models.

	### Negative Impact

	negative_impact += losses * (1 + max(0, (model_size - opponent_size) / max_size))

	Each loss contributes at least 1 to this score, and more when the opponent is smaller, so a high value indicates potential underperformance.
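One way to accumulate both impact scores is shown below. The sketch assumes the formulas are applied once per match (so `wins` or `losses` is 1 for each recorded result); the `ImpactTracker` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ImpactTracker:
    """Accumulates impact scores for one model (illustrative helper)."""
    model_size: float
    max_size: float
    positive_impact: float = 0.0
    negative_impact: float = 0.0

    def record(self, opponent_size: float, won: bool) -> None:
        """Apply the per-match impact formula for a single result."""
        if won:
            bonus = max(0, (opponent_size - self.model_size) / self.max_size)
            self.positive_impact += 1 + bonus
        else:
            penalty = max(0, (self.model_size - opponent_size) / self.max_size)
            self.negative_impact += 1 + penalty


# Hypothetical size-3 model in an arena whose largest model has size 14.
tracker = ImpactTracker(model_size=3, max_size=14)
tracker.record(opponent_size=7, won=True)   # win over a larger model: 1 + 4/14
tracker.record(opponent_size=1, won=True)   # win over a smaller model: 1
tracker.record(opponent_size=1, won=False)  # loss to a smaller model: 1 + 2/14
print(tracker.positive_impact, tracker.negative_impact)
```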

	## 7. Advantages of Our Modified System

	1. Size-Aware Initialization: Larger models start with higher ratings, reflecting their potential advantages.
	2. Dynamic K-Factor: The K-factor adapts based on model size differences, allowing for more significant rating changes in upset scenarios.
	3. Impact Scores: These provide additional context beyond the raw ELO rating, helping to interpret a model's performance against various opponents.

	## 8. Limitations and Considerations

	1. Sensitivity to Initial Conditions: The system's reliance on model size for initialization may need refinement as more data is collected.
	2. Potential for Rating Inflation: As with many ELO-based systems, ratings can drift over time, and here new models also enter at size-dependent ratings rather than a common baseline. Periodic resets or normalization may be necessary.
	3. Assumption of Transitive Performance: The system assumes that if A beats B and B beats C, then A is likely to beat C. This may not always hold true in complex AI model comparisons.
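The normalization mentioned in point 2 is not specified further. One simple possibility, offered purely as an assumption, is to shift every rating by the same amount so that the pool average returns to a chosen target. The function `recenter`, the target of 1500, and the model names below are all hypothetical.

```python
def recenter(ratings: dict, target_mean: float) -> dict:
    """Shift all ratings by a constant so their mean equals target_mean."""
    shift = target_mean - sum(ratings.values()) / len(ratings)
    return {name: rating + shift for name, rating in ratings.items()}


# Hypothetical pool whose average has drifted to 1800.
pool = {"model-a": 1300.0, "model-b": 1700.0, "model-c": 2400.0}
normalized = recenter(pool, target_mean=1500.0)
print(normalized)
```

A constant shift leaves every rating difference unchanged, so expected scores between any two models are the same before and after normalization.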

	## Conclusion

	Our modified ELO system aims to provide a fair and insightful ranking mechanism for AI models, taking into account the unique characteristics of these competitions. As we gather more data and insights, we may further refine this system to improve its accuracy and interpretability.