Tennis Match Win Streak Analysis
This project analyzes the most challenging path to a historical tennis winning streak using machine learning and deep learning techniques. The model incorporates advanced feature engineering, data preprocessing, and a combination of neural networks and ensemble models to accurately assess the level of competition faced along a streak.
Table of Contents
- Features
- Installation
- Usage
- Data
- Model Architecture
- Hyperparameter Tuning
- Results
- Contributing
- License
Features
- Advanced feature engineering, including rank and Elo rating differences (see the sketch after this list).
- Data preprocessing with error handling and missing value management.
- PyTorch Lightning framework for building and training neural networks.
- Hyperparameter optimization using Optuna.
- Ensemble methods for improved accuracy.
- Winning streak analysis using clustering techniques.
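As a rough illustration of the rank and Elo difference features, the snippet below adds both to a match DataFrame. It is a minimal sketch assuming the column names listed under Data; the build_features helper is hypothetical and not part of the project code.

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add rank and Elo rating difference columns to a match DataFrame.

    Assumes the columns listed under Data (winner_rank, loser_rank,
    winner_eloRating, loser_eloRating) are present and numeric.
    """
    out = df.copy()
    # Larger values indicate the winner beat a better-ranked / higher-rated opponent.
    out["rank_diff"] = out["winner_rank"] - out["loser_rank"]
    out["elo_diff"] = out["loser_eloRating"] - out["winner_eloRating"]
    return out
```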
Installation
To get started with this project, you need to have Python 3.x installed. You can then install the required packages using pip:
pip install -r requirements.txt
Usage
Place your match data CSV files (named PlayerMatches2.csv to PlayerMatches15.csv) in the project directory.
Run the script to load the data, preprocess it, and train the models:
python main.py
The script will save the best model and configuration for later analysis.
Data
The project uses historical player match data in CSV format. Each file should contain the following columns:
- date
- tournament
- winner_name
- winner_rank
- winner_eloRating
- loser_name
- loser_rank
- loser_eloRating
Optional columns for enhanced feature engineering can also be included.
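A hedged sketch of how these files might be loaded and cleaned with pandas is shown below; the exact preprocessing lives in main.py and may differ, and the error handling and fill values here are assumptions.

```python
import pandas as pd

# Load PlayerMatches2.csv through PlayerMatches15.csv, skipping files that
# are missing or unreadable, then concatenate them into one DataFrame.
frames = []
for i in range(2, 16):
    path = f"PlayerMatches{i}.csv"
    try:
        frames.append(pd.read_csv(path, parse_dates=["date"]))
    except (FileNotFoundError, pd.errors.ParserError) as exc:
        print(f"Skipping {path}: {exc}")

matches = pd.concat(frames, ignore_index=True)

# Basic missing-value management: drop rows missing the core identity columns
# and fill missing rank values with a sentinel (assumed convention).
matches = matches.dropna(subset=["winner_name", "loser_name"])
matches[["winner_rank", "loser_rank"]] = matches[["winner_rank", "loser_rank"]].fillna(9999)
```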
Model Architecture
The project utilizes a custom neural network with:
- Categorical embeddings for player names and other categorical features.
- Fully connected layers to process both embedded and numerical input features.
- Dropout layers for regularization.
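The sketch below shows roughly what such a model can look like in PyTorch Lightning. The MatchNet name, layer sizes, and the MSE training loss are illustrative assumptions rather than the project's actual implementation.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class MatchNet(pl.LightningModule):
    def __init__(self, cardinalities, num_numeric, embed_dim=16,
                 hidden_size=64, dropout=0.3, lr=1e-3):
        super().__init__()
        # One embedding table per categorical feature (e.g. player names).
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, embed_dim) for card in cardinalities
        )
        in_features = embed_dim * len(cardinalities) + num_numeric
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )
        self.lr = lr

    def forward(self, categorical, numeric):
        # categorical: (batch, num_cat) integer codes; numeric: (batch, num_numeric) floats.
        embedded = [emb(categorical[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embedded + [numeric], dim=1)
        return self.net(x).squeeze(-1)

    def training_step(self, batch, batch_idx):
        categorical, numeric, target = batch
        loss = nn.functional.mse_loss(self(categorical, numeric), target)
        self.log("train_mse", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```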
Hyperparameter Tuning
Hyperparameter optimization is performed using Optuna, allowing for fine-tuning of:
- Embedding dimensions
- Hidden layer sizes
- Learning rates
- Dropout rates
- Batch sizes
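A minimal sketch of such an Optuna study follows; the search ranges are assumptions, and train_and_evaluate is a hypothetical helper standing in for the actual training routine in main.py.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Sample the hyperparameters listed above.
    embed_dim = trial.suggest_int("embed_dim", 4, 64)
    hidden_size = trial.suggest_int("hidden_size", 32, 512)
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])

    # Hypothetical helper: trains the model with these settings and
    # returns the validation MSE to minimize.
    return train_and_evaluate(embed_dim, hidden_size, lr, dropout, batch_size)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```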
Results
The model's performance is evaluated using mean squared error (MSE) on the validation set. Ensemble models are also trained and compared for additional insights.
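For illustration, the snippet below compares a simple averaging ensemble against a single model on validation MSE; the random arrays are placeholders standing in for real model predictions and targets.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder validation targets and predictions from two models; in the
# project these would come from the trained networks and ensemble members.
rng = np.random.default_rng(0)
y_val = rng.normal(size=100)
preds_nn = y_val + rng.normal(scale=0.5, size=100)
preds_gbm = y_val + rng.normal(scale=0.6, size=100)

# Simple averaging ensemble of the two prediction sets.
ensemble_preds = np.mean([preds_nn, preds_gbm], axis=0)

print("Single-model MSE:", mean_squared_error(y_val, preds_nn))
print("Ensemble MSE:    ", mean_squared_error(y_val, ensemble_preds))
```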
Contributing
Contributions are welcome! If you have suggestions for improvements or new features, feel free to submit a pull request or open an issue.
License
This project is licensed under the MIT License. See the LICENSE file for more information.