Commit 5271c2e
Parent(s): 0e63702
Update README and UFC data, retrain models
Expanded and clarified the README with detailed usage instructions for scraping, prediction, and pipeline execution. Updated ufc_fights.csv with new event results, added output/last_event.json, and refreshed model artifacts and results to reflect retraining on the latest data.
- README.md +38 -8
- output/last_event.json +3 -0
- output/model_results.json +2 -2
- output/models/BernoulliNBModel.joblib +2 -2
- output/models/LGBMModel.joblib +2 -2
- output/models/LogisticRegressionModel.joblib +2 -2
- output/models/RandomForestModel.joblib +2 -2
- output/models/SVCModel.joblib +2 -2
- output/models/XGBoostModel.joblib +2 -2
- output/ufc_fights.csv +0 -0
- src/config.py +2 -2
- src/main.py +94 -3
- src/predict/main.py +38 -3
- src/predict/models.py +11 -3
- src/predict/pipeline.py +160 -11
- src/predict/predict_new.py +6 -1
- src/predict/preprocess.py +11 -0
- src/scrape/main.py +162 -5
- src/scrape/scrape_fighters.py +3 -3
- src/scrape/scrape_fights.py +51 -6
README.md
CHANGED
@@ -19,20 +19,50 @@ pinned: false
|
|
19 |
```bash
|
20 |
pip install -r requirements.txt
|
21 |
```
|
22 |
-
## Scraping:
|
23 |
-
Scrape ALL fight and fighter data from [ufcstats.com](http://ufcstats.com) up to the latest event and save them in `.csv` format
|
24 |
|
25 |
-
|
26 |
|
27 |
```bash
|
28 |
-
python -m src.
|
29 |
```
|
30 |
-
|
31 |
|
32 |
-
|
33 |
|
34 |
-
|
35 |
|
36 |
```bash
|
37 |
-
python -m src.
|
38 |
```
|
19 |
```bash
|
20 |
pip install -r requirements.txt
|
21 |
```
|
|
|
|
|
22 |
|
23 |
+
## Usage
|
24 |
|
25 |
+
### 1. Data Scraping
|
26 |
+
|
27 |
+
**Initial Setup (First Time):**
|
28 |
+
```bash
|
29 |
+
python -m src.main --pipeline scrape --scrape-mode full
|
30 |
+
```
|
31 |
+
Scrapes all historical fight data from ufcstats.com.
|
32 |
+
|
33 |
+
**Update Data (Regular Use):**
|
34 |
+
```bash
|
35 |
+
python -m src.main --pipeline scrape --scrape-mode update
|
36 |
+
```
|
37 |
+
Adds only the latest events to existing data.
|
38 |
+
|
39 |
+
### 2. Fight Prediction
|
40 |
+
|
41 |
+
**Use Existing Models (Fast):**
|
42 |
```bash
|
43 |
+
python -m src.main --pipeline predict
|
44 |
```
|
45 |
+
Loads saved models when they exist and retrains automatically if new fight data is detected.
|
46 |
|
47 |
+
**Force Retrain Models:**
|
48 |
+
```bash
|
49 |
+
python -m src.main --pipeline predict --force-retrain
|
50 |
+
```
|
51 |
+
Always retrains all models from scratch on the latest data. This is useful when the training procedure itself changes.
|
52 |
|
53 |
+
### 3. Complete Pipeline
|
54 |
|
55 |
```bash
|
56 |
+
python -m src.main --pipeline all --scrape-mode update
|
57 |
```
|
58 |
+
Runs scraping (update mode), analysis, and prediction in sequence.
|
59 |
+
|
60 |
+
## Model Performance
|
61 |
+
|
62 |
+
The system evaluates each model on the latest UFC event to produce realistic accuracy scores (typically 50-70% for fight prediction).
|
63 |
+
|
64 |
+
## Output
|
65 |
+
|
66 |
+
- **Data:** `output/ufc_fights.csv`, `output/ufc_fighters.csv`
|
67 |
+
- **Models:** `output/models/*.joblib`
|
68 |
+
- **Results:** `output/model_results.json`
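
The "test on the latest event" evaluation described under Model Performance can be sketched as a chronological hold-out. This is a minimal illustration, not the repository's implementation: the helper name `split_by_latest_events` and the sample rows are assumptions, while the `event_name` / `event_date` keys and the `%B %d, %Y` date format are taken from the pipeline code in this commit.

```python
from datetime import datetime

DATE_FORMAT = "%B %d, %Y"  # date format used for event_date in the pipeline code


def split_by_latest_events(fights, num_test_events=1):
    """Hold out the most recent events as a test set (hypothetical helper)."""
    ordered = sorted(
        fights, key=lambda f: datetime.strptime(f["event_date"], DATE_FORMAT)
    )
    # Walk from newest to oldest, collecting distinct event names.
    test_event_names = []
    for fight in reversed(ordered):
        if fight["event_name"] not in test_event_names:
            if len(test_event_names) == num_test_events:
                break
            test_event_names.append(fight["event_name"])
    train = [f for f in ordered if f["event_name"] not in test_event_names]
    test = [f for f in ordered if f["event_name"] in test_event_names]
    return train, test


# Made-up sample rows, for illustration only.
fights = [
    {"event_name": "Event A", "event_date": "January 06, 2024"},
    {"event_name": "Event B", "event_date": "February 10, 2024"},
    {"event_name": "Event B", "event_date": "February 10, 2024"},
]
train, test = split_by_latest_events(fights)
```

Splitting by event rather than by random row keeps every fight of the held-out card out of training, which is what makes the reported accuracy resemble a real forecast.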
|
output/last_event.json
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a6437dbe76de54ac99372958849c4fda0baab3fe5dae46844de8201f4df7ea50
|
3 |
+
size 168
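
The three lines above are a Git LFS pointer, not the JSON file itself: the repository stores only a version URL, a content hash, and a byte size. As a rough sketch (the function name `parse_lfs_pointer` and the returned dictionary keys are illustrative assumptions), such a pointer can be read like this:

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer into a dict (illustrative)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by a single space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields.get("oid", "").partition(":")
    return {
        "version": fields.get("version"),
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:a6437dbe76de54ac99372958849c4fda0baab3fe5dae46844de8201f4df7ea50
size 168
"""
info = parse_lfs_pointer(pointer)
```

The same layout applies to the `.joblib` and `model_results.json` entries below, so only their `oid` and `size` lines change when the models are retrained.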
|
output/model_results.json
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:bf8df1ba9e26fa98e34bfb1c773e66576cbf89152087c55b70921269c84f39d5
|
3 |
+
size 27286
|
output/models/BernoulliNBModel.joblib
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:5ff1f1701e009137de1325c65eda57ff32444f723b07d6bc9bf0dd5b87d4dd01
|
3 |
+
size 5344949
|
output/models/LGBMModel.joblib
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a2acd855ed50d393d06119fc0a3cff73e7a2e1affe2d387e631169b52e8083dd
|
3 |
+
size 6657224
|
output/models/LogisticRegressionModel.joblib
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:7a773552b7f1b166858ab1ff7bdf472e24b293279a8e24871de773b1a3de46e1
|
3 |
+
size 5517988
|
output/models/RandomForestModel.joblib
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:100ab12c17d233b9ac97e75a8d81cf339c0d7cbd7f17050005f535f2965a67cd
|
3 |
+
size 49715610
|
output/models/SVCModel.joblib
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:e4db6a11d4082ffa4d8626e485959c42868553380a7dabfc93db55bceaecd873
|
3 |
+
size 7204520
|
output/models/XGBoostModel.joblib
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:901938289fd8ac976f04be6ae72ba6ea9df9dcda4d6d37955f47bb9fdf2acd30
|
3 |
+
size 6070396
|
output/ufc_fights.csv
CHANGED
The diff for this file is too large to render.
|
|
src/config.py
CHANGED
@@ -5,6 +5,6 @@ MODELS_DIR = os.path.join(OUTPUT_DIR, 'models')
|
|
5 |
MODEL_RESULTS_PATH = os.path.join(OUTPUT_DIR, 'model_results.json')
|
6 |
FIGHTS_CSV_PATH = os.path.join(OUTPUT_DIR, 'ufc_fights.csv')
|
7 |
FIGHTERS_CSV_PATH = os.path.join(OUTPUT_DIR, 'ufc_fighters.csv')
|
8 |
-
UPCOMING_EVENTS_JSON_PATH = os.path.join(OUTPUT_DIR, 'upcoming_events.json')
|
9 |
EVENTS_JSON_PATH = os.path.join(OUTPUT_DIR, 'events.json')
|
10 |
-
|
|
|
|
5 |
MODEL_RESULTS_PATH = os.path.join(OUTPUT_DIR, 'model_results.json')
|
6 |
FIGHTS_CSV_PATH = os.path.join(OUTPUT_DIR, 'ufc_fights.csv')
|
7 |
FIGHTERS_CSV_PATH = os.path.join(OUTPUT_DIR, 'ufc_fighters.csv')
|
|
|
8 |
EVENTS_JSON_PATH = os.path.join(OUTPUT_DIR, 'events.json')
|
9 |
+
FIGHTERS_JSON_PATH = os.path.join(OUTPUT_DIR, 'fighters.json')
|
10 |
+
LAST_EVENT_JSON_PATH = os.path.join(OUTPUT_DIR, 'last_event.json')
|
src/main.py
CHANGED
@@ -1,5 +1,96 @@
|
|
|
|
|
|
|
|
1 |
|
|
|
|
|
2 |
|
3 |
-
|
4 |
-
|
5 |
-
|
|
1 |
+
import argparse
|
2 |
+
import sys
|
3 |
+
import os
|
4 |
|
5 |
+
# Add the current directory to Python path for imports
|
6 |
+
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
|
7 |
|
8 |
+
def main():
|
9 |
+
"""
|
10 |
+
Main entry point for the UFC data pipeline.
|
11 |
+
Supports scraping, analysis, and prediction workflows.
|
12 |
+
"""
|
13 |
+
parser = argparse.ArgumentParser(description="UFC Data Pipeline")
|
14 |
+
parser.add_argument(
|
15 |
+
'--pipeline',
|
16 |
+
type=str,
|
17 |
+
default='scrape',
|
18 |
+
choices=['scrape', 'analysis', 'predict', 'all'],
|
19 |
+
help="Pipeline to run: 'scrape', 'analysis', 'predict', or 'all'"
|
20 |
+
)
|
21 |
+
parser.add_argument(
|
22 |
+
'--scrape-mode',
|
23 |
+
type=str,
|
24 |
+
default='full',
|
25 |
+
choices=['full', 'update'],
|
26 |
+
help="Scraping mode: 'full' (complete scraping) or 'update' (latest events only)"
|
27 |
+
)
|
28 |
+
parser.add_argument(
|
29 |
+
'--num-events',
|
30 |
+
type=int,
|
31 |
+
default=5,
|
32 |
+
help="Number of latest events to scrape in update mode (default: 5)"
|
33 |
+
)
|
34 |
+
# Model management arguments for prediction pipeline
|
35 |
+
parser.add_argument(
|
36 |
+
'--use-existing-models',
|
37 |
+
action='store_true',
|
38 |
+
default=True,
|
39 |
+
help="Use existing saved models if available and no new data (default: True)."
|
40 |
+
)
|
41 |
+
parser.add_argument(
|
42 |
+
'--no-use-existing-models',
|
43 |
+
action='store_true',
|
44 |
+
default=False,
|
45 |
+
help="Force retrain all models from scratch, ignoring existing saved models."
|
46 |
+
)
|
47 |
+
parser.add_argument(
|
48 |
+
'--force-retrain',
|
49 |
+
action='store_true',
|
50 |
+
default=False,
|
51 |
+
help="Force retrain all models even if no new data is available."
|
52 |
+
)
|
53 |
+
|
54 |
+
args = parser.parse_args()
|
55 |
+
|
56 |
+
if args.pipeline in ['scrape', 'all']:
|
57 |
+
print("=== Running Scraping Pipeline ===")
|
58 |
+
from scrape.main import main as scrape_main
|
59 |
+
|
60 |
+
# Override sys.argv to pass arguments to scrape.main
|
61 |
+
original_argv = sys.argv
|
62 |
+
sys.argv = ['scrape_main', '--mode', args.scrape_mode, '--num-events', str(args.num_events)]
|
63 |
+
try:
|
64 |
+
scrape_main()
|
65 |
+
finally:
|
66 |
+
sys.argv = original_argv
|
67 |
+
|
68 |
+
if args.pipeline in ['analysis', 'all']:
|
69 |
+
print("\n=== Running ELO Analysis ===")
|
70 |
+
from analysis.elo import main as elo_main
|
71 |
+
elo_main()
|
72 |
+
|
73 |
+
if args.pipeline in ['predict', 'all']:
|
74 |
+
print("\n=== Running Prediction Pipeline ===")
|
75 |
+
from predict.main import main as predict_main
|
76 |
+
|
77 |
+
# Override sys.argv to pass model management arguments to predict.main
|
78 |
+
original_argv = sys.argv
|
79 |
+
predict_args = ['predict_main']
|
80 |
+
|
81 |
+
if args.no_use_existing_models:
|
82 |
+
predict_args.append('--no-use-existing-models')
|
83 |
+
elif args.use_existing_models:
|
84 |
+
predict_args.append('--use-existing-models')
|
85 |
+
|
86 |
+
if args.force_retrain:
|
87 |
+
predict_args.append('--force-retrain')
|
88 |
+
|
89 |
+
sys.argv = predict_args
|
90 |
+
try:
|
91 |
+
predict_main()
|
92 |
+
finally:
|
93 |
+
sys.argv = original_argv
|
94 |
+
|
95 |
+
if __name__ == '__main__':
|
96 |
+
main()
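
The entry point above swaps `sys.argv` by hand before calling each sub-pipeline and restores it in a `finally` block. One way to package that pattern is a small context manager. The sketch below is an assumption-laden illustration: `temporary_argv` and `fake_scrape_main` are hypothetical names, and `fake_scrape_main` only stands in for a sub-pipeline that parses `sys.argv` itself.

```python
import argparse
import sys
from contextlib import contextmanager


@contextmanager
def temporary_argv(argv):
    """Temporarily replace sys.argv, restoring it even if the body raises."""
    original = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = original


def fake_scrape_main():
    """Hypothetical stand-in for a sub-pipeline main() that reads sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["full", "update"], default="full")
    parser.add_argument("--num-events", type=int, default=5)
    return parser.parse_args()


before = list(sys.argv)
with temporary_argv(["scrape_main", "--mode", "update", "--num-events", "3"]):
    parsed = fake_scrape_main()
```

An alternative design would be for each sub-pipeline `main()` to accept an optional argument list and pass it to `parse_args`, which avoids touching global state at all.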
|
src/predict/main.py
CHANGED
@@ -1,6 +1,8 @@
|
|
1 |
import argparse
|
2 |
-
|
3 |
-
|
|
|
|
|
4 |
EloBaselineModel,
|
5 |
LogisticRegressionModel,
|
6 |
XGBoostModel,
|
@@ -23,8 +25,37 @@ def main():
|
|
23 |
choices=['detailed', 'summary'],
|
24 |
help="Type of report to generate: 'detailed' (file) or 'summary' (console)."
|
25 |
)
|
26 |
args = parser.parse_args()
|
27 |
|
28 |
# --- Define Models to Run ---
|
29 |
# Instantiate all the models you want to evaluate here.
|
30 |
models_to_run = [
|
@@ -38,7 +69,11 @@ def main():
|
|
38 |
]
|
39 |
# --- End of Model Definition ---
|
40 |
|
41 |
-
pipeline = PredictionPipeline(
|
|
|
|
|
|
|
|
|
42 |
|
43 |
try:
|
44 |
pipeline.run(detailed_report=(args.report == 'detailed'))
|
|
|
1 |
import argparse
|
2 |
+
|
3 |
+
# Use absolute imports to avoid relative import issues
|
4 |
+
from src.predict.pipeline import PredictionPipeline
|
5 |
+
from src.predict.models import (
|
6 |
EloBaselineModel,
|
7 |
LogisticRegressionModel,
|
8 |
XGBoostModel,
|
|
|
25 |
choices=['detailed', 'summary'],
|
26 |
help="Type of report to generate: 'detailed' (file) or 'summary' (console)."
|
27 |
)
|
28 |
+
parser.add_argument(
|
29 |
+
'--use-existing-models',
|
30 |
+
action='store_true',
|
31 |
+
default=True,
|
32 |
+
help="Use existing saved models if available and no new data (default: True)."
|
33 |
+
)
|
34 |
+
parser.add_argument(
|
35 |
+
'--no-use-existing-models',
|
36 |
+
action='store_true',
|
37 |
+
default=False,
|
38 |
+
help="Force retrain all models from scratch, ignoring existing saved models."
|
39 |
+
)
|
40 |
+
parser.add_argument(
|
41 |
+
'--force-retrain',
|
42 |
+
action='store_true',
|
43 |
+
default=False,
|
44 |
+
help="Force retrain all models even if no new data is available."
|
45 |
+
)
|
46 |
args = parser.parse_args()
|
47 |
|
48 |
+
# Handle conflicting arguments
|
49 |
+
use_existing_models = not args.no_use_existing_models and args.use_existing_models
|
50 |
+
force_retrain = args.force_retrain
|
51 |
+
|
52 |
+
if args.no_use_existing_models:
|
53 |
+
print("No-use-existing-models flag set: All models will be retrained from scratch.")
|
54 |
+
elif force_retrain:
|
55 |
+
print("Force-retrain flag set: All models will be retrained regardless of new data.")
|
56 |
+
elif use_existing_models:
|
57 |
+
print("Using existing models if available and no new data detected.")
|
58 |
+
|
59 |
# --- Define Models to Run ---
|
60 |
# Instantiate all the models you want to evaluate here.
|
61 |
models_to_run = [
|
|
|
69 |
]
|
70 |
# --- End of Model Definition ---
|
71 |
|
72 |
+
pipeline = PredictionPipeline(
|
73 |
+
models=models_to_run,
|
74 |
+
use_existing_models=use_existing_models,
|
75 |
+
force_retrain=force_retrain
|
76 |
+
)
|
77 |
|
78 |
try:
|
79 |
pipeline.run(detailed_report=(args.report == 'detailed'))
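
In the flags added above, `--use-existing-models` combines `action='store_true'` with `default=True`, so passing it never changes the parsed value; only `--no-use-existing-models` has an effect. A possible alternative, shown here purely as a sketch and assuming Python 3.9 or newer, is `argparse.BooleanOptionalAction`, which generates both the positive and the `--no-` form from one argument. The `build_parser` and `resolve` helpers are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with a paired on/off flag (sketch, requires Python 3.9+)."""
    parser = argparse.ArgumentParser(description="Model management flags (sketch)")
    parser.add_argument(
        "--use-existing-models",
        action=argparse.BooleanOptionalAction,  # also accepts --no-use-existing-models
        default=True,
        help="Reuse saved models when possible.",
    )
    parser.add_argument(
        "--force-retrain",
        action="store_true",
        help="Retrain even if no new data is available.",
    )
    return parser


def resolve(argv):
    """Return (use_existing_models, force_retrain) for a given argument list."""
    args = build_parser().parse_args(argv)
    return args.use_existing_models, args.force_retrain


default_flags = resolve([])
disabled_flags = resolve(["--no-use-existing-models"])
forced_flags = resolve(["--force-retrain"])
```

With a single paired flag, the "handle conflicting arguments" step reduces to reading one boolean.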
|
src/predict/models.py
CHANGED
@@ -1,7 +1,6 @@
|
|
1 |
from abc import ABC, abstractmethod
|
2 |
import sys
|
3 |
import os
|
4 |
-
from ..analysis.elo import process_fights_for_elo, INITIAL_ELO
|
5 |
import pandas as pd
|
6 |
from sklearn.linear_model import LogisticRegression
|
7 |
from sklearn.svm import SVC
|
@@ -9,8 +8,17 @@ from sklearn.naive_bayes import BernoulliNB
|
|
9 |
from sklearn.ensemble import RandomForestClassifier
|
10 |
from xgboost import XGBClassifier
|
11 |
from lightgbm import LGBMClassifier
|
12 |
-
|
13 |
-
|
14 |
|
15 |
class BaseModel(ABC):
|
16 |
"""
|
|
|
1 |
from abc import ABC, abstractmethod
|
2 |
import sys
|
3 |
import os
|
|
|
4 |
import pandas as pd
|
5 |
from sklearn.linear_model import LogisticRegression
|
6 |
from sklearn.svm import SVC
|
|
|
8 |
from sklearn.ensemble import RandomForestClassifier
|
9 |
from xgboost import XGBClassifier
|
10 |
from lightgbm import LGBMClassifier
|
11 |
+
|
12 |
+
# Use absolute imports to avoid relative import issues
|
13 |
+
try:
|
14 |
+
from src.analysis.elo import process_fights_for_elo, INITIAL_ELO
|
15 |
+
from src.config import FIGHTERS_CSV_PATH
|
16 |
+
from src.predict.preprocess import preprocess_for_ml, _get_fighter_history_stats, _calculate_age
|
17 |
+
except ImportError:
|
18 |
+
# Fallback for when running directly
|
19 |
+
from ..analysis.elo import process_fights_for_elo, INITIAL_ELO
|
20 |
+
from ..config import FIGHTERS_CSV_PATH
|
21 |
+
from .preprocess import preprocess_for_ml, _get_fighter_history_stats, _calculate_age
|
22 |
|
23 |
class BaseModel(ABC):
|
24 |
"""
|
src/predict/pipeline.py
CHANGED
@@ -6,22 +6,139 @@ from collections import OrderedDict
|
|
6 |
import json
|
7 |
import joblib
|
8 |
|
9 |
-
|
10 |
from .models import BaseModel
|
11 |
|
12 |
class PredictionPipeline:
|
13 |
"""
|
14 |
Orchestrates the model training, evaluation, and reporting pipeline.
|
15 |
"""
|
16 |
-
def __init__(self, models):
|
17 |
if not all(isinstance(m, BaseModel) for m in models):
|
18 |
raise TypeError("All models must be instances of BaseModel.")
|
19 |
self.models = models
|
20 |
self.train_fights = []
|
21 |
self.test_fights = []
|
22 |
self.results = {}
|
23 |
|
24 |
-
def
|
25 |
"""Loads and splits the data into chronological training and testing sets."""
|
26 |
print("\n--- Loading and Splitting Data ---")
|
27 |
if not os.path.exists(FIGHTS_CSV_PATH):
|
@@ -41,7 +158,7 @@ class PredictionPipeline:
|
|
41 |
self.train_fights = [f for f in fights if f['event_name'] not in test_event_names]
|
42 |
self.test_fights = [f for f in fights if f['event_name'] in test_event_names]
|
43 |
print(f"Data loaded. {len(self.train_fights)} training fights, {len(self.test_fights)} testing fights.")
|
44 |
-
print(f"Testing on the last {num_test_events}
|
45 |
|
46 |
def run(self, detailed_report=True):
|
47 |
"""Executes the full pipeline: load, train, evaluate, report and save models."""
|
@@ -52,10 +169,24 @@ class PredictionPipeline:
|
|
52 |
print("No fights with definitive outcomes in the test set. Aborting.")
|
53 |
return
|
54 |
|
55 |
-
|
|
|
|
|
56 |
model_name = model.__class__.__name__
|
57 |
print(f"\n--- Evaluating Model: {model_name} ---")
|
58 |
|
59 |
model.train(self.train_fights)
|
60 |
|
61 |
correct_predictions = 0
|
@@ -84,10 +215,12 @@ class PredictionPipeline:
|
|
84 |
})
|
85 |
|
86 |
accuracy = (correct_predictions / len(eval_fights)) * 100
|
|
|
87 |
self.results[model_name] = {
|
88 |
'accuracy': accuracy,
|
89 |
'predictions': predictions,
|
90 |
-
'total_fights': len(eval_fights)
|
|
|
91 |
}
|
92 |
|
93 |
if detailed_report:
|
@@ -95,7 +228,9 @@ class PredictionPipeline:
|
|
95 |
else:
|
96 |
self._report_summary()
|
97 |
|
98 |
-
|
|
|
|
|
99 |
|
100 |
def _train_and_save_models(self):
|
101 |
"""Trains all models on the full dataset and saves them."""
|
@@ -114,6 +249,13 @@ class PredictionPipeline:
|
|
114 |
os.makedirs(MODELS_DIR)
|
115 |
print(f"Created directory: {MODELS_DIR}")
|
116 |
|
117 |
for model in self.models:
|
118 |
model_name = model.__class__.__name__
|
119 |
print(f"\n--- Training: {model_name} ---")
|
@@ -125,14 +267,20 @@ class PredictionPipeline:
|
|
125 |
joblib.dump(model, save_path)
|
126 |
print(f"Model saved successfully to {save_path}")
|
127 |
|
128 |
def _report_summary(self):
|
129 |
"""Prints a concise summary of model performance."""
|
130 |
print("\n\n--- Prediction Pipeline Summary ---")
|
131 |
-
print(f"{'Model':<25} | {'Accuracy':<10} | {'Fights Evaluated':<20}")
|
132 |
-
print("-" *
|
133 |
for model_name, result in self.results.items():
|
134 |
-
|
135 |
-
|
|
|
136 |
|
137 |
def _save_report_to_json(self, file_path=MODEL_RESULTS_PATH):
|
138 |
"""Saves the detailed prediction results to a JSON file."""
|
@@ -153,6 +301,7 @@ class PredictionPipeline:
|
|
153 |
report[model_name] = {
|
154 |
"overall_accuracy": f"{result['accuracy']:.2f}%",
|
155 |
"total_fights_evaluated": result['total_fights'],
|
|
|
156 |
"predictions_by_event": predictions_by_event
|
157 |
}
|
158 |
|
|
|
6 |
import json
|
7 |
import joblib
|
8 |
|
9 |
+
# Use absolute imports to avoid relative import issues
|
10 |
+
try:
|
11 |
+
from src.config import FIGHTS_CSV_PATH, MODEL_RESULTS_PATH, MODELS_DIR, LAST_EVENT_JSON_PATH
|
12 |
+
except ImportError:
|
13 |
+
# Fallback for when running directly
|
14 |
+
from ..config import FIGHTS_CSV_PATH, MODEL_RESULTS_PATH, MODELS_DIR, LAST_EVENT_JSON_PATH
|
15 |
+
|
16 |
from .models import BaseModel
|
17 |
|
18 |
class PredictionPipeline:
|
19 |
"""
|
20 |
Orchestrates the model training, evaluation, and reporting pipeline.
|
21 |
"""
|
22 |
+
def __init__(self, models, use_existing_models=True, force_retrain=False):
|
23 |
if not all(isinstance(m, BaseModel) for m in models):
|
24 |
raise TypeError("All models must be instances of BaseModel.")
|
25 |
self.models = models
|
26 |
self.train_fights = []
|
27 |
self.test_fights = []
|
28 |
self.results = {}
|
29 |
+
self.use_existing_models = use_existing_models
|
30 |
+
self.force_retrain = force_retrain
|
31 |
+
|
32 |
+
def _get_last_trained_event(self):
|
33 |
+
"""Get the last event that models were trained on."""
|
34 |
+
if not os.path.exists(LAST_EVENT_JSON_PATH):
|
35 |
+
return None, None  # callers unpack a (name, date) tuple
|
36 |
+
try:
|
37 |
+
with open(LAST_EVENT_JSON_PATH, 'r', encoding='utf-8') as f:
|
38 |
+
last_event_data = json.load(f)
|
39 |
+
if isinstance(last_event_data, list) and len(last_event_data) > 0:
|
40 |
+
return last_event_data[0].get('name'), last_event_data[0].get('date')
|
41 |
+
return None, None
|
42 |
+
except (json.JSONDecodeError, FileNotFoundError):
|
43 |
+
return None, None
|
44 |
+
|
45 |
+
def _save_last_trained_event(self, event_name, event_date):
|
46 |
+
"""Save the last event that models were trained on."""
|
47 |
+
last_event_data = [{
|
48 |
+
"name": event_name,
|
49 |
+
"date": event_date,
|
50 |
+
"training_timestamp": datetime.now().isoformat()
|
51 |
+
}]
|
52 |
+
try:
|
53 |
+
with open(LAST_EVENT_JSON_PATH, 'w', encoding='utf-8') as f:
|
54 |
+
json.dump(last_event_data, f, indent=4)
|
55 |
+
except Exception as e:
|
56 |
+
print(f"Warning: Could not save last trained event: {e}")
|
57 |
+
|
58 |
+
def _has_new_data_since_last_training(self):
|
59 |
+
"""Check if there's new fight data since the last training."""
|
60 |
+
last_event_name, last_event_date = self._get_last_trained_event()
|
61 |
+
if not last_event_name or not last_event_date:
|
62 |
+
return True # No previous training record, consider as new data
|
63 |
+
|
64 |
+
if not os.path.exists(FIGHTS_CSV_PATH):
|
65 |
+
return False
|
66 |
+
|
67 |
+
with open(FIGHTS_CSV_PATH, 'r', encoding='utf-8') as f:
|
68 |
+
fights = list(csv.DictReader(f))
|
69 |
+
|
70 |
+
if not fights:
|
71 |
+
return False
|
72 |
+
|
73 |
+
# Sort fights by date to get the latest event
|
74 |
+
fights.sort(key=lambda x: datetime.strptime(x['event_date'], '%B %d, %Y'))
|
75 |
+
latest_fight = fights[-1]
|
76 |
+
latest_event_name = latest_fight['event_name']
|
77 |
+
latest_event_date = latest_fight['event_date']
|
78 |
+
|
79 |
+
# Check if we have new events since last training
|
80 |
+
if latest_event_name != last_event_name:
|
81 |
+
print(f"New data detected: Latest event '{latest_event_name}' differs from last trained event '{last_event_name}'")
|
82 |
+
return True
|
83 |
+
|
84 |
+
return False
|
85 |
|
86 |
+
def _model_exists(self, model):
|
87 |
+
"""Check if a saved model file exists and can be loaded successfully."""
|
88 |
+
model_name = model.__class__.__name__
|
89 |
+
file_name = f"{model_name}.joblib"
|
90 |
+
save_path = os.path.join(MODELS_DIR, file_name)
|
91 |
+
|
92 |
+
if not os.path.exists(save_path):
|
93 |
+
return False
|
94 |
+
|
95 |
+
# Verify the model can actually be loaded
|
96 |
+
try:
|
97 |
+
joblib.load(save_path)
|
98 |
+
return True
|
99 |
+
except Exception as e:
|
100 |
+
print(f"Warning: Model file {file_name} exists but cannot be loaded ({e}). Will retrain.")
|
101 |
+
return False
|
102 |
+
|
103 |
+
def _load_existing_model(self, model_class):
|
104 |
+
"""Load an existing model from disk."""
|
105 |
+
model_name = model_class.__name__
|
106 |
+
file_name = f"{model_name}.joblib"
|
107 |
+
load_path = os.path.join(MODELS_DIR, file_name)
|
108 |
+
|
109 |
+
try:
|
110 |
+
loaded_model = joblib.load(load_path)
|
111 |
+
print(f"Loaded existing model: {model_name}")
|
112 |
+
return loaded_model
|
113 |
+
except Exception as e:
|
114 |
+
print(f"Error loading model {model_name}: {e}")
|
115 |
+
return None
|
116 |
+
|
117 |
+
def _should_retrain_models(self):
|
118 |
+
"""Determine if models should be retrained."""
|
119 |
+
if self.force_retrain:
|
120 |
+
print("Force retrain flag is set. Retraining all models.")
|
121 |
+
return True
|
122 |
+
|
123 |
+
if not self.use_existing_models:
|
124 |
+
print("Use existing models flag is disabled. Retraining all models.")
|
125 |
+
return True
|
126 |
+
|
127 |
+
# Check if any model files are missing
|
128 |
+
missing_models = [m for m in self.models if not self._model_exists(m)]
|
129 |
+
if missing_models:
|
130 |
+
missing_names = [m.__class__.__name__ for m in missing_models]
|
131 |
+
print(f"Missing model files for: {missing_names}. Retraining all models.")
|
132 |
+
return True
|
133 |
+
|
134 |
+
# Check if there's new data since last training
|
135 |
+
if self._has_new_data_since_last_training():
|
136 |
+
return True
|
137 |
+
|
138 |
+
print("No new data detected and all model files exist. Using existing models.")
|
139 |
+
return False
|
140 |
+
|
141 |
+
def _load_and_split_data(self, num_test_events=1):
|
142 |
"""Loads and splits the data into chronological training and testing sets."""
|
143 |
print("\n--- Loading and Splitting Data ---")
|
144 |
if not os.path.exists(FIGHTS_CSV_PATH):
|
|
|
158 |
self.train_fights = [f for f in fights if f['event_name'] not in test_event_names]
|
159 |
self.test_fights = [f for f in fights if f['event_name'] in test_event_names]
|
160 |
print(f"Data loaded. {len(self.train_fights)} training fights, {len(self.test_fights)} testing fights.")
|
161 |
+
print(f"Testing on the last {num_test_events} event(s): {', '.join(test_event_names)}")
|
162 |
|
163 |
def run(self, detailed_report=True):
|
164 |
"""Executes the full pipeline: load, train, evaluate, report and save models."""
|
|
|
169 |
print("No fights with definitive outcomes in the test set. Aborting.")
|
170 |
return
|
171 |
|
172 |
+
should_retrain = self._should_retrain_models()
|
173 |
+
|
174 |
+
for i, model in enumerate(self.models):
|
175 |
model_name = model.__class__.__name__
|
176 |
print(f"\n--- Evaluating Model: {model_name} ---")
|
177 |
|
178 |
+
if should_retrain:
|
179 |
+
print(f"Training {model_name}...")
|
180 |
+
model.train(self.train_fights)
|
181 |
+
else:
|
182 |
+
# Try to load existing model, fall back to training if loading fails
|
183 |
+
loaded_model = self._load_existing_model(model.__class__)
|
184 |
+
if loaded_model is not None:
|
185 |
+
# Replace the model instance with the loaded one
|
186 |
+
self.models[i] = loaded_model
|
187 |
+
model = loaded_model
|
188 |
+
else:
|
189 |
+
print(f"Failed to load {model_name}, training new model...")
|
190 |
model.train(self.train_fights)
|
191 |
|
192 |
correct_predictions = 0
|
|
|
215 |
})
|
216 |
|
217 |
accuracy = (correct_predictions / len(eval_fights)) * 100
|
218 |
+
model_status = "retrained" if should_retrain else "loaded from disk"
|
219 |
self.results[model_name] = {
|
220 |
'accuracy': accuracy,
|
221 |
'predictions': predictions,
|
222 |
+
'total_fights': len(eval_fights),
|
223 |
+
'model_status': model_status
|
224 |
}
|
225 |
|
226 |
if detailed_report:
|
|
|
228 |
else:
|
229 |
self._report_summary()
|
230 |
|
231 |
+
# Only train and save models if retraining was performed
|
232 |
+
if should_retrain:
|
233 |
+
self._train_and_save_models()
|
234 |
|
235 |
def _train_and_save_models(self):
|
236 |
"""Trains all models on the full dataset and saves them."""
|
|
|
249 |
os.makedirs(MODELS_DIR)
|
250 |
print(f"Created directory: {MODELS_DIR}")
|
251 |
|
252 |
+
# Get the latest event info for tracking
|
253 |
+
if all_fights:
|
254 |
+
all_fights.sort(key=lambda x: datetime.strptime(x['event_date'], '%B %d, %Y'))
|
255 |
+
latest_fight = all_fights[-1]
|
256 |
+
latest_event_name = latest_fight['event_name']
|
257 |
+
latest_event_date = latest_fight['event_date']
|
258 |
+
|
259 |
for model in self.models:
|
260 |
model_name = model.__class__.__name__
|
261 |
print(f"\n--- Training: {model_name} ---")
|
|
|
267 |
joblib.dump(model, save_path)
|
268 |
print(f"Model saved successfully to {save_path}")
|
269 |
|
270 |
+
# Save the last trained event info
|
271 |
+
if all_fights:
|
272 |
+
self._save_last_trained_event(latest_event_name, latest_event_date)
|
273 |
+
print(f"Updated last trained event: {latest_event_name} ({latest_event_date})")
|
274 |
+
|
275 |
def _report_summary(self):
|
276 |
"""Prints a concise summary of model performance."""
|
277 |
print("\n\n--- Prediction Pipeline Summary ---")
|
278 |
+
print(f"{'Model':<25} | {'Accuracy':<10} | {'Fights Evaluated':<20} | {'Status':<15}")
|
279 |
+
print("-" * 80)
|
280 |
for model_name, result in self.results.items():
|
281 |
+
status = result.get('model_status', 'unknown')
|
282 |
+
print(f"{model_name:<25} | {result['accuracy']:<9.2f}% | {result['total_fights']:<20} | {status:<15}")
|
283 |
+
print("-" * 80)
|
284 |
|
285 |
def _save_report_to_json(self, file_path=MODEL_RESULTS_PATH):
|
286 |
"""Saves the detailed prediction results to a JSON file."""
|
|
|
301 |
report[model_name] = {
|
302 |
"overall_accuracy": f"{result['accuracy']:.2f}%",
|
303 |
"total_fights_evaluated": result['total_fights'],
|
304 |
+
"model_status": result.get('model_status', 'unknown'),
|
305 |
"predictions_by_event": predictions_by_event
|
306 |
}
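
The retraining decision spread across `_should_retrain_models` and `_has_new_data_since_last_training` can be condensed into one pure function for illustration. This is a simplified sketch rather than the pipeline's code: `should_retrain`, `latest_event`, and their parameters are hypothetical names, file and model checks are replaced by plain arguments, and the sample rows are made up. The ordering of the checks and the `%B %d, %Y` date format follow the diff above.

```python
from datetime import datetime

DATE_FORMAT = "%B %d, %Y"  # date format used for event_date in the pipeline code


def latest_event(fights):
    """Return (event_name, event_date) of the most recent fight."""
    newest = max(fights, key=lambda f: datetime.strptime(f["event_date"], DATE_FORMAT))
    return newest["event_name"], newest["event_date"]


def should_retrain(force_retrain, use_existing_models, missing_models,
                   last_trained_event, fights):
    """Mirror the order of checks in the diff: flags, missing files, new data."""
    if force_retrain or not use_existing_models:
        return True
    if missing_models:
        return True
    if last_trained_event is None:
        return True  # no training record yet, treat everything as new data
    if not fights:
        return False
    return latest_event(fights)[0] != last_trained_event


# Made-up sample rows, for illustration only.
fights = [
    {"event_name": "Event A", "event_date": "January 06, 2024"},
    {"event_name": "Event B", "event_date": "February 10, 2024"},
]
up_to_date = should_retrain(False, True, [], "Event B", fights)
stale = should_retrain(False, True, [], "Event A", fights)
forced = should_retrain(True, True, [], "Event B", fights)
```

Note that, as in the diff, new data is detected by comparing event names only; the stored date is recorded but not compared.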
|
307 |
|
src/predict/predict_new.py
CHANGED
@@ -3,7 +3,12 @@ import os
|
|
3 |
import joblib
|
4 |
from datetime import datetime
|
5 |
|
6 |
-
|
7 |
|
8 |
def predict_new_fight(fighter1_name, fighter2_name, model_path):
|
9 |
"""
|
|
|
3 |
import joblib
|
4 |
from datetime import datetime
|
5 |
|
6 |
+
# Use absolute imports to avoid relative import issues
|
7 |
+
try:
|
8 |
+
from src.config import MODELS_DIR
|
9 |
+
except ImportError:
|
10 |
+
# Fallback for when running directly
|
11 |
+
from ..config import MODELS_DIR
|
12 |
|
13 |
def predict_new_fight(fighter1_name, fighter2_name, model_path):
|
14 |
"""
|
src/predict/preprocess.py
CHANGED
@@ -2,6 +2,12 @@ import pandas as pd
|
|
2 |
import os
|
3 |
import sys
|
4 |
from datetime import datetime
|
5 |
from ..config import FIGHTERS_CSV_PATH
|
6 |
|
7 |
def _clean_numeric_column(series):
|
@@ -232,6 +238,11 @@ def preprocess_for_ml(fights_to_process, fighters_csv_path):
|
|
232 |
return X, y, metadata
|
233 |
|
234 |
if __name__ == '__main__':
|
235 |
from .pipeline import PredictionPipeline
|
236 |
|
237 |
print("--- Running Preprocessing Example ---")
|
|
|
2 |
import os
|
3 |
import sys
|
4 |
from datetime import datetime
|
5 |
+
|
6 |
+
# Use absolute imports to avoid relative import issues
|
7 |
+
try:
|
8 |
+
from src.config import FIGHTERS_CSV_PATH
|
9 |
+
except ImportError:
|
10 |
+
# Fallback for when running directly
|
11 |
from ..config import FIGHTERS_CSV_PATH
|
12 |
|
13 |
def _clean_numeric_column(series):
|
|
|
238 |
return X, y, metadata
|
239 |
|
240 |
if __name__ == '__main__':
|
241 |
+
# Use absolute imports to avoid relative import issues
|
242 |
+
try:
|
243 |
+
from src.predict.pipeline import PredictionPipeline
|
244 |
+
except ImportError:
|
245 |
+
# Fallback for when running directly
|
246 |
from .pipeline import PredictionPipeline
|
247 |
|
248 |
print("--- Running Preprocessing Example ---")
|
src/scrape/main.py
CHANGED
@@ -1,6 +1,8 @@
|
|
1 |
import os
|
2 |
import json
|
3 |
-
|
|
|
|
|
4 |
from .scrape_fighters import scrape_all_fighters
|
5 |
from .to_csv import json_to_csv, fighters_json_to_csv
|
6 |
from .preprocess import preprocess_fighters_csv
|
@@ -8,17 +10,46 @@ from .. import config
|
|
8 |
|
9 |
def main():
|
10 |
"""
|
11 |
-
Main function to run the
|
|
|
12 |
"""
|
13 |
# Ensure the output directory exists
|
14 |
if not os.path.exists(config.OUTPUT_DIR):
|
15 |
os.makedirs(config.OUTPUT_DIR)
|
16 |
print(f"Created directory: {config.OUTPUT_DIR}")
|
17 |
|
18 |
# --- Step 1: Scrape all data from the website ---
|
19 |
# This will generate fighters.json and events.json
|
20 |
-
scrape_all_fighters()
|
21 |
-
scrape_all_events()
|
22 |
|
23 |
# --- Step 2: Convert the scraped JSON data to CSV format ---
|
24 |
# This will generate fighters.csv and fights.csv
|
@@ -42,7 +73,133 @@ def main():
|
|
42 |
except OSError as e:
|
43 |
print(f"Error deleting JSON files: {e}")
|
44 |
|
45 |
-
print("\n\n--- Scraping and Preprocessing Pipeline Finished ---")
|
46 |
|
47 |
if __name__ == '__main__':
|
48 |
main()
|
|
|
1 |
import os
|
2 |
import json
|
3 |
+
import argparse
|
4 |
+
import pandas as pd
|
5 |
+
from .scrape_fights import scrape_all_events, scrape_latest_events
|
6 |
from .scrape_fighters import scrape_all_fighters
|
7 |
from .to_csv import json_to_csv, fighters_json_to_csv
|
8 |
from .preprocess import preprocess_fighters_csv
|
|
|
10 |
|
11 |
def main():
|
12 |
"""
|
13 |
+
Main function to run the scraping and preprocessing pipeline.
|
14 |
+
Supports both full scraping and incremental updates.
|
15 |
"""
|
16 |
+
parser = argparse.ArgumentParser(description="UFC Data Scraping Pipeline")
|
17 |
+
parser.add_argument(
|
18 |
+
'--mode',
|
19 |
+
type=str,
|
20 |
+
default='full',
|
21 |
+
choices=['full', 'update'],
|
22 |
+
help="Scraping mode: 'full' (complete scraping) or 'update' (latest events + sync from last_event.json)"
|
23 |
+
)
|
24 |
+
parser.add_argument(
|
25 |
+
'--num-events',
|
26 |
+
type=int,
|
27 |
+
default=5,
|
28 |
+
help="Number of latest events to scrape in update mode (default: 5)"
|
29 |
+
)
|
30 |
+
|
31 |
+
args = parser.parse_args()
|
32 |
+
|
33 |
# Ensure the output directory exists
|
34 |
if not os.path.exists(config.OUTPUT_DIR):
|
35 |
os.makedirs(config.OUTPUT_DIR)
|
36 |
print(f"Created directory: {config.OUTPUT_DIR}")
|
37 |
|
38 |
+
if args.mode == 'full':
|
39 |
+
run_full_pipeline()
|
40 |
+
elif args.mode == 'update':
|
41 |
+
run_update_pipeline(args.num_events)
|
42 |
+
|
43 |
+
def run_full_pipeline():
|
44 |
+
"""
|
45 |
+
Runs the complete scraping and preprocessing pipeline.
|
46 |
+
"""
|
47 |
+
print("\n=== Running FULL scraping pipeline ===")
|
48 |
+
|
49 |
# --- Step 1: Scrape all data from the website ---
|
50 |
# This will generate fighters.json and events.json
|
51 |
+
scrape_all_fighters(config.FIGHTERS_JSON_PATH)
|
52 |
+
scrape_all_events(config.EVENTS_JSON_PATH)
|
53 |
|
54 |
# --- Step 2: Convert the scraped JSON data to CSV format ---
|
55 |
# This will generate fighters.csv and fights.csv
|
|
|
73 |
except OSError as e:
|
74 |
print(f"Error deleting JSON files: {e}")
|
75 |
|
76 |
+
print("\n\n--- Full Scraping and Preprocessing Pipeline Finished ---")
|
77 |
+
|
78 |
+
def run_update_pipeline(num_events=5):
|
79 |
+
"""
|
80 |
+
Runs the incremental update pipeline to scrape only the latest events.
|
81 |
+
Also adds any events from last_event.json that aren't already in the CSV.
|
82 |
+
|
83 |
+
Args:
|
84 |
+
num_events (int): Number of latest events to scrape
|
85 |
+
"""
|
86 |
+
print(f"\n=== Running UPDATE pipeline for latest {num_events} events ===")
|
87 |
+
|
88 |
+
# --- Step 1: Scrape latest events only ---
|
89 |
+
latest_events = scrape_latest_events(config.LAST_EVENT_JSON_PATH, num_events)
|
90 |
+
|
91 |
+
# --- Step 2: Save latest events to last_event.json (skipped when nothing was scraped) ---
|
92 |
+
if latest_events:
|
93 |
+
with open(config.LAST_EVENT_JSON_PATH, 'w') as f:
|
94 |
+
json.dump(latest_events, f, indent=4)
|
95 |
+
print(f"Latest {len(latest_events)} events saved to {config.LAST_EVENT_JSON_PATH}")
|
96 |
+
|
97 |
+
# --- Step 3: Always check and update from last_event.json ---
|
98 |
+
update_fights_csv_from_last_event()
|
99 |
+
|
100 |
+
print("\n--- Update Pipeline Finished ---")
|
101 |
+
|
102 |
+
def update_fights_csv_from_last_event():
|
103 |
+
"""
|
104 |
+
Updates the existing fights CSV with any events from last_event.json that aren't already present.
|
105 |
+
Ensures latest events are on top and preserves data types.
|
106 |
+
"""
|
107 |
+
# Check if last_event.json exists
|
108 |
+
if not os.path.exists(config.LAST_EVENT_JSON_PATH):
|
109 |
+
print(f"No {config.LAST_EVENT_JSON_PATH} found. Nothing to update.")
|
110 |
+
return
|
111 |
+
|
112 |
+
# Load events from last_event.json
|
113 |
+
try:
|
114 |
+
with open(config.LAST_EVENT_JSON_PATH, 'r') as f:
|
115 |
+
events_from_json = json.load(f)
|
116 |
+
|
117 |
+
if not events_from_json:
|
118 |
+
print("No events found in last_event.json.")
|
119 |
+
return
|
120 |
+
|
121 |
+
print(f"Found {len(events_from_json)} events in last_event.json")
|
122 |
+
|
123 |
+
except Exception as e:
|
124 |
+
print(f"Error reading last_event.json: {e}")
|
125 |
+
return
|
126 |
+
|
127 |
+
try:
|
128 |
+
# Check if main CSV exists
|
129 |
+
if os.path.exists(config.FIGHTS_CSV_PATH):
|
130 |
+
existing_df = pd.read_csv(config.FIGHTS_CSV_PATH)
|
131 |
+
existing_event_names = set(existing_df['event_name'].unique())
|
132 |
+
else:
|
133 |
+
print(f"Main fights CSV ({config.FIGHTS_CSV_PATH}) not found. Creating new CSV from last_event.json.")
|
134 |
+
json_to_csv(config.LAST_EVENT_JSON_PATH, config.FIGHTS_CSV_PATH)
|
135 |
+
return
|
136 |
+
|
137 |
+
# Create temporary CSV from events in last_event.json
|
138 |
+
temp_json_path = os.path.join(config.OUTPUT_DIR, 'temp_latest.json')
|
139 |
+
temp_csv_path = os.path.join(config.OUTPUT_DIR, 'temp_latest.csv')
|
140 |
+
|
141 |
+
with open(temp_json_path, 'w') as f:
|
142 |
+
json.dump(events_from_json, f, indent=4)
|
143 |
+
|
144 |
+
json_to_csv(temp_json_path, temp_csv_path)
|
145 |
+
|
146 |
+
# Read the new CSV
|
147 |
+
new_df = pd.read_csv(temp_csv_path)
|
148 |
+
|
149 |
+
# Filter out events that already exist
|
150 |
+
new_events_df = new_df[~new_df['event_name'].isin(existing_event_names)]
|
151 |
+
|
152 |
+
if len(new_events_df) > 0:
|
153 |
+
# Add new events to the TOP of the CSV (latest first)
|
154 |
+
combined_df = pd.concat([new_events_df, existing_df], ignore_index=True)
|
155 |
+
|
156 |
+
# Convert date column to datetime for proper sorting
|
157 |
+
combined_df['event_date_parsed'] = pd.to_datetime(combined_df['event_date'])
|
158 |
+
|
159 |
+
# Sort by date descending (latest first)
|
160 |
+
combined_df = combined_df.sort_values('event_date_parsed', ascending=False)
|
161 |
+
|
162 |
+
# Drop the temporary date column
|
163 |
+
combined_df = combined_df.drop('event_date_parsed', axis=1)
|
164 |
+
|
165 |
+
# Fix data types to remove .0 from numbers
|
166 |
+
fix_data_types(combined_df)
|
167 |
+
|
168 |
+
combined_df.to_csv(config.FIGHTS_CSV_PATH, index=False)
|
169 |
+
print(f"Added {len(new_events_df)} new fights from {new_events_df['event_name'].nunique()} events to the TOP of {config.FIGHTS_CSV_PATH}")
|
170 |
+
else:
|
171 |
+
print("No new events found that aren't already in the existing CSV.")
|
172 |
+
|
173 |
+
# Clean up temporary files
|
174 |
+
if os.path.exists(temp_json_path):
|
175 |
+
os.remove(temp_json_path)
|
176 |
+
if os.path.exists(temp_csv_path):
|
177 |
+
os.remove(temp_csv_path)
|
178 |
+
|
179 |
+
except Exception as e:
|
180 |
+
print(f"Error updating fights CSV: {e}")
|
181 |
+
print("Falling back to creating new CSV from last_event.json only.")
|
182 |
+
json_to_csv(config.LAST_EVENT_JSON_PATH, config.FIGHTS_CSV_PATH)
|
183 |
+
|
184 |
+
def fix_data_types(df):
|
185 |
+
"""
|
186 |
+
Fix data types in the dataframe to remove .0 from numbers and preserve original format.
|
187 |
+
|
188 |
+
Args:
|
189 |
+
df (pandas.DataFrame): DataFrame to fix
|
190 |
+
"""
|
191 |
+
for col in df.columns:
|
192 |
+
if df[col].dtype == 'float64':
|
193 |
+
# Check if the column contains only whole numbers (no actual decimals)
|
194 |
+
if df[col].notna().all() and (df[col] % 1 == 0).all():
|
195 |
+
df[col] = df[col].astype('int64')
|
196 |
+
elif df[col].isna().any():
|
197 |
+
# Handle columns with missing values - keep as string to avoid .0
|
198 |
+
df[col] = df[col].fillna('').astype(str)
|
199 |
+
# Remove .0 from string representations
|
200 |
+
df[col] = df[col].str.replace(r'\.0$', '', regex=True)
|
201 |
+
# Missing values stay as empty strings; the replace below is a no-op kept for clarity
|
202 |
+
df[col] = df[col].replace('', '')
|
203 |
|
204 |
if __name__ == '__main__':
|
205 |
main()
|
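The merge step in `update_fights_csv_from_last_event` (drop fights from events already in the CSV, prepend the rest, sort newest first) can be illustrated without pandas. The following is a minimal sketch, not the commit's implementation: `merge_new_events` is a hypothetical helper, rows are modeled as plain dicts, and ISO `YYYY-MM-DD` dates are an assumption, since the actual `event_date` format in `ufc_fights.csv` is not shown in this diff.

```python
from datetime import date


def merge_new_events(existing_rows, new_rows):
    """Return (merged_rows, added_count) with unseen events merged in, newest first.

    Hypothetical helper: rows are dicts with 'event_name' and an
    ISO-formatted 'event_date' (an assumption made for this sketch).
    """
    known_events = {row["event_name"] for row in existing_rows}
    # Keep only fights whose event is not already present.
    fresh = [row for row in new_rows if row["event_name"] not in known_events]
    combined = fresh + existing_rows
    # list.sort is stable, so fights within one event keep their relative order.
    combined.sort(key=lambda row: date.fromisoformat(row["event_date"]), reverse=True)
    return combined, len(fresh)


existing = [
    {"event_name": "Event B", "event_date": "2024-02-10", "fight": "B1"},
    {"event_name": "Event A", "event_date": "2024-01-13", "fight": "A1"},
]
scraped = [
    {"event_name": "Event C", "event_date": "2024-03-09", "fight": "C1"},
    {"event_name": "Event B", "event_date": "2024-02-10", "fight": "B1"},
]
merged, added = merge_new_events(existing, scraped)
```

Matching on `event_name` alone mirrors the commit's filter; if two events could ever share a name, a composite key such as name plus date would be the safer choice.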
src/scrape/scrape_fighters.py
CHANGED
@@ -68,7 +68,7 @@ def process_fighter(fighter_data):
|
|
68 |
time.sleep(REQUEST_DELAY)
|
69 |
return fighter_data
|
70 |
|
71 |
-
def scrape_all_fighters():
|
72 |
"""Scrapes all fighters from a-z pages using parallel processing."""
|
73 |
|
74 |
# Step 1: Sequentially scrape all fighter list pages. This is fast.
|
@@ -129,14 +129,14 @@ def scrape_all_fighters():
|
|
129 |
|
130 |
if (i + 1) > 0 and (i + 1) % 50 == 0:
|
131 |
fighters_with_details.sort(key=lambda x: (x['last_name'], x['first_name']))
|
132 |
-
with open(
|
133 |
json.dump(fighters_with_details, f, indent=4)
|
134 |
|
135 |
fighters_with_details.sort(key=lambda x: (x['last_name'], x['first_name']))
|
136 |
return fighters_with_details
|
137 |
|
138 |
if __name__ == "__main__":
|
139 |
-
all_fighters_data = scrape_all_fighters()
|
140 |
if not os.path.exists(config.OUTPUT_DIR):
|
141 |
os.makedirs(config.OUTPUT_DIR)
|
142 |
|
|
|
68 |
time.sleep(REQUEST_DELAY)
|
69 |
return fighter_data
|
70 |
|
71 |
+
def scrape_all_fighters(json_path):
|
72 |
"""Scrapes all fighters from a-z pages using parallel processing."""
|
73 |
|
74 |
# Step 1: Sequentially scrape all fighter list pages. This is fast.
|
|
|
129 |
|
130 |
if (i + 1) > 0 and (i + 1) % 50 == 0:
|
131 |
fighters_with_details.sort(key=lambda x: (x['last_name'], x['first_name']))
|
132 |
+
with open(json_path, 'w') as f:
|
133 |
json.dump(fighters_with_details, f, indent=4)
|
134 |
|
135 |
fighters_with_details.sort(key=lambda x: (x['last_name'], x['first_name']))
|
136 |
return fighters_with_details
|
137 |
|
138 |
if __name__ == "__main__":
|
139 |
+
all_fighters_data = scrape_all_fighters(config.FIGHTERS_JSON_PATH)
|
140 |
if not os.path.exists(config.OUTPUT_DIR):
|
141 |
os.makedirs(config.OUTPUT_DIR)
|
142 |
|
src/scrape/scrape_fights.py
CHANGED
@@ -3,7 +3,7 @@ from bs4 import BeautifulSoup
|
|
3 |
import json
|
4 |
import time
|
5 |
import concurrent.futures
|
6 |
-
from ..
|
7 |
|
8 |
# --- Configuration ---
|
9 |
# The number of parallel threads to use for scraping fight details.
|
@@ -175,7 +175,7 @@ def scrape_event_details(event_url):
|
|
175 |
event_details['fights'] = completed_fights
|
176 |
return event_details
|
177 |
|
178 |
-
def scrape_all_events():
|
179 |
soup = get_soup(BASE_URL)
|
180 |
events = []
|
181 |
|
@@ -204,15 +204,60 @@ def scrape_all_events():
|
|
204 |
|
205 |
if (i + 1) % 10 == 0:
|
206 |
print(f"--- Saving progress: {i + 1} of {total_events} events saved. ---")
|
207 |
-
with open(
|
208 |
json.dump(events, f, indent=4)
|
209 |
except Exception as e:
|
210 |
print(f"Could not process event {event_url}. Error: {e}")
|
211 |
|
212 |
return events
|
213 |
|
214 |
if __name__ == "__main__":
|
215 |
-
all_events_data = scrape_all_events()
|
216 |
-
with open(EVENTS_JSON_PATH, 'w') as f:
|
217 |
json.dump(all_events_data, f, indent=4)
|
218 |
-
print(f"\nScraping complete. Final data saved to {EVENTS_JSON_PATH}")
|
|
|
3 |
import json
|
4 |
import time
|
5 |
import concurrent.futures
|
6 |
+
from .. import config
|
7 |
|
8 |
# --- Configuration ---
|
9 |
# The number of parallel threads to use for scraping fight details.
|
|
|
175 |
event_details['fights'] = completed_fights
|
176 |
return event_details
|
177 |
|
178 |
+
def scrape_all_events(json_path):
|
179 |
soup = get_soup(BASE_URL)
|
180 |
events = []
|
181 |
|
|
|
204 |
|
205 |
if (i + 1) % 10 == 0:
|
206 |
print(f"--- Saving progress: {i + 1} of {total_events} events saved. ---")
|
207 |
+
with open(json_path, 'w') as f:
|
208 |
json.dump(events, f, indent=4)
|
209 |
except Exception as e:
|
210 |
print(f"Could not process event {event_url}. Error: {e}")
|
211 |
|
212 |
return events
|
213 |
|
214 |
+
def scrape_latest_events(json_path, num_events=5):
|
215 |
+
"""
|
216 |
+
Scrapes only the latest N events from UFC stats.
|
217 |
+
This is useful for incremental updates to avoid re-scraping all data.
|
218 |
+
|
219 |
+
Args:
|
220 |
+
json_path (str): Path of the latest events JSON file (not written by this function; the caller saves the result)
|
221 |
+
num_events (int): Number of latest events to scrape (default: 5)
|
222 |
+
|
223 |
+
Returns:
|
224 |
+
list: List of scraped event data
|
225 |
+
"""
|
226 |
+
soup = get_soup(BASE_URL)
|
227 |
+
events = []
|
228 |
+
|
229 |
+
table = soup.find('table', class_='b-statistics__table-events')
|
230 |
+
if not table:
|
231 |
+
print("Could not find events table on the page.")
|
232 |
+
return []
|
233 |
+
|
234 |
+
event_rows = [row for row in table.find_all('tr', class_='b-statistics__table-row') if row.find('td')]
|
235 |
+
|
236 |
+
# Limit to the latest N events (the table lists events in reverse chronological order, most recent first)
|
237 |
+
latest_event_rows = event_rows[:num_events]
|
238 |
+
total_events = len(latest_event_rows)
|
239 |
+
print(f"Found {len(event_rows)} total events. Scraping latest {total_events} events.")
|
240 |
+
|
241 |
+
for i, row in enumerate(latest_event_rows):
|
242 |
+
event_link_tag = row.find('a', class_='b-link b-link_style_black')
|
243 |
+
if not event_link_tag or not event_link_tag.has_attr('href'):
|
244 |
+
continue
|
245 |
+
|
246 |
+
event_url = event_link_tag['href']
|
247 |
+
|
248 |
+
try:
|
249 |
+
event_data = scrape_event_details(event_url)
|
250 |
+
if event_data:
|
251 |
+
events.append(event_data)
|
252 |
+
|
253 |
+
print(f"Progress: {i+1}/{total_events} latest events scraped.")
|
254 |
+
except Exception as e:
|
255 |
+
print(f"Could not process event {event_url}. Error: {e}")
|
256 |
+
|
257 |
+
return events
|
258 |
+
|
259 |
if __name__ == "__main__":
|
260 |
+
all_events_data = scrape_all_events(config.EVENTS_JSON_PATH)
|
261 |
+
with open(config.EVENTS_JSON_PATH, 'w') as f:
|
262 |
json.dump(all_events_data, f, indent=4)
|
263 |
+
print(f"\nScraping complete. Final data saved to {config.EVENTS_JSON_PATH}")
|
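One detail of `scrape_latest_events` worth noting: it slices the first `num_events` rows and only afterwards skips rows that lack an event link, so it can return fewer events than requested. A minimal sketch of the alternative ordering (filter first, then slice) follows. It assumes rows modeled as plain dicts with an optional `href` key as a stand-in for the BeautifulSoup rows; `pick_latest_event_urls` is a hypothetical name, not part of the commit.

```python
def pick_latest_event_urls(rows, num_events=5):
    """Return up to num_events URLs, skipping rows without a link first.

    Hypothetical helper: each row is a dict that may carry an 'href' key.
    """
    urls = [row["href"] for row in rows if row.get("href")]
    return urls[:num_events]


rows = [{"href": "event-1"}, {}, {"href": "event-3"}, {"href": "event-4"}]

# Filter-then-slice yields three usable URLs.
filtered_first = pick_latest_event_urls(rows, num_events=3)

# Slice-then-filter (the order used in the diff) yields only two here.
sliced_first = [row["href"] for row in rows[:3] if row.get("href")]
```

Whether the difference matters depends on how often the events table contains rows without links; if that never happens in practice, both orderings behave the same.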