---
license: mit
datasets:
- Nnaodeh/Stroke_Prediction_Dataset
language:
- en
pipeline_tag: tabular-classification
---
|
|
|
# Stroke Prediction Model |
|
|
|
This project implements a machine learning pipeline that predicts stroke risk from tabular patient data. Several models are trained and compared in order to select the best-performing one. The sections below explain how each key consideration was implemented.
|
|
|
### Data Set |
|
|
|
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, existing medical conditions (hypertension and heart disease), and smoking status. Each row provides relevant information about one patient.
|
|
|
### Attribute Information |
|
|
|
1. id: unique identifier |
|
2. gender: "Male", "Female" or "Other" |
|
3. age: age of the patient |
|
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension |
|
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease |
|
6. ever_married: "No" or "Yes" |
|
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
|
8. Residence_type: "Rural" or "Urban" |
|
9. avg_glucose_level: average glucose level in blood |
|
10. bmi: body mass index |
|
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" ("Unknown" means the information is unavailable for this patient)
|
12. stroke: 1 if the patient had a stroke or 0 if not |
|
|
|
## Key Considerations Implementation |
|
|
|
## Data Cleaning |
|
|
|
#### Drop id column |
|
|
|
The id column is dropped as it serves as a unique identifier for each row but does not contribute to the predictive power of the model. |
|
|
|
#### Remove missing values |
|
|
|
Rows with a missing 'bmi' value are removed. Because only a small number of rows are affected, dropping them is expected to have little impact on model accuracy.
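
The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration rather than the project's actual code: the toy rows are made up for the example, and only the column names `id`, `bmi`, and `stroke` come from the attribute list.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "age": [67.0, 45.0, 80.0, 33.0],
    "bmi": [36.6, np.nan, 32.5, 24.0],
    "stroke": [1, 0, 1, 0],
})

# Step 1: drop the identifier column, which carries no predictive signal.
df = df.drop(columns=["id"])

# Step 2: drop rows where bmi is missing, then renumber the index.
n_before = len(df)
df = df.dropna(subset=["bmi"]).reset_index(drop=True)
print(f"Removed {n_before - len(df)} row(s) with missing bmi")
```

Counting the removed rows is a cheap way to confirm that the number dropped is as small as expected.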
|
|
|
## Feature Engineering |
|
|
|
#### Binary Encoding |
|
|
|
Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models: |
|
|
|
- ever_married: Encoded as 0 for “No” and 1 for “Yes”. |
|
- Residence_type: Encoded as 0 for “Rural” and 1 for “Urban”. |
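
One possible way to apply this mapping is pandas' `Series.map`. The sketch below assumes the columns contain exactly the values listed above; the toy rows are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Rural"],
})

# Map each two-valued column to 0/1 using an explicit dictionary,
# so the meaning of each code stays visible in the source.
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
df["Residence_type"] = df["Residence_type"].map({"Rural": 0, "Urban": 1})
```

Note that `map` yields `NaN` for any value missing from the dictionary, so an unexpected category shows up as a missing value instead of being silently mis-encoded.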
|
|
|
#### One-Hot Encoding for Multi-Class Categorical Features |
|
|
|
- For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category. |
|
- The onehot_encode function is assumed to handle the transformation, creating additional columns for each category while dropping the original column. |
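
The source does not show the body of `onehot_encode`, so the version below is a hypothetical sketch of what such a helper might look like, built on `pandas.get_dummies`. The signature and the toy rows are assumptions made for this example.

```python
import pandas as pd


def onehot_encode(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: replace `column` with one 0/1 column per category."""
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    # Drop the original column and append the new indicator columns.
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


# Toy rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "age": [67.0, 45.0, 80.0],
    "smoking_status": ["smokes", "never smoked", "Unknown"],
})

encoded = onehot_encode(df, "smoking_status")
print(encoded.columns.tolist())
```

The same call would be repeated for `gender` and `work_type`.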
|
|
|
#### Split Dataset into Features and Target |
|
|
|
- Separate the target variable (stroke) from the features:
  - X: Contains all feature columns used as input for the model.
  - y: Contains the target column, which indicates whether a stroke occurred.
|
|
|
#### Train-Test Split |
|
|
|
- Split the dataset into training and testing sets so that the model is evaluated on data it has not seen during training. This gives a more honest estimate of how well the model generalizes and makes overfitting easier to detect.
|
- The specific split ratio (e.g., 70% train, 30% test) can be customized as needed. |
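
The feature/target separation and the train-test split can be sketched with scikit-learn's `train_test_split`. The 70/30 ratio follows the example above; `stratify=y` and `random_state=42` are assumptions added here (stratification keeps the stroke rate similar in both sets, which matters when positives are rare), and the toy rows are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "age": [67, 45, 80, 33, 59, 71, 25, 62, 48, 77],
    "hypertension": [0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
    "stroke": [1, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

# X holds the input features, y the target column.
X = df.drop(columns=["stroke"])
y = df["stroke"]

# 70% train / 30% test, stratified on the target (assumed choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```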
|
|
|
## Model Selection
|
|
|
The following models are evaluated:
|
|
|
- Logistic Regression |
|
- K-Nearest Neighbors |
|
- Support Vector Machine (Linear Kernel) |
|
- Support Vector Machine (RBF Kernel) |
|
- Neural Network |
|
- Gradient Boosting |
|
|
|
Each model is assessed against the following criteria:
|
|
|
- Handles both numerical and categorical features |
|
- Resistant to overfitting |
|
- Provides feature importance |
|
- Good performance on imbalanced data |
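
A comparison loop over the six candidate models might look like the sketch below. It is not the project's actual training code: the data is a synthetic, imbalanced stand-in generated with `make_classification`, the hyperparameters are scikit-learn defaults apart from `max_iter` and `random_state`, and the F1 score is an assumed metric chosen because accuracy can be misleading on imbalanced data. Feature scaling, which usually helps the KNN, SVM, and neural network models, is omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded stroke features.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine (Linear Kernel)": SVC(kernel="linear"),
    "Support Vector Machine (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the same split and score it on the held-out set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), zero_division=0)

best = max(scores, key=scores.get)
print(f"Best model by F1: {best}")
```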
|
|
|
## Software Engineering Best Practices
|
|
|
#### A. Logging |
|
|
|
Logging is configured with a timestamped message format:
|
|
|
```python
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
```
|
|
|
Logging features: |
|
|
|
- Timestamp for each operation |
|
- Different log levels (INFO, ERROR) |
|
- Operation tracking |
|
- Error capture and reporting |
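
As a sketch of how these features could fit together, the example below wraps a pipeline step so that its start, success, and failure are all logged. The logger name `stroke_pipeline` and the `run_step` helper are hypothetical and are not taken from the project code.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("stroke_pipeline")  # hypothetical logger name


def run_step(name, func):
    """Run one pipeline step, logging its start, success, or failure."""
    logger.info("Starting step: %s", name)
    try:
        result = func()
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Step failed: %s", name)
        return None
    logger.info("Finished step: %s", name)
    return result


ok = run_step("load data", lambda: [1, 2, 3])
failed = run_step("train model", lambda: 1 / 0)
```

Returning `None` on failure keeps the example short; a real pipeline may prefer to re-raise after logging so that a failed step stops the run.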
|
|
|
#### B. Documentation |
|
|
|
- Docstrings for all classes and methods |
|
- Clear code structure with comments |
|
- This README file |
|
- Logging outputs for tracking |
|
|