---
license: mit
datasets:
- Nnaodeh/Stroke_Prediction_Dataset
language:
- en
pipeline_tag: tabular-classification
---
# Stroke Prediction Model
This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Multiple models are trained and compared to select the best performer. Below is a detailed explanation of how each key consideration was implemented.
### Dataset
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.
### Attribute Information
1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"\*
12. stroke: 1 if the patient had a stroke or 0 if not

\*"Unknown" in smoking_status means that the information is unavailable for the patient.
## Key Considerations Implementation
### Data Cleaning
#### Drop id column
The id column is dropped as it serves as a unique identifier for each row but does not contribute to the predictive power of the model.
#### Remove missing values
Rows with missing `bmi` values are removed; because they make up only a small fraction of the dataset, dropping them has negligible impact on model accuracy.
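The two cleaning steps above can be sketched as follows (the toy frame below is a hypothetical stand-in for the real dataset, with only a few of its columns):

```python
import pandas as pd

# Hypothetical mini-frame with a subset of the dataset's columns.
df = pd.DataFrame({
    "id": [9046, 51676, 31112],
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, None, 32.5],
    "stroke": [1, 1, 1],
})

df = df.drop(columns=["id"])    # id is a row identifier with no predictive value
df = df.dropna(subset=["bmi"])  # drop the few rows missing bmi
print(df.shape)  # (2, 3)
```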
### Feature Engineering
#### Binary Encoding
Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:
- ever_married: Encoded as 0 for “No” and 1 for “Yes”.
- Residence_type: Encoded as 0 for “Rural” and 1 for “Urban”.
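A minimal sketch of this binary encoding with pandas `map` (the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# Map each two-valued categorical column to 0/1 as described above.
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
df["Residence_type"] = df["Residence_type"].map({"Rural": 0, "Urban": 1})
```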
#### One-Hot Encoding for Multi-Class Categorical Features
- For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category.
- The onehot_encode function is assumed to handle the transformation, creating additional columns for each category while dropping the original column.
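Since `onehot_encode` is only described, not shown, here is one possible implementation using pandas `get_dummies`; the column-name prefix convention is an assumption:

```python
import pandas as pd

def onehot_encode(df, columns):
    """Replace each listed categorical column with per-category
    indicator columns, dropping the original column."""
    for col in columns:
        dummies = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df.drop(columns=[col]), dummies], axis=1)
    return df

df = pd.DataFrame({"smoking_status": ["smokes", "never smoked", "Unknown"]})
df = onehot_encode(df, ["smoking_status"])
```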
#### Split Dataset into Features and Target
- Separate the target variable (stroke) from the features:
- X: Contains all feature columns used as input for the model.
- y: Contains the target column, which indicates whether a stroke occurred.
#### Train-Test Split
- Split the dataset into training and testing sets to evaluate model performance effectively. This ensures the model is tested on unseen data and helps prevent overfitting.
- The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.
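The feature/target separation and a 70/30 split can be sketched with scikit-learn (the toy frame and the `random_state`/`stratify` choices below are assumptions, not the repo's actual settings):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned, encoded dataset.
df = pd.DataFrame({
    "age": [67, 61, 80, 49, 79, 81, 74, 69, 59, 78],
    "hypertension": [0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
    "stroke": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

X = df.drop(columns=["stroke"])  # feature matrix
y = df["stroke"]                 # target vector

# 70% train / 30% test; stratify keeps the stroke ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```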
### Model Selection
The following models are evaluated:
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (RBF Kernel)
- Neural Network
- Gradient Boosting
Each model is assessed on the following criteria:
- Handling of both numerical and categorical features
- Resistance to overfitting
- Availability of feature importance
- Performance on imbalanced data
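A minimal comparison loop over the six model families, sketched with scikit-learn on synthetic data; the hyperparameters and scoring below are assumptions, since the repo's actual training code is not shown here:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared stroke features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM (Linear Kernel)": SVC(kernel="linear"),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its test accuracy, then keep the best.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
best = max(scores, key=scores.get)
```

Accuracy alone is a weak metric for imbalanced data like stroke labels, so a real comparison would likely also track recall or ROC-AUC.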
### Software Engineering Best Practices
#### A. Logging
Comprehensive logging system:
```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
```
Logging features:
- Timestamp for each operation
- Different log levels (INFO, ERROR)
- Operation tracking
- Error capture and reporting
#### B. Documentation
- Docstrings for all classes and methods
- Clear code structure with comments
- This README file
- Logging outputs for tracking