nullHawk committed
Commit 85d5c46 · 1 parent: 58a93cd
Files changed (1):
  1. EDA_Documentation.md +441 -0
EDA_Documentation.md ADDED
# Loan Prediction EDA Documentation

## Executive Summary

This document provides a comprehensive overview of the Exploratory Data Analysis (EDA) and feature engineering process performed on the Lending Club loan dataset to prepare it for training an Artificial Neural Network (ANN) that predicts loan repayment outcomes.

**Dataset**: Lending Club Loan Data
**Original Size**: 396,030 records × 27 features
**Final Processed Size**: 396,030 records × 9 features
**Target Variable**: Loan repayment status (binary classification)
**Date**: June 2025

---

## Table of Contents

1. [Data Overview](#data-overview)
2. [Initial Data Exploration](#initial-data-exploration)
3. [Missing Data Analysis](#missing-data-analysis)
4. [Target Variable Analysis](#target-variable-analysis)
5. [Feature Correlation Analysis](#feature-correlation-analysis)
6. [Categorical Feature Analysis](#categorical-feature-analysis)
7. [Feature Engineering](#feature-engineering)
8. [Feature Selection](#feature-selection)
9. [Data Preprocessing for ANN](#data-preprocessing-for-ann)
10. [Final Dataset Summary](#final-dataset-summary)

---

## 1. Data Overview

### Initial Dataset Structure
- **Shape**: 396,030 rows × 27 columns
- **Target Variable**: `loan_status` (Fully Paid vs. Charged Off)
- **Feature Types**: Mix of numerical and categorical variables
- **Domain**: Peer-to-peer lending data from Lending Club

### Key Business Context
The goal is to predict whether a borrower will fully repay their loan or default (charge off). This is a critical business problem for lenders, as it directly impacts:
- Risk assessment
- Interest rate pricing
- Portfolio management
- Regulatory compliance

---

## 2. Initial Data Exploration

### Why This Step Was Performed
Understanding the basic structure and characteristics of the dataset is crucial before any analysis. This helps identify:
- Data quality issues
- Feature types and distributions
- Potential preprocessing needs

### Actions Taken
```python
# Basic exploration commands used:
df.shape       # Dataset dimensions
df.info()      # Data types and memory usage
df.describe()  # Statistical summary for numerical features
df.columns     # Feature names
```

### Key Findings
- 396,030 loan records spanning multiple years
- Mix of numerical (interest rates, amounts, ratios) and categorical (grades, purposes) features
- Presence of date features requiring special handling
- Some features with high cardinality (e.g., employment titles)

---

## 3. Missing Data Analysis

### Why This Step Was Critical
Missing data can significantly impact model performance and introduce bias. For neural networks, complete data is especially important for stable training.

### Methodology
1. **Quantified missing values** for each feature
2. **Visualized missing patterns** using a heatmap
3. **Applied strategic removal and imputation**

### Actions Taken
```python
import seaborn as sns

# Missing data analysis
df.isnull().sum().sort_values(ascending=False)
sns.heatmap(df.isnull(), cbar=False)  # Visual pattern analysis
```

### Decisions Made
1. **Removed high-missing features**:
   - `mort_acc` (mortgage accounts)
   - `emp_title` (employment titles: too many unique values)
   - `emp_length` (employment length: high missingness)
   - `title` (loan titles: redundant with `purpose`)

2. **Imputation strategy**:
   - **Numerical features**: Median imputation (robust to outliers)
   - **Categorical features**: Mode imputation (most frequent category)

### Rationale
- Features with >50% missing data were dropped to avoid introducing too much imputed noise
- Median imputation was chosen over mean for numerical features due to potential skewness in financial data
- Mode imputation maintains the natural distribution of categorical variables
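The drop-and-impute strategy above can be sketched in pandas (a minimal sketch; the helper name is illustrative, not the notebook's actual code):

```python
import pandas as pd

def drop_and_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Drop high-missing columns, then impute the rest (median / mode)."""
    df = df.drop(columns=['mort_acc', 'emp_title', 'emp_length', 'title'],
                 errors='ignore')
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())   # robust to outliers
            else:
                df[col] = df[col].fillna(df[col].mode()[0])  # most frequent category
    return df
```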

---

## 4. Target Variable Analysis

### Why This Analysis Was Essential
Understanding the target distribution is crucial for:
- Identifying class imbalance
- Choosing appropriate evaluation metrics
- Determining if sampling techniques are needed

### Findings
- **Fully Paid**: 318,357 loans (80.4%)
- **Charged Off**: 77,673 loans (19.6%)
- **Class Ratio**: ~4:1 (moderate imbalance)

### Target Engineering Decision
Created a binary target variable `loan_repaid`:
- **1**: Fully Paid (positive outcome)
- **0**: Charged Off (negative outcome)
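A minimal sketch of that encoding (assuming `loan_status` holds exactly the two string labels):

```python
import pandas as pd

df = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Fully Paid']})
# 1 = Fully Paid (repaid), 0 = Charged Off (default)
df['loan_repaid'] = df['loan_status'].map({'Fully Paid': 1, 'Charged Off': 0})
```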

### Impact on Modeling
The 80/20 split represents a moderate class imbalance that is manageable for neural networks without requiring aggressive resampling techniques.

---

## 5. Feature Correlation Analysis

### Purpose
Identify relationships between numerical features and the target variable to:
- Understand the predictive power of individual features
- Detect potential multicollinearity issues
- Guide feature selection priorities

### Methodology
```python
# Correlation of each numerical feature with the target
correlation_with_target = df[numerical_features + ['loan_repaid']].corr()['loan_repaid']
```

### Key Discoveries
**Top Predictive Features** (by correlation magnitude):
1. `revol_util` (-0.082): Higher revolving credit utilization = higher default risk
2. `dti` (-0.062): Higher debt-to-income ratio = higher default risk
3. `loan_amnt` (-0.060): Larger loans = higher default risk
4. `annual_inc` (+0.053): Higher income = lower default risk

### Business Insights
- **Credit utilization** emerged as the strongest single predictor
- **Debt ratios** consistently showed negative correlation with repayment
- **Income level** showed positive correlation with successful repayment
- Correlations were relatively weak overall, suggesting the need for feature engineering

---

## 6. Categorical Feature Analysis

### Objective
Understand how categorical variables relate to loan outcomes and identify high-impact categories.

### Features Analyzed
- `grade`: Lending Club's risk assessment (A-G)
- `home_ownership`: Housing status
- `verification_status`: Income verification level
- `purpose`: Loan purpose
- `initial_list_status`: Initial listing status
- `application_type`: Individual vs. joint application

### Key Findings

#### Grade Analysis
- **Grade A**: ~95% repayment rate (highest quality)
- **Grade G**: ~52% repayment rate (highest risk)
- Clear monotonic relationship between grade and repayment rate

#### Home Ownership
- **Any/Other**: Highest repayment rates (~100%)
- **Rent**: Lowest repayment rates (~78%)
- **Own/Mortgage**: Middle performance (~80-83%)

#### Purpose Analysis
- **Wedding**: Highest repayment rate (~88%)
- **Small Business**: Lowest repayment rate (~71%)
- **Debt Consolidation**: Most common purpose, with ~80% repayment
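Per-category repayment rates like those above can be computed by grouping on the binary target; a sketch with toy data (the real rates come from the full dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'grade':       ['A', 'A', 'G', 'G'],
    'loan_repaid': [1,   1,   1,   0],
})
# Mean of a 0/1 target per group = repayment rate for that category
rates = df.groupby('grade')['loan_repaid'].mean().sort_values(ascending=False)
```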

### Business Implications
- Lending Club's internal grading system is highly predictive
- Housing stability correlates with loan performance
- Loan purpose provides significant risk differentiation

---

## 7. Feature Engineering

### Strategic Approach
Created new features to capture complex relationships and domain knowledge that raw features might miss.

### New Features Created

#### Date-Based Features
```python
# issue_d and earliest_cr_line are assumed already parsed with pd.to_datetime
df['credit_history_length'] = (df['issue_d'] - df['earliest_cr_line']).dt.days / 365.25
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month
```
**Rationale**: Credit history length is a key risk factor in traditional credit scoring.

#### Financial Ratio Features
```python
# The +1 in each denominator guards against division by zero
df['debt_to_credit_ratio'] = df['revol_bal'] / (df['revol_bal'] + df['annual_inc'] + 1)
df['loan_to_income_ratio'] = df['loan_amnt'] / (df['annual_inc'] + 1)
df['installment_to_income'] = df['installment'] / (df['annual_inc'] / 12 + 1)
```
**Rationale**: Ratios normalize absolute amounts and capture relative financial stress.

#### Credit Utilization
```python
df['credit_utilization_ratio'] = df['revol_util'] / 100
```
**Rationale**: Convert the percentage to a ratio for consistent scaling.

#### Risk Encoding
```python
grade_mapping = {'A': 7, 'B': 6, 'C': 5, 'D': 4, 'E': 3, 'F': 2, 'G': 1}
df['grade_numeric'] = df['grade'].map(grade_mapping)
```
**Rationale**: Convert ordinal risk grades to numerical values while preserving order.

#### Aggregate Features
```python
df['total_credit_lines'] = df['open_acc'] + df['total_acc']
```
**Rationale**: Total credit experience indicator.

### Feature Engineering Validation
- Checked for infinite and NaN values in all new features
- Verified logical ranges and distributions
- Confirmed business logic alignment

---

## 8. Feature Selection

### Multi-Stage Selection Process

#### Stage 1: Categorical Encoding
Applied label encoding to categorical variables for compatibility with numerical analysis methods.

#### Stage 2: Random Forest Feature Importance
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_temp, y_temp)
feature_importance = rf.feature_importances_
```

**Why Random Forest for Feature Selection:**
- Handles mixed data types well
- Captures non-linear relationships
- Provides relative importance scores
- Less prone to overfitting than single trees

#### Stage 3: Top Features Identification
Selected the top 15 features based on importance scores:

1. **dti** (0.067): Debt-to-income ratio
2. **loan_to_income_ratio** (0.061): Loan amount relative to income
3. **credit_history_length** (0.061): Years of credit history
4. **installment_to_income** (0.060): Monthly payment burden
5. **debt_to_credit_ratio** (0.058): Debt utilization measure
6. **revol_bal** (0.057): Revolving credit balance
7. **installment** (0.054): Monthly payment amount
8. **revol_util** (0.053): Revolving credit utilization
9. **int_rate** (0.053): Interest rate
10. **credit_utilization_ratio** (0.053): Utilization as a ratio
11. **annual_inc** (0.050): Annual income
12. **total_credit_lines** (0.045): Total credit accounts
13. **sub_grade_encoded** (0.045): Detailed risk grade
14. **total_acc** (0.044): Total accounts ever opened
15. **loan_amnt** (0.043): Loan amount

#### Stage 4: Multicollinearity Removal
Identified and removed highly correlated features (|r| > 0.8):

**Removed Features and Rationale:**
- `loan_to_income_ratio` (r=0.884 with `dti`): kept `dti` as the more standard metric
- `installment_to_income` (r=0.977 with `loan_to_income_ratio`): redundant information
- `credit_utilization_ratio` (r=1.000 with `revol_util`): perfect correlation by construction
- `sub_grade_encoded` (r=0.974 with `int_rate`): interest rate is the more direct signal
- `total_acc` (r=0.971 with `total_credit_lines`): kept the engineered feature
- `loan_amnt` (r=0.954 with `installment`): monthly payment impact is more relevant
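One way to implement this pruning: a sketch (not the notebook's exact code) that greedily drops the later feature of any pair whose absolute correlation exceeds the threshold:

```python
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop each feature whose |r| with an earlier-kept feature exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```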

### Final Feature Set (9 features)
1. **dti**: Debt-to-income ratio
2. **credit_history_length**: Credit history in years
3. **debt_to_credit_ratio**: Debt utilization measure
4. **revol_bal**: Revolving balance amount
5. **installment**: Monthly payment amount
6. **revol_util**: Revolving utilization percentage
7. **int_rate**: Interest rate
8. **annual_inc**: Annual income
9. **total_credit_lines**: Total credit accounts

---

## 9. Data Preprocessing for ANN

### Why These Steps Were Necessary
Neural networks are sensitive to:
- Feature scale differences
- Input distribution characteristics
- Data leakage between train/test sets

### Preprocessing Pipeline

#### Train-Test Split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_final,
    test_size=0.2,
    random_state=42,
    stratify=y_final
)
```
**Parameters Chosen:**
- **80/20 split**: Standard for large datasets
- **Stratified**: Maintains class balance in both sets
- **Random state**: Ensures reproducibility

#### Feature Scaling
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

**Why StandardScaler:**
- **Neural networks benefit from normalized inputs** (typically mean=0, std=1)
- **Prevents feature dominance** based on scale
- **Improves gradient descent convergence**
- **Fit only on training data** to prevent data leakage

### Data Leakage Prevention
- Scaler fitted only on training data
- All transformations applied consistently to test data
- No future information used in feature creation

---

## 10. Final Dataset Summary

### Dataset Characteristics
- **Training Set**: 316,824 samples (80%)
- **Test Set**: 79,206 samples (20%)
- **Features**: 9 carefully selected numerical features
- **Target Distribution**: Maintained 80.4% Fully Paid, 19.6% Charged Off

### Feature Quality Metrics
- **Maximum correlation between features**: 0.632 (acceptable level)
- **All features scaled**: Mean ≈ 0, standard deviation ≈ 1
- **No missing values**: Complete dataset ready for training
- **Feature importance range**: 0.043 to 0.067 (balanced contribution)

### Model Readiness Checklist
- ✅ **No missing values**
- ✅ **Appropriate feature scaling**
- ✅ **Balanced feature importance**
- ✅ **Minimal multicollinearity**
- ✅ **Stratified train-test split**
- ✅ **Class distribution preserved**
- ✅ **No data leakage**

### Business Value Preserved
The final feature set maintains strong business interpretability:
- **Financial ratios**: dti, debt_to_credit_ratio, revol_util
- **Credit behavior**: credit_history_length, total_credit_lines
- **Loan characteristics**: int_rate, installment
- **Financial capacity**: annual_inc, revol_bal

---

## Methodology Strengths

### 1. Domain-Driven Approach
- Feature engineering based on credit risk principles
- Business logic validation at each step
- Interpretable feature selection

### 2. Statistical Rigor
- Systematic missing data analysis
- Correlation-based multicollinearity detection
- Stratified sampling for the train-test split

### 3. Model-Appropriate Preprocessing
- Standardization suitable for neural networks
- Feature selection optimized for predictive power
- Data leakage prevention measures

### 4. Reproducibility
- Fixed random seeds throughout
- Documented preprocessing steps
- Saved preprocessing parameters

---

## Recommendations for ANN Training

### 1. Architecture Suggestions
- **Input layer**: 9 neurons (one per feature)
- **Hidden layers**: Start with 2-3 layers, 16-32 neurons each
- **Output layer**: 1 neuron with sigmoid activation (binary classification)

### 2. Training Considerations
- **Class weights**: Consider using `class_weight='balanced'` due to the 80/20 split
- **Regularization**: Dropout layers (0.2-0.3) to prevent overfitting
- **Early stopping**: Monitor validation loss to prevent overtraining
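The `'balanced'` heuristic mentioned above weights each class by n_samples / (n_classes × class_count); a sketch using the class counts reported in Section 4:

```python
# Class counts from the target analysis: 318,357 Fully Paid (1), 77,673 Charged Off (0)
counts = {0: 77_673, 1: 318_357}
n_samples = sum(counts.values())
n_classes = len(counts)

# The formula scikit-learn uses for class_weight='balanced'
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}
# The minority class (Charged Off) receives the larger weight
```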

### 3. Evaluation Metrics
- **Primary**: AUC-ROC (handles class imbalance well)
- **Secondary**: Precision, recall, F1-score
- **Business**: False positive/negative rates and associated costs
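These metrics can be sketched with scikit-learn (toy arrays stand in for the model's predictions):

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]             # predicted probabilities
y_pred  = [int(s >= 0.5) for s in y_score]  # labels at a 0.5 threshold

auc  = roc_auc_score(y_true, y_score)  # threshold-free; robust to class imbalance
prec = precision_score(y_true, y_pred)
rec  = recall_score(y_true, y_pred)
f1   = f1_score(y_true, y_pred)
```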

### 4. Potential Enhancements
- **Feature interactions**: Consider polynomial features for top variables
- **Ensemble methods**: Combine the ANN with tree-based models
- **Advanced sampling**: SMOTE if class imbalance proves problematic

---

## Conclusion

This EDA process transformed a raw dataset of 396,030 loan records with 27 features into a clean, analysis-ready dataset with 9 highly predictive features. The methodology emphasized:

- **Data quality** through systematic missing value handling
- **Feature relevance** through importance-based selection
- **Model compatibility** through appropriate preprocessing
- **Business alignment** through domain-knowledge integration

The resulting dataset is well prepared for neural network training while maintaining strong business interpretability and statistical validity.