Spaces:
Sleeping
A newer version of the Streamlit SDK is available:
1.46.1
π Exploratory Data Analysis (EDA) - Loan Prediction
This document explains the key decisions made during the exploratory data analysis phase and the reasoning behind feature engineering choices.
π― Objective
The primary goal of EDA was to understand the LendingClub dataset, identify patterns in loan defaults, and prepare the data for optimal machine learning model performance.
π Dataset Overview
Initial Dataset Characteristics
- Total Records: ~400,000 loan applications
- Original Features: 23 features
- Target Variable:
loan_status
(binary: 0=Fully Paid, 1=Charged Off) - Class Distribution: ~78% Fully Paid, ~22% Charged Off (imbalanced)
Data Quality Assessment
Missing Values Analysis
# Key findings from missing value analysis
missing_values = df.isnull().sum()
high_missing_features = missing_values[missing_values > 0.3 * len(df)]
Decision: Removed features with >30% missing values to maintain data integrity:
emp_title
: 95% missingdesc
: 98% missingmths_since_last_delinq
: 55% missing
Data Types and Distributions
- Numerical Features: 15 features (loan amounts, rates, income, etc.)
- Categorical Features: 8 features (grade, purpose, home ownership, etc.)
- Date Features: 2 features (converted to numerical representations)
π Key EDA Insights
1. Target Variable Analysis
Default Rate by Loan Grade
Grade A: 5.8% default rate
Grade B: 9.4% default rate
Grade C: 13.6% default rate
Grade D: 18.9% default rate
Grade E: 25.8% default rate
Grade F: 33.2% default rate
Grade G: 40.1% default rate
Decision: Keep grade
as a strong predictor - clear inverse relationship with loan performance.
2. Feature Correlation Analysis
High Correlation Pairs Identified
loan_amnt
vsinstallment
: r = 0.95int_rate
vsgrade
: r = -0.89annual_inc
vsloan_amnt
: r = 0.33
Decision: Removed highly correlated features to prevent multicollinearity:
- Kept
installment
overfunded_amnt
(r = 0.99) - Retained
grade
oversub_grade
(more interpretable)
3. Numerical Feature Distributions
Loan Amount Distribution
- Range: $500 - $40,000
- Mean: $14,113
- Distribution: Right-skewed
- Decision: Applied log transformation to normalize distribution
Interest Rate Analysis
- Range: 5.32% - 30.99%
- Distribution: Multimodal (reflects different risk grades)
- Decision: Kept original scale - meaningful business interpretation
Annual Income
- Issues: Extreme outliers (>$1M annual income)
- Decision: Capped at 99th percentile to reduce outlier impact
4. Categorical Feature Analysis
Purpose of Loan
debt_consolidation: 58.2%
credit_card: 18.7%
home_improvement: 5.8%
other: 17.3%
Decision: Grouped low-frequency categories into "other" to reduce dimensionality.
Employment Length
- Issues: "n/a" and "< 1 year" categories
- Decision: Created ordinal encoding (0-10 years) with special handling for missing values
π οΈ Feature Engineering Decisions
1. Feature Selection Strategy
Applied multiple selection techniques:
- Correlation Analysis: Removed features with |r| > 0.9
- Random Forest Importance: Selected top 15 features
- SelectKBest (f_classif): Validated statistical significance
Final Feature Set (9 features):
loan_amnt
: Primary loan amountint_rate
: Interest rate (risk indicator)installment
: Monthly payment amountgrade
: LendingClub risk gradeemp_length
: Employment stabilityannual_inc
: Income leveldti
: Debt-to-income ratioopen_acc
: Credit utilizationpub_rec
: Public derogatory records
2. Data Preprocessing Pipeline
Numerical Features
# StandardScaler for numerical features
scaler = StandardScaler()
numerical_features = ['loan_amnt', 'int_rate', 'installment',
'annual_inc', 'dti', 'open_acc', 'pub_rec']
Reasoning: Neural networks perform better with normalized inputs.
Categorical Features
# Label Encoding for ordinal features
grade_mapping = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
emp_length_mapping = {'< 1 year': 0, '1 year': 1, ..., '10+ years': 10, 'n/a': -1}
Reasoning: Preserves ordinal relationships while enabling numerical processing.
3. Handling Class Imbalance
Strategies Implemented:
- Weighted Loss Function: Applied class weights inversely proportional to frequency
- Stratified Sampling: Maintained class distribution in train/validation splits
- Focal Loss: Implemented to focus learning on hard examples
class_weights = compute_class_weight(
class_weight='balanced',
classes=np.unique(y_train),
y=y_train
)
π Feature Importance Analysis
Random Forest Feature Importance
- int_rate: 0.284 (Primary risk indicator)
- grade: 0.198 (LendingClub's risk assessment)
- dti: 0.156 (Debt burden)
- annual_inc: 0.134 (Income capacity)
- loan_amnt: 0.089 (Loan size)
Statistical Significance (f_classif)
All selected features showed p-value < 0.001, confirming statistical significance.
π¨ Visualization Insights
1. Default Rate by Grade
- Clear stepwise increase in default rates
- Justifies grade as primary feature
2. Interest Rate Distribution
- Multimodal distribution reflecting risk tiers
- Strong correlation with default probability
3. Income vs Default Rate
- Inverse relationship: higher income β lower default
- Supports inclusion in final model
βοΈ Ethical Considerations
Bias Analysis
- Income Bias: Checked for discriminatory patterns
- Employment Bias: Ensured fair treatment of employment categories
- Geographic Bias: Removed state-specific features to avoid regional discrimination
Fairness Metrics
- Implemented disparate impact analysis
- Monitored model performance across demographic groups
π§ Data Quality Improvements
1. Outlier Treatment
- Income: Capped at 99th percentile
- DTI: Removed impossible values (>100%)
- Employment Length: Handled missing values appropriately
2. Data Validation
- Implemented range checks for all numerical features
- Added consistency checks between related features
3. Feature Engineering Quality
- Created interaction terms where business logic supported
- Validated all transformations preserved interpretability
π Impact on Model Performance
Before EDA (All Features):
- Accuracy: 68.2%
- High overfitting risk
- Poor interpretability
After EDA (Selected Features):
- Accuracy: 70.1%
- Improved generalization
- Better business interpretability
- Reduced training time by 60%
π― Key Takeaways
- Feature Selection Crucial: Reduced from 23 to 9 features improved performance
- Domain Knowledge Important: LendingClub's grade system proved most valuable
- Class Imbalance Handling: Critical for real-world performance
- Outlier Management: Significant impact on model stability
- Business Interpretability: Maintained throughout process
π Preprocessing Pipeline Summary
def preprocess_loan_data(df):
# 1. Handle missing values
df = handle_missing_values(df)
# 2. Remove outliers
df = cap_outliers(df)
# 3. Encode categorical variables
df = encode_categorical_features(df)
# 4. Select important features
df = select_features(df, selected_features)
# 5. Scale numerical features
df_scaled = scale_features(df)
return df_scaled
π References
- LendingClub Dataset Documentation
- Scikit-learn Feature Selection Guide
- PyTorch Documentation for Neural Networks
- "Hands-On Machine Learning" by AurΓ©lien GΓ©ron
Next Steps: See Model Architecture Documentation for details on neural network design and training methodology.