initial upload of modeling pipeline and summary
README.md
ADDED
@@ -0,0 +1,90 @@
---
license: mit
tags:
- scikit-learn
- tabular
- nonprofit
- planned-giving
- classification
- snowflake
---

# Planned Giving Propensity Model

A machine learning solution to optimize planned giving donor targeting for the National Parks Conservation Association (NPCA).

> **Note**: This model is not currently deployed or downloadable due to data privacy constraints. This repository shares the modeling approach, evaluation strategy, and relevant pipeline components for reproducibility and educational use.

## Project Overview

This project implements a Random Forest classifier to identify potential planned giving donors, with the goal of improving mailing efficiency and response rates. The model processes donor data within Snowflake's compute infrastructure and uses SMOTE to handle class imbalance.

## Key Results

- **PR-AUC**: 0.88 — strong performance on imbalanced data (computation sketched below)
- **F1 Score**: 0.8125
- **Precision**: 0.7558
- **Recall**: 0.8784 — high capture rate of known planned givers
- **1,019 new high-potential donor predictions** for targeted outreach
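For context on these numbers: the training script scores out-of-fold (cross-validated) predictions, taking PR-AUC as the area under the precision-recall curve, which is more informative than plain accuracy on data this imbalanced. A minimal sketch of the calculation with hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import (precision_recall_curve, auc,
                             f1_score, precision_score, recall_score)

# Hypothetical out-of-fold labels and class-1 probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])

# PR-AUC: area under the precision-recall curve
precision, recall, _ = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)

# Thresholded metrics at a 0.5 cutoff
y_pred = (y_scores >= 0.5).astype(int)
print(f"PR-AUC {pr_auc:.2f} | F1 {f1_score(y_true, y_pred):.4f} | "
      f"precision {precision_score(y_true, y_pred):.4f} | "
      f"recall {recall_score(y_true, y_pred):.4f}")
```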
## Technical Implementation

### Data Pipeline
- Donor data extracted from the CRM into Snowflake
- Modular Python scripts for feature engineering and cleaning
- SMOTE oversampling to address class imbalance (see the sketch after this list)
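The training script applies SMOTE inside each cross-validation training split only, so held-out data keeps the true class distribution and metrics are not inflated by synthetic records. A minimal sketch of that pattern on synthetic data (the ~5% positive rate below is illustrative):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the donor table: ~5% positive class (illustrative)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Oversample the minority class in the training split only;
# the test split keeps the real-world imbalance
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("before:", np.bincount(y_train), "after:", np.bincount(y_res))
```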
### Machine Learning
- Random Forest classifier with `scikit-learn`
- Stratified cross-validation and grid search (sketched after this list)
- Multiple imputation strategies compared (MICE, mean, median)
- Key temporal features (e.g., years since the most recent gift)
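The published training script uses scikit-learn's default Random Forest hyperparameters; the grid search mentioned above would be wired up roughly as follows. This is a sketch, and the parameter grid is illustrative rather than the tuned configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative grid, not the values behind the reported results
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    scoring="average_precision",  # PR-AUC-style scoring suits the imbalance
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
# search.fit(X_train, y_train); search.best_params_
```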
📂 [Training Script](./model/split_SMOTE_crossval.py)
📓 [Evaluation Notebook](./model/snowflake_model_evaluation.ipynb)

## Model Performance Insights

Post-modeling analysis validated predictions against known donor engagement indicators:

- **66.3%** of predicted donors were already flagged as prospects by fundraisers
- **37.6%** are major donor households
- **18%** are members of the Mather Legacy Society
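These percentages come from joining predictions back to existing CRM flags (see `predictions_analyzed/` and the flag extract produced by `predictions_existing_flags.sql`). A minimal sketch of that concurrence check, with hypothetical stand-in frames:

```python
import pandas as pd

# Hypothetical stand-ins for the model output and the SQL flag extract
predictions = pd.DataFrame({"ROI_FAMILY_ID": [1, 2, 3, 4],
                            "y_pred": [1, 1, 1, 1]})
flags = pd.DataFrame({
    "ROI_FAMILY_ID": [1, 2, 3, 4],
    "MD_GROUP": ["Y", "N", "Y", "N"],  # major donor household
    "MLS": ["N", "Y", "N", "N"],       # Mather Legacy Society
})

merged = predictions.merge(flags, on="ROI_FAMILY_ID", how="left")
for flag in ["MD_GROUP", "MLS"]:
    print(f"{flag}: {merged[flag].eq('Y').mean():.1%} of predicted donors")
```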
### Top 5 Most Important Features
1. Highest Previous Contribution (22.8%)
2. Most Recent Contribution (20.1%)
3. Years Since HPC Gift (14.6%)
4. Total Amount (14.3%)
5. Years Since MRC Gift (11.2%)

### Demographics of Predicted Donors
- Average age: 69
- Giving history: 16 years (on average)
- Median total giving: $10,932
- Average number of transactions: 18

## Tools and Technologies

- `scikit-learn`, `pandas`, `numpy`
- Snowflake
- `imbalanced-learn`, `matplotlib`, `seaborn`

## Repository Structure

```plaintext
├── model/
│   ├── split_SMOTE_crossval.py           # ML model executed on Snowflake
│   └── snowflake_model_evaluation.ipynb  # Model evaluation and visualization
├── predictions_analyzed/                 # Post-modeling analysis
│   ├── predictions_analyzed.ipynb        # Model concurrence evaluation
│   └── predictions_existing_flags.sql    # Existing CRM flag extract
├── requirements.txt
└── README.md
```
## Potential Future Improvements
- Schedule automated data refresh and model retraining (see the sketch after this list)
- Incorporate additional feature engineering
- Develop a dashboard for tracking model performance
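A sketch of how the first item could work in this stack: register the training entry point as a Snowpark stored procedure, then drive it with a scheduled Snowflake task. This assumes `model/split_SMOTE_crossval.py` is importable as a module; the stage, warehouse, task name, and cron expression are placeholders, not project configuration:

```python
import snowflake.snowpark as snowpark
from snowflake.snowpark.types import StringType

def schedule_retraining(session: snowpark.Session):
    # Hypothetical import: assumes the training script is on the path
    from split_SMOTE_crossval import main

    # Register the training entry point as a permanent stored procedure
    session.sproc.register(
        main,
        return_type=StringType(),
        input_types=[],
        name="RETRAIN_BEQUEST_MODEL",
        packages=["snowflake-snowpark-python", "scikit-learn",
                  "imbalanced-learn", "pandas", "numpy"],
        is_permanent=True,
        stage_location="@MODEL_STAGE",  # placeholder stage
        replace=True,
    )

    # Schedule it: retrain at 06:00 UTC on the first of each month
    session.sql("""
        CREATE OR REPLACE TASK RETRAIN_BEQUEST_MODEL_MONTHLY
          WAREHOUSE = COMPUTE_WH  -- placeholder warehouse
          SCHEDULE = 'USING CRON 0 6 1 * * UTC'
        AS CALL RETRAIN_BEQUEST_MODEL()
    """).collect()
    session.sql("ALTER TASK RETRAIN_BEQUEST_MODEL_MONTHLY RESUME").collect()
```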

*Note: The full project repository lives at [GitHub – dbouquin/bequest_modeling](https://github.com/dbouquin/bequest_modeling).*
model/snowflake_model_evaluation.ipynb
ADDED
The diff for this file is too large to render.
model/split_SMOTE_crossval.py
ADDED
@@ -0,0 +1,166 @@
import snowflake.snowpark as snowpark
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_recall_curve, auc
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
import json

def main(session: snowpark.Session):
    # Load the cleaned donor table from Snowflake
    df = session.table("PUBLIC.BEQUESTS_CLEAN").to_pandas()

    # Imputation strategies to compare (mean, median, and MICE)
    imputers = {
        'mean': SimpleImputer(strategy='mean'),
        'median': SimpleImputer(strategy='median'),
        'mice': IterativeImputer(random_state=42)
    }

    # Store results
    results_dead = []
    results_alive = []
    results_modeling = []

    # Evaluate one imputation method end to end: impute, cross-validate on
    # deceased (labeled) records, then score living (unlabeled) records
    def evaluate_imputation(df, imputer_name, imputer):
        # Impute BIRTH_YEAR
        df['BIRTH_YEAR'] = imputer.fit_transform(df[['BIRTH_YEAR']])

        # Encode categorical variables
        df = pd.get_dummies(df, columns=['REGION_CODE'], drop_first=True)

        # Define features after one-hot encoding
        feature_columns = [
            'TOTAL_TRANSACTIONS',
            'TOTAL_AMOUNT',
            'FIRST_GIFT_AMOUNT',
            'MRC_AMOUNT',
            'HPC_AMOUNT',
            'YEARS_SINCE_FIRST_GIFT',
            'YEARS_SINCE_MRC_GIFT',
            'YEARS_SINCE_HPC_GIFT',
            'BIRTH_YEAR'
        ] + [c for c in df.columns if c.startswith('REGION_CODE_')]

        # Separate deceased and living individuals; bequest outcomes are only
        # known for the deceased, so they form the labeled training set
        df_dead = df[df['DEATH_FLAG'] == 1]
        df_alive = df[df['DEATH_FLAG'] == 0]

        # Train and evaluate on deceased individuals
        if len(df_dead) > 0:
            X_dead = df_dead[feature_columns]
            y_dead = df_dead['BEQUEST_RECEIVED']
            ROI_FAMILY_ID_dead = df_dead['ROI_FAMILY_ID']

            # Cross-validation setup
            skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
            smote = SMOTE(random_state=42)
            model = RandomForestClassifier(random_state=42, n_jobs=-1)  # Use all available cores

            # Out-of-fold predictions; SMOTE is applied to each training fold
            # only, so test folds keep the true class distribution
            y_pred_dead = np.zeros(len(y_dead))
            y_pred_proba_dead = np.zeros(len(y_dead))
            for train_index, test_index in skf.split(X_dead, y_dead):
                X_train, X_test = X_dead.iloc[train_index], X_dead.iloc[test_index]
                y_train = y_dead.iloc[train_index]

                X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
                model.fit(X_train_res, y_train_res)
                y_pred_dead[test_index] = model.predict(X_test)
                y_pred_proba_dead[test_index] = model.predict_proba(X_test)[:, 1]  # Probability for class 1

            # Evaluation for deceased individuals
            accuracy_dead = accuracy_score(y_dead, y_pred_dead)
            precision_dead, recall_dead, _ = precision_recall_curve(y_dead, y_pred_proba_dead)
            auc_pr_dead = auc(recall_dead, precision_dead)
            report_dead = classification_report(y_dead, y_pred_dead, output_dict=True)

            # Refit on all labeled records (without SMOTE) for feature
            # importances; this refit model also scores living donors below
            model.fit(X_dead, y_dead)
            feature_importance_dead = pd.DataFrame({
                'Feature': X_dead.columns,
                'Importance': model.feature_importances_
            }).sort_values(by='Importance', ascending=False)

            results_dead.append({
                'imputer': imputer_name,
                'accuracy': accuracy_dead,
                'auc_pr': auc_pr_dead,
                'report': pd.DataFrame(report_dead).transpose(),
                'feature_importance': feature_importance_dead,
                'ROI_FAMILY_ID': ROI_FAMILY_ID_dead,
                'y_true': y_dead,
                'y_pred': y_pred_dead
            })

            results_modeling.append({
                'imputer': imputer_name,
                'accuracy': accuracy_dead,
                'auc_pr': auc_pr_dead,
                'classification_report': json.dumps(report_dead),
                # Serialized as JSON so write_pandas can store it in a string column
                'feature_importance': json.dumps(feature_importance_dead.to_dict(orient='list'))
            })

        # Predict on living individuals
        if len(df_alive) > 0:
            X_alive = df_alive[feature_columns]
            y_pred_alive = model.predict(X_alive)
            ROI_FAMILY_ID_alive = df_alive['ROI_FAMILY_ID']

            results_alive.append({
                'imputer': imputer_name,
                'ROI_FAMILY_ID': ROI_FAMILY_ID_alive,
                'y_pred': y_pred_alive
            })

    # Evaluate each imputation method
    for imputer_name, imputer in imputers.items():
        evaluate_imputation(df.copy(), imputer_name, imputer)

    # Print the modeling results for deceased individuals
    for result in results_dead:
        print(f"Imputer: {result['imputer']} (Dead)")
        print("Accuracy:", result['accuracy'])
        print("AUC-PR:", result['auc_pr'])
        print("Classification Report:")
        print(result['report'])
        print("Feature Importance:")
        print(result['feature_importance'])
        print("\n" + "-"*50 + "\n")

    # Combine all deceased predictions into a single DataFrame
    predictions_dead_df = pd.concat([
        pd.DataFrame({
            'ROI_FAMILY_ID': result['ROI_FAMILY_ID'],
            'imputer': result['imputer'],
            'y_true': result['y_true'],
            'y_pred': result['y_pred'],
            'status': 'dead'
        }) for result in results_dead
    ], ignore_index=True)

    # Combine all living predictions into a single DataFrame
    predictions_alive_df = pd.concat([
        pd.DataFrame({
            'ROI_FAMILY_ID': result['ROI_FAMILY_ID'],
            'imputer': result['imputer'],
            'y_pred': result['y_pred'],
            'status': 'alive'
        }) for result in results_alive
    ], ignore_index=True)

    # Write predictions and modeling results back to Snowflake tables
    session.write_pandas(predictions_dead_df, 'BEQUEST_PREDICTIONS_DEAD', auto_create_table=True)
    session.write_pandas(predictions_alive_df, 'BEQUEST_PREDICTIONS_ALIVE', auto_create_table=True)

    modeling_results_df = pd.DataFrame(results_modeling)
    session.write_pandas(modeling_results_df, 'BEQUEST_MODELING_RESULTS', auto_create_table=True)

    return "Data processing, prediction, and table creation completed successfully."
predictions_analyzed/predictions_analyzed.ipynb
ADDED
The diff for this file is too large to render.
predictions_analyzed/predictions_existing_flags.sql
ADDED
@@ -0,0 +1,222 @@
WITH
pred_beq AS (
    SELECT ROIFAMILYID
    FROM UP_16425_1193735
),

account_profile_family AS (
    SELECT roi_id,
           roi_family_id
    FROM v_account_profile_family apf
    WHERE EXISTS (
        SELECT *
        FROM pred_beq pb
        WHERE pb.ROIFAMILYID = apf.roi_family_id
    )
),

account_profile AS (
    SELECT roi_id, -- Added roi_id here to reference it later
           account_classification
    FROM v_account_profile ap
    WHERE EXISTS (
        SELECT *
        FROM account_profile_family apf
        WHERE apf.roi_id = ap.roi_id
    )
),

primaryAddresses AS (
    SELECT roi_id,
           city,
           state_code AS state,
           zipcode
    FROM v_account_primary_address
    WHERE EXISTS (
        SELECT *
        FROM account_profile_family apf
        WHERE apf.roi_id = v_account_primary_address.roi_id
    )
),

flag_universe AS (
    SELECT roi_id,
           MAX(CASE WHEN flagstd_code LIKE 'MD_GROUP%' THEN 'Y' ELSE 'N' END) AS MD_GROUP,
           MAX(CASE WHEN flagstd_code LIKE 'MD_TFP_HIGH' THEN 'Y' ELSE 'N' END) AS DEV_TFP,
           MAX(CASE WHEN flagstd_code LIKE 'MLS%' THEN 'Y' ELSE 'N' END) AS MLS,
           MAX(CASE WHEN flagstd_code LIKE 'REGCOUNCIL%' THEN 'Y' ELSE 'N' END) AS REG_COUNCIL,
           MAX(CASE WHEN flagstd_code LIKE 'NPROLE_COUNCIL' THEN 'Y' ELSE 'N' END) AS Nat_Council,
           MAX(CASE WHEN flagstd_code LIKE 'NPROLE_BOARD%' THEN 'Y' ELSE 'N' END) AS Board_or_Emeritus,
           MAX(CASE WHEN flagstd_code LIKE 'SUSTAINER%' THEN 'Y' ELSE 'N' END) AS SUSTAINER,
           MAX(CASE WHEN flagstd_code LIKE 'SF_%' THEN v_account_flag_active.flagstd_name ELSE NULL END) AS SUPERFUND,
           MAX(CASE WHEN flagstd_code LIKE 'SF_GROUP5_PLG_PROSP_FY24' THEN v_account_flag_active.flagstd_name ELSE NULL END) AS SUPERFUND_PlannedGift,
           MAX(CASE WHEN flagstd_code = 'NPROLE_VETCOUNCIL' THEN 'Y' ELSE 'N' END) AS Vet_Council,
           MAX(CASE WHEN flagstd_code LIKE 'CF_GROUP_%' THEN 'Y' ELSE 'N' END) AS CF_GROUP
    FROM v_account_flag_active
    WHERE (
            (flagstd_code LIKE 'MD_GROUP%' AND end_date IS NULL)
            OR (flagstd_code LIKE 'MD_TFP_HIGH' AND end_date IS NULL)
            OR (flagstd_code LIKE 'MLS%' AND end_date IS NULL)
            OR (flagstd_code LIKE 'REGCOUNCIL%' AND end_date IS NULL)
            OR (flagstd_code LIKE 'NPROLE_COUNCIL' AND end_date IS NULL)
            OR (flagstd_code LIKE 'NPROLE_BOARD%' AND end_date IS NULL)
            OR (flagstd_code LIKE 'SF_%' AND end_date IS NULL)
            OR (flagstd_code = 'NPROLE_VETCOUNCIL' AND end_date IS NULL)
            OR flagstd_code LIKE 'CF_GROUP_%' /* Active or inactive CF_GROUP flags */
          )
      AND EXISTS (
            SELECT *
            FROM account_profile_family apf
            WHERE apf.roi_id = v_account_flag_active.roi_id
          )
    GROUP BY roi_id
),

universe AS (
    SELECT flag_universe.roi_id,
           flag_universe.MD_GROUP,
           flag_universe.DEV_TFP,
           flag_universe.MLS,
           flag_universe.REG_COUNCIL,
           flag_universe.Nat_Council,
           flag_universe.Board_or_Emeritus,
           flag_universe.Vet_Council,
           flag_universe.CF_GROUP,
           flag_universe.SUPERFUND,
           flag_universe.SUPERFUND_PlannedGift
    FROM flag_universe
),

criticalFlags AS (
    SELECT roi_id,
           MAX(CASE WHEN flagstd_code LIKE 'SF_%'
                     AND flagstd_code <> 'SF_GROUP5_PLG_PROSP_FY24'
                    THEN v_account_flag_active.flagstd_name
                    ELSE NULL END) AS SUPERFUND,
           MAX(CASE WHEN flagstd_code LIKE 'SF_GROUP5_PLG_PROSP_FY24'
                     AND flagstd_code NOT LIKE '%5'
                    THEN v_account_flag_active.flagstd_name
                    ELSE NULL END) AS SUPERFUND_PlannedGift,
           MAX(CASE WHEN flagstd_code LIKE 'SOLICIT_NO_MAIL' THEN 'Y' ELSE 'N' END) AS SOLICIT_NO_MAIL,
           MAX(CASE WHEN flagstd_code LIKE 'NO_EMAIL%' THEN 'Y' ELSE 'N' END) AS NO_EMAIL,
           MAX(CASE WHEN flagstd_code LIKE 'NPROLE_STAFF' THEN 'Y' ELSE 'N' END) AS np_role_staff
    FROM v_account_flag_active
    WHERE (
            flagstd_code LIKE 'SF_%'
            OR flagstd_code LIKE 'NO_EMAIL'
            OR flagstd_code LIKE 'SOLICIT_NO_MAIL'
            OR flagstd_code LIKE 'NPROLE_STAFF'
            OR flagstd_code LIKE 'SOLICIT_NO_PHONE'
          )
      AND EXISTS (
            SELECT *
            FROM universe
            WHERE universe.roi_id = v_account_flag_active.roi_id
          )
    GROUP BY roi_id
)

SELECT apf.roi_family_id,
       apf.roi_id,
       primaryAddresses.city,
       primaryAddresses.state,
       primaryAddresses.zipcode,
       COALESCE(criticalFlags.SOLICIT_NO_MAIL, 'N') AS SOLICIT_NO_MAIL,
       COALESCE(criticalFlags.NO_EMAIL, 'N') AS NO_EMAIL,
       COALESCE(universe.MD_GROUP, 'N') AS MD_GROUP,
       COALESCE(universe.DEV_TFP, 'N') AS DEV_TFP,
       COALESCE(universe.MLS, 'N') AS MLS,
       COALESCE(universe.REG_COUNCIL, 'N') AS REG_COUNCIL,
       COALESCE(universe.Nat_Council, 'N') AS Nat_Council,
       COALESCE(universe.Vet_Council, 'N') AS Vet_Council,
       COALESCE(universe.CF_GROUP, 'N') AS CF_GROUP,
       COALESCE(universe.Board_or_Emeritus, 'N') AS Board_or_Emeritus,
       COALESCE(universe.SUPERFUND, criticalFlags.SUPERFUND) AS SUPERFUND,
       COALESCE(universe.SUPERFUND_PlannedGift, criticalFlags.SUPERFUND_PlannedGift) AS SUPERFUND_PlannedGift,
       COALESCE(criticalFlags.np_role_staff, 'N') AS np_role_staff
FROM account_profile_family apf
JOIN universe ON apf.roi_id = universe.roi_id
JOIN primaryAddresses ON apf.roi_id = primaryAddresses.roi_id
LEFT JOIN criticalFlags ON apf.roi_id = criticalFlags.roi_id
requirements.txt
ADDED
@@ -0,0 +1,16 @@
# Core ML libraries
scikit-learn>=1.0.0
pandas>=1.3.0
numpy>=1.20.0
imbalanced-learn>=0.8.0

# Snowflake integration
snowflake-snowpark-python>=1.0.0

# Visualization (for the evaluation notebook)
seaborn>=0.11.0
matplotlib>=3.4.0

# Jupyter for running the evaluation notebook
jupyter>=1.0.0
ipykernel>=6.0.0