import time

import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate, learning_curve
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing
from statsmodels.stats.outliers_influence import variance_inflation_factor

def show_intro():
    with st.expander("➡️ What is Regression?"):
        st.markdown("""
**Regression** is a fundamental statistical technique used to understand and quantify the relationship between a **dependent variable (what you want to predict)** and one or more **independent variables (predictors)**.

---

###### 🔍 Everyday Examples of Regression:
- 📈 Predicting **house prices** based on size, location, and number of bedrooms.
- 🎓 Estimating a student’s **final grade** based on hours of study and attendance.
- 🚗 Forecasting **fuel efficiency** based on engine size and weight of the car.
- 🧠 Predicting **IQ scores** or **height** based on parental traits (enter Galton! 👇)

---

###### 👨‍👩‍👧‍👦 Galton’s Theory – *Regression to the Mean*
Sir Francis Galton, a 19th-century statistician and cousin of Charles Darwin, studied the heights of parents and their children.

He observed:
- Very tall parents tended to have children **shorter** than themselves.
- Very short parents tended to have children **taller** than themselves.

🧠 He coined the term **"regression to the mean"**, which means:
> "Extreme traits tend to be followed by traits closer to the average in the next generation."

---

###### 👶 Real-Life Example:
- If both parents are exceptionally tall (say, 6'5"), their child is **likely tall**, but **closer to the average height** than the parents — maybe 6'2".
- Similarly, if parents are very short, the child’s height tends to “regress” toward the average population height.

This pattern **doesn't mean height is random**, just that genetics and environment **pull traits toward typical values** over time.

---

Regression models in ML extend this idea — instead of modeling parent-child height, we model **any continuous outcome** based on relevant input variables.
""")
    with st.expander("➡️ Industry Use-Cases of Regression Models"):
        st.markdown("""
###### 🏥 Healthcare
- 🔬 Estimating **patient recovery time** based on age, treatment type, and initial condition.
- 💉 Predicting **blood glucose levels** based on dietary habits and medication dosage.
- 🫀 Forecasting **hospital readmission rates** based on prior health records and discharge details.

###### 🛒 Retail
- 📦 Predicting **sales volume** based on pricing, seasonality, and promotional campaigns.
- 🛍️ Estimating **inventory demand** for specific SKUs using historical sales and trends.
- 👗 Forecasting **customer churn** likelihood using past purchase behavior and returns.

###### 🛍️ E-commerce
- 💸 Predicting **customer lifetime value (CLV)** based on purchase frequency and basket size.
- 🚚 Estimating **delivery time** based on warehouse location, item type, and order volume.
- 🧾 Forecasting **return probability** of products based on description, images, and reviews.

###### 💰 Finance
- 📊 Predicting **stock prices** or **bond yields** based on historical trends and market indicators.
- 🏦 Estimating **credit risk** or **loan default probability** using income, credit history, etc.
- 💳 Forecasting **spending patterns** on credit cards based on customer behavior.

###### 💊 Pharma & Life Sciences
- 🧪 Predicting **drug efficacy** based on dosage and patient demographics in clinical trials.
- 🦠 Estimating **disease progression** timelines based on early symptoms and test results.
- 💊 Forecasting **adverse drug reactions** from formulation and patient profiles.
""")

def simple_regression_example():
    with st.expander("➡️ Single Variable Regression (Manual Calculation)"):
        # Sample data
        advertising_spend = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        sales_revenue = np.array([2.1, 3.9, 5.2, 6.0, 7.1, 8.1, 9.0, 10.2, 10.8, 12.0])
        # Regression coefficients (manual calculation)
        x_mean = np.mean(advertising_spend)
        y_mean = np.mean(sales_revenue)
        b1 = np.sum((advertising_spend - x_mean) * (sales_revenue - y_mean)) / np.sum((advertising_spend - x_mean) ** 2)
        b0 = y_mean - b1 * x_mean
        predicted_sales = b0 + b1 * advertising_spend
        # Two-column layout
        col1, col2 = st.columns(2)
        with col1:
            st.markdown("###### 📊 Sample Data")
            df = pd.DataFrame({
                'Advertising Spend (in lakhs)': advertising_spend,
                'Sales Revenue (in lakhs)': sales_revenue
            })
            st.dataframe(df)
        with col2:
            st.markdown("###### 📉 Linear Regression Formula")
            st.markdown(f"""
The linear regression equation is:

**Sales Revenue = {b0:.2f} + {b1:.2f} × Advertising Spend**

Where:
- **b₀ (Intercept)**: Sales revenue when advertising spend is zero.
- **b₁ (Slope)**: Increase in revenue for each additional lakh spent.

###### Formula for Computing Coefficients
- **b₁ (Slope)** = (Σ(xᵢ - x̄)(yᵢ - ȳ)) / Σ(xᵢ - x̄)²
- **b₀ (Intercept)** = ȳ - b₁ × x̄
""")
        # Plot the fitted regression line against the actual data
        fig, ax = plt.subplots(figsize=(9, 4))
        ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual')
        ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line')
        ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10)
        ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10)
        ax.set_title("Linear Regression: Advertising Spend vs Sales Revenue", fontsize=10)
        ax.tick_params(axis='both', labelsize=8)
        ax.legend()
        st.pyplot(fig)
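        # Illustrative sanity check (added for this write-up, not part of the original app):
        # np.polyfit with degree 1 returns [slope, intercept] and should reproduce the
        # manually computed coefficients above.
        fit_b1, fit_b0 = np.polyfit(advertising_spend, sales_revenue, deg=1)
        st.caption(f"Sanity check via np.polyfit: slope = {fit_b1:.2f}, intercept = {fit_b0:.2f} (matches the manual calculation).")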
with st.expander("➡️ Predict Sales Revenue from Advertising Spend"): | |
st.markdown(f"Use the trained regression model to forecast expected sales revenue 📈") | |
user_input = st.number_input( | |
"Enter Advertising Spend (in lakhs)", | |
min_value=1.0, | |
max_value=20.0, | |
value=5.0, | |
step=0.5, | |
format="%.1f" | |
) | |
if user_input: | |
predicted_value = b0 + b1 * user_input | |
st.success(f"🔮 Predicted Sales Revenue: **{predicted_value:.2f} lakhs**") | |
# Visualize prediction on the regression chart | |
fig, ax = plt.subplots(figsize=(9,4)) | |
ax.scatter(advertising_spend, sales_revenue, color='blue', label='Actual') | |
ax.plot(advertising_spend, predicted_sales, color='red', label='Fitted Line') | |
# Add dashed lines for prediction | |
ax.axvline(x=user_input, color='red', linestyle='--', linewidth=1) | |
ax.axhline(y=predicted_value, color='red', linestyle='--', linewidth=1) | |
ax.plot(user_input, predicted_value, 'ro') # predicted point | |
ax.set_xlabel("Advertising Spend (in lakhs)", fontsize=10) | |
ax.set_ylabel("Sales Revenue (in lakhs)", fontsize=10) | |
ax.set_title("Prediction on Regression Line", fontsize=10) | |
ax.tick_params(axis='both', labelsize=8) | |
ax.legend() | |
st.pyplot(fig) | |
with st.expander("➡️ Key Takeaways ..."): | |
st.markdown(""" | |
- 🔍 **Simplicity with Impact**: Even a simple linear model offers valuable foresight—linking investments (like ad spend) directly to outcomes (like sales revenue). | |
- 📊 **Data-Driven Decisions**: Enables leadership to make **objective** decisions, backed by quantitative evidence rather than gut feel. | |
- 🎯 **Budget Optimization**: Helps identify how much to invest to hit revenue targets—minimizing under or over-spending on campaigns. | |
- 📈 **Trend Insights**: Understanding whether returns from increased spending are **linear**, diminishing, or plateauing over time. | |
- 🧪 **Foundation for More Advanced Models**: This simple regression builds the base for multivariable models involving seasonality, regions, or digital channels. | |
""") | |

def load_ca_data():
    # Load the California Housing dataset as a DataFrame and split off the target column.
    data = fetch_california_housing(as_frame=True)
    X = data.frame.drop(['MedHouseVal'], axis=1)
    y = data.frame['MedHouseVal']
    return data.frame, X, y

def vif_check(df):
    # Variance Inflation Factor: VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing
    # feature i on all the other features. Large values signal multicollinearity.
    X = df.drop(columns=['MedHouseVal'])
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data, X, df['MedHouseVal']

def build_model(X_train, y_train):
    # Fit an ordinary least squares linear regression model.
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model

def main():
    st.markdown("**🧠 Regression Intuitions - Linear Regression Demo**")
    show_intro()
    simple_regression_example()
    with st.expander("➡️ Load & View California Housing Dataset"):
        df, X, y = load_ca_data()
        st.dataframe(df.head())
        st.markdown("""
The **California Housing Dataset** is based on data from the 1990 U.S. Census.
It contains information collected from block groups across California and is often used for regression tasks to predict housing values.

###### 📌 Columns Description:
- **MedInc** *(Median Income)*: Median income of households in the block group (in tens of thousands of dollars).
- **HouseAge** *(Median House Age)*: Median age of houses in the area.
- **AveRooms** *(Average Rooms)*: Average number of rooms per household.
- **AveBedrms** *(Average Bedrooms)*: Average number of bedrooms per household.
- **Population**: Total population of the block group.
- **AveOccup** *(Average Occupancy)*: Average number of people per household.
- **Latitude**: Geographical latitude of the block group.
- **Longitude**: Geographical longitude of the block group.

###### 🎯 Target Column:
- **MedHouseVal** *(Median House Value)*: This is the target variable to be predicted.
  It represents the **median house value** in the block group (in hundreds of thousands of dollars).
""")
st.markdown("###### 🗺️ California Housing: Prices by Location") | |
fig, ax = plt.subplots(figsize=(12, 5)) | |
scatter = ax.scatter( | |
df["Longitude"], | |
df["Latitude"], | |
c=df["MedHouseVal"], | |
cmap="viridis", | |
s=10, | |
alpha=0.5 | |
) | |
ax.set_title("Median House Value across California", fontsize=14) | |
ax.set_xlabel("Longitude") | |
ax.set_ylabel("Latitude") | |
ax.grid(True) | |
# Add color bar to represent house value | |
cbar = plt.colorbar(scatter, ax=ax) | |
cbar.set_label("Median House Value ($100,000s)") | |
# Annotate major cities | |
ax.annotate("Los Angeles", xy=(-118.25, 34.05), xytext=(-121, 33.8), | |
arrowprops=dict(facecolor='red', arrowstyle="->"), fontsize=10, color='red') | |
ax.annotate("San Francisco", xy=(-122.42, 37.77), xytext=(-125, 38.5), | |
arrowprops=dict(facecolor='blue', arrowstyle="->"), fontsize=10, color='blue') | |
# Shade ocean region (rough approximation: west of longitude -123) | |
ax.axvspan(-125, -123, color='lightblue', alpha=0.3, label="Pacific Ocean") | |
# Add legend | |
ax.legend(loc="lower right") | |
st.pyplot(fig) | |
st.write(f""" | |
- Color represents housing value: darker → cheaper, lighter → more expensive. | |
- notice high-value clusters around coastal regions (e.g., around the Bay Area and Los Angeles). | |
""" | |
) | |
with st.expander("➡️ Key Challenges of California Housing Dataset (Regression vs Rule-Based Models)"): | |
st.markdown(""" | |
Understanding the limitations of both data and modeling approaches is vital for leaders making data-driven decisions. Below are the key challenges when using this dataset for **regression modeling**, especially compared to traditional **rule-based systems**: | |
###### 🔍 Data Challenges (Specific to Regression): | |
- **Non-linear Relationships**: Housing prices may not increase proportionally with income, age, or other features, making simple linear models insufficient. | |
- **Geographic Bias**: Locations like LA and SF have unique dynamics not captured by standard features—housing is expensive due to factors beyond income or age. | |
- **Data Outliers**: Some neighborhoods may have unusually high or low prices, skewing the model's predictions. | |
- **Capped Target Values**: The `MedHouseVal` was capped at $500,000 in the dataset, which can limit the model's ability to predict higher-end housing. | |
###### 🤖 Compared to Rule-Based Models: | |
- **Rule-based systems lack adaptability**: Rules like "if income > X, price > Y" cannot account for regional nuances, housing density, or socio-economic patterns. | |
- **Hard to scale**: Adding new rules for every edge case becomes complex and unmanageable over time. | |
- **Not data-driven**: Rule-based logic does not improve from historical data or learn from new patterns. | |
###### 🧭 Key Takeaway: | |
> Regression models offer adaptability and learning from patterns across vast geographies and populations. However, they require clean, unbiased data and continuous validation—unlike rule-based systems, which are simple but brittle and not future-proof. | |
""") | |
    # with st.expander("➡️ Linearity Check & VIF"):
    #     vif_data, X, y = vif_check(df)
    #     st.dataframe(vif_data)
    with st.expander("➡️ Prepare Data for the Regression Model"):
        st.markdown("""
Creating training and test datasets is a fundamental step in building machine learning models. It ensures the model learns patterns **only from part of the data**, and is then **evaluated on unseen data** to measure its performance.

###### 🔧 Why Prepare Data?
- **Ensures Model Quality**: Models need structured and clean data to learn effectively.
- **Prevents Overfitting**: By separating training from testing, we prevent the model from simply memorizing the data.
- **Enables Generalization**: A well-prepared dataset ensures the model can make accurate predictions on new, real-world data.

###### 📦 Train-Test Split
- **Training Set**: Used by the model to learn patterns and relationships between input (features) and output (target).
- **Test Set**: Held back during training and used solely to evaluate model performance. It simulates how the model would perform in production.

###### ✅ Best Practices
- **Use an 80/20 or 70/30 split** depending on dataset size.
- **Stratify** if your target variable is imbalanced (more applicable in classification).
- **Set a random seed** (e.g., `random_state=42`) for reproducibility.
- **Fit preprocessing (scalers, encoders) on the training set only** to avoid data leakage into the test set.
- **Avoid using test data during model training or tuning**—this ensures an unbiased evaluation.

> 🔍 **Key point**: Proper data preparation is like setting the foundation of a building—without it, even the most advanced models can crumble in production.
""")
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        # Display the number of samples in the training and testing sets
        st.write(f"Number of samples in training set: {X_train.shape[0]}")
        st.write(f"Number of samples in testing set: {X_test.shape[0]}")
        st.write("Train and test sets created.")
with st.expander("➡️ Build Linear Regression Model"): | |
#model = build_model(X_train, y_train) | |
st.write("Training the Linear Regression model...") | |
# Simulate training with progress bar | |
progress_bar = st.progress(0) | |
for i in range(100): | |
time.sleep(0.01) | |
progress_bar.progress(i + 1) | |
# Train the model | |
model = LinearRegression() | |
model.fit(X_train, y_train) | |
st.success("Model trained successfully.") | |
# Predict and compute metrics | |
y_pred = model.predict(X_test) | |
mae = mean_absolute_error(y_test, y_pred) | |
mse = mean_squared_error(y_test, y_pred) | |
rmse = np.sqrt(mse) | |
r2 = r2_score(y_test, y_pred) | |
st.markdown("###### 📊 Model Evaluation on Test Set") | |
st.write(f"**MAE**: {mae:.2f}") | |
st.write(f"**MSE**: {mse:.2f}") | |
st.write(f"**RMSE**: {rmse:.2f}") | |
st.write(f"**R² Score**: {r2:.2f}") | |
        # Cross-validation to detect overfitting
        st.markdown("###### 🔁 Cross-Validation Performance")
        cv_results = cross_validate(model, X, y, cv=10, return_train_score=True, scoring='r2')
        train_r2 = cv_results['train_score']
        test_r2 = cv_results['test_score']
        r2_df = pd.DataFrame({
            'Fold': list(range(1, 11)),
            'Training R²': train_r2,
            'Test R²': test_r2
        })
        fig, ax = plt.subplots(figsize=(9, 5))
        ax.plot(r2_df['Fold'], r2_df['Training R²'], marker='o', label='Training R²', color='blue')
        ax.plot(r2_df['Fold'], r2_df['Test R²'], marker='o', label='Test R²', color='green')
        ax.set_title("Cross-Validation R² Scores")
        ax.set_xlabel("Fold")
        ax.set_ylabel("R² Score")
        ax.legend()
        ax.grid(True)
        st.pyplot(fig)
        st.dataframe(r2_df.style.format({'Training R²': '{:.2f}', 'Test R²': '{:.2f}'}))
        st.write("""
- ✅ **Consistent Training Performance**:
  The training R² scores range from **0.59 to 0.63**, indicating a fairly **consistent learning pattern** across all 10 folds.
  This means the model fits the training data consistently, fold after fold.
- ⚠️ **Test Set Variability**:
  The test R² scores range from **0.42 to 0.61**, showing **slightly higher variance** across folds.
  Some folds show strong performance (e.g., Fold 2), while others drop noticeably (e.g., Fold 3).
- 🔁 **No Severe Overfitting Detected**:
  If the training R² were very high (e.g., 0.9) and the test R² low (e.g., 0.3), that would indicate **overfitting**.
  In this case, **training and test R² are fairly close**, suggesting the model is **not overfitting significantly**.
- 📉 **Room for Improvement**:
  An average test R² around **0.52** implies that the model explains **just over 50% of the variance** in house values.
  For business-critical applications like real estate pricing or policy decisions, we may consider:
  - **Feature engineering** (e.g., regional segmentation),
  - **Model tuning**, or
  - **Trying more expressive models** like decision trees or gradient boosting.
""")
    # Learning curve
    with st.expander("➡️ Was Training Data Sufficient? (Learning Curve Analysis)"):
        st.markdown("###### 📊 Learning Curve Analysis")
        # Generate learning curves
        train_sizes, train_scores, test_scores = learning_curve(
            model, X, y, cv=5, scoring='r2', train_sizes=np.linspace(0.1, 1.0, 10), shuffle=True, random_state=42
        )
        # Average the scores across the CV folds
        train_scores_mean = np.mean(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        # Plotting
        fig, ax = plt.subplots(figsize=(9, 4))
        ax.plot(train_sizes, train_scores_mean, 'o-', color="blue", label="Training R²")
        ax.plot(train_sizes, test_scores_mean, 'o-', color="green", label="Validation R²")
        ax.set_title("Learning Curve: Linear Regression")
        ax.set_xlabel("Number of Training Samples")
        ax.set_ylabel("R² Score")
        ax.legend(loc="best")
        ax.grid(True)
        st.pyplot(fig)
        # Interpret results
        st.write("""
- ✅ **Training R² is high initially** (the model can fit a small sample closely even with few examples).
- 📉 **Validation R² improves as training size increases**, then plateaus.
- 🧠 This suggests the model **benefits from more training data**, but after a certain point, **additional data does not significantly improve generalization**.
- 🔍 The **gap between training and validation curves** is relatively small, indicating **no severe overfitting**.
- 📌 **Conclusion**: The current dataset size seems **adequate**, and the model is learning well with the data provided.
""")
with st.expander("📊 Understand Feature Impact: Coefficients of the Linear Regression Model"): | |
importance = model.coef_ | |
features = X.columns | |
fig, ax = plt.subplots(figsize=(9,5)) | |
ax.barh(features, importance, color='skyblue') | |
ax.set_title("Feature Importance (Linear Regression Coefficients)") | |
ax.set_xlabel("Coefficient Value") | |
st.pyplot(fig) | |
st.markdown(""" | |
###### 🔍 Interpretation: | |
- Features with larger **absolute values** have a stronger effect on the predicted house value. | |
- A **positive coefficient** increases the predicted value. | |
- A **negative coefficient** decreases the predicted value. | |
###### 🧠 What it means for decision-makers: | |
- **Median Income** is a strong positive driver — wealthier areas tend to have higher housing values. | |
- **Latitude** has a negative coefficient — northern areas may have lower house prices. | |
- Helps focus strategic decisions on what really influences prices across California. | |
""") | |
with st.expander("🧠 Why Linear Regression Still Matters: Foundation for Deep Learning & Transformers"): | |
st.markdown(""" | |
Linear Regression may look simple, but it's far from trivial — it’s the **first building block** in the ladder to advanced AI models like **Deep Learning** and **Transformers**. | |
###### 📚 Conceptual Foundations: | |
- **Weights & Bias**: The core of linear regression is about learning weights and biases — which is exactly what **every neural network layer** does, just at scale. | |
- **Loss Minimization**: Linear regression minimizes **Mean Squared Error** — a principle used in training neural networks to adjust weights through **backpropagation**. | |
- **Linear Combinations**: Deep learning models, at their core, are just multiple layers of **linear transformations + non-linear activations**. | |
###### 🤖 Connect to Transformers: | |
- Transformer architectures (like GPT, BERT) use **linear projections** in attention mechanisms. | |
- Every layer in these models performs matrix multiplications — which is, again, just advanced **linear algebra and regression-like operations**. | |
###### 🏗️ Strategic Insight: | |
- A solid grasp of linear regression builds the intuition needed to understand more complex systems. | |
- Senior leaders can better evaluate ML and AI project feasibility and interpret outcomes by understanding these **fundamentals that scale**. | |
🔄 *"From Linear Regression to Transformers, it's all about modeling relationships and optimizing parameters — just with different levels of complexity and abstraction."* | |
""") | |

if __name__ == "__main__":
    main()