import streamlit as st
# Set page configuration
st.set_page_config(page_title="Decision Tree Theory", layout="wide")
# CSS styling
st.markdown("""
<style>
.stApp {
background-color: #f2f6fa;
}
h1, h2, h3 {
color: #1a237e;
}
.custom-font, p, li {
font-family: 'Arial', sans-serif;
font-size: 18px;
color: #212121;
line-height: 1.6;
}
</style>
""", unsafe_allow_html=True)
# Title
st.markdown("<h1>Decision Tree</h1>", unsafe_allow_html=True)
# Introduction
st.markdown("""
A **Decision Tree** is a supervised learning method used for both classification and regression. It models decisions in a tree structure, where:
- The **Root Node** represents the full dataset.
- **Internal Nodes** evaluate features to split the data.
- **Leaf Nodes** give the output label or value.

It's like asking a series of "yes or no" questions to reach a final decision.
""", unsafe_allow_html=True)
# Entropy
st.markdown("<h2>Entropy: Quantifying Disorder</h2>", unsafe_allow_html=True)
st.markdown("""
**Entropy** measures the randomness, or impurity, of the class labels in a dataset: the more mixed the classes, the higher the entropy.
The formula for entropy is:
""")
st.image("entropy-formula-2.jpg", width=300)
st.markdown(r"""
If you have two classes (Yes/No), each with a 50% chance:
$$ H(Y) = - (0.5 \cdot \log_2(0.5) + 0.5 \cdot \log_2(0.5)) = 1 $$
This is the maximum possible uncertainty for two classes.
""", unsafe_allow_html=True)
# Gini Impurity
st.markdown("<h2>Gini Impurity: Measuring Purity</h2>", unsafe_allow_html=True)
st.markdown("""
**Gini Impurity** is another metric: it measures how often a randomly chosen element would be misclassified if it were labeled at random according to the class distribution.
The formula is:
""")
st.image("gini.png", width=300)
st.markdown("""
With 50% Yes and 50% No:
$$ Gini(Y) = 1 - (0.5^2 + 0.5^2) = 0.5 $$
A lower Gini value means a purer node (0.5 is the maximum for two classes).
""", unsafe_allow_html=True)
# Construction of Decision Tree
st.markdown("<h2>How a Decision Tree is Built</h2>", unsafe_allow_html=True)
st.markdown("""
The tree is grown top-down and greedily: at each step it picks the feature (and threshold) whose split gives the largest impurity reduction, as in the sketch below. Growth stops when:
- All samples in a node belong to one class.
- A stopping condition, such as a maximum depth, is reached.
""", unsafe_allow_html=True)
# Iris Dataset
st.markdown("<h2>Iris Dataset Example</h2>", unsafe_allow_html=True)
st.markdown("""
This tree is trained on the famous **Iris dataset**, where features like petal length help classify the flower species.
""", unsafe_allow_html=True)
st.image("dt1 (1).jpg", caption="Decision Tree for Iris Dataset", use_container_width=True)
# Training & Testing - Classification
st.markdown("<h2>Training & Testing: Classification</h2>", unsafe_allow_html=True)
st.markdown("""
- During **training**, the model learns rules from labeled data using Gini or Entropy.
- In the **testing phase**, new samples are passed through the tree to make predictions.
Example: Predict Iris species based on its features.
""", unsafe_allow_html=True)
# Training & Testing - Regression
st.markdown("<h2>Training & Testing: Regression</h2>", unsafe_allow_html=True)
st.markdown("""
- For regression, the tree splits data to reduce **Mean Squared Error (MSE)**.
- Each leaf node predicts a continuous value (e.g., house price).
Example: Predicting house prices based on area, number of rooms, etc.
""", unsafe_allow_html=True)
# Pre-Pruning
st.markdown("<h2>Controlling Overfitting: Pre-Pruning</h2>", unsafe_allow_html=True)
st.markdown("""
**Pre-pruning** stops the tree from growing too large.
Techniques:
- **Max Depth**: Limits how deep the tree can go.
- **Min Samples Split**: Minimum data points needed to split a node.
- **Min Samples Leaf**: Minimum data points required in a leaf.
- **Max Features**: Restricts number of features used per split.
""", unsafe_allow_html=True)
# Post-Pruning
st.markdown("<h2>Post-Pruning: Simplifying After Training</h2>", unsafe_allow_html=True)
st.markdown("""
**Post-pruning** trims the tree **after** full training to reduce complexity.
Methods:
- **Cost Complexity Pruning**: removes branches whose impurity reduction does not justify the added complexity.
- **Validation Set Pruning**: removes branches that do not improve accuracy on held-out data.
""", unsafe_allow_html=True)
# Feature Selection
st.markdown("<h2>Feature Selection with Trees</h2>", unsafe_allow_html=True)
st.markdown("""
Decision Trees can rank features by how much they reduce impurity at each split.
Here's the formula used:
""")
st.image("feature.png", width=500)
st.markdown("""
The higher the score, the more important the feature.
""", unsafe_allow_html=True)
# Implementation Link
st.markdown("<h2>Try It Yourself</h2>", unsafe_allow_html=True)
st.markdown(
"<a href='https://colab.research.google.com/drive/1SqZ5I5h7ivS6SJDwlOZQ-V4IAOg90RE7?usp=sharing' target='_blank' style='font-size: 16px; color: #1a237e;'>Open Jupyter Notebook</a>",
unsafe_allow_html=True
)