|
import streamlit as st |
|
|
|
|
|
st.set_page_config(page_title="Decision Tree Theory", layout="wide") |
|
|
|
|
|
st.markdown(""" |
|
<style> |
|
.stApp { |
|
background-color: #f2f6fa; |
|
} |
|
h1, h2, h3 { |
|
color: #1a237e; |
|
} |
|
.custom-font, p, li { |
|
font-family: 'Arial', sans-serif; |
|
font-size: 18px; |
|
color: #212121; |
|
line-height: 1.6; |
|
} |
|
</style> |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h1>Decision Tree</h1>", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown(""" |
|
A **Decision Tree** is a supervised learning method used for both classification and regression. It models decisions in a tree structure, where: |
|
- The **Root Node** represents the full dataset. |
|
- **Internal Nodes** evaluate features to split the data. |
|
- **Leaf Nodes** give the output label or value. |
|
|
|
It's like asking a series of "yes or no" questions to reach a final decision. |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Entropy: Quantifying Disorder</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
**Entropy** helps measure randomness or impurity in data. |
|
|
|
The formula for entropy is: |
|
""") |
|
st.image("entropy-formula-2.jpg", width=300) |
|
st.markdown(""" |
|
If you have two classes (Yes/No) each with a 50% chance: |
|
|
|
$$ H(Y) = - (0.5 \cdot \log_2(0.5) + 0.5 \cdot \log_2(0.5)) = 1 $$ |
|
|
|
This means maximum uncertainty. |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Gini Impurity: Measuring Purity</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
**Gini Impurity** is another metric: it measures how often a randomly chosen element would be misclassified if it were labeled randomly according to the node's class distribution.
|
|
|
The formula is: |
|
""") |
|
st.image("gini.png", width=300) |
|
st.markdown(""" |
|
With 50% Yes and 50% No: |
|
|
|
$$ Gini(Y) = 1 - (0.5^2 + 0.5^2) = 0.5 $$ |
|
|
|
A lower Gini value means a purer node: 0 indicates that all samples belong to a single class.
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>How a Decision Tree is Built</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
The tree grows top-down, greedily choosing at each step the feature and threshold that best split the data (highest information gain, or equivalently the largest impurity reduction). The process ends when:
|
- All samples in a node are of one class. |
|
- A stopping condition like max depth is reached. |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Iris Dataset Example</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
This tree is trained on the famous **Iris dataset**, where features like petal length help classify the flower species. |
|
""", unsafe_allow_html=True) |
|
st.image("dt1 (1).jpg", caption="Decision Tree for Iris Dataset", use_container_width=True) |
|
|
|
|
|
st.markdown("<h2>Training & Testing: Classification</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
- During **training**, the model learns split rules from labeled data using an impurity criterion such as Gini or entropy.
|
- In the **testing phase**, new samples are passed through the tree to make predictions. |
|
|
|
Example: Predict Iris species based on its features. |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Training & Testing: Regression</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
- For regression, the tree splits data to reduce **Mean Squared Error (MSE)**. |
|
- Each leaf node predicts a continuous value (e.g., house price). |
|
|
|
Example: Predicting house prices based on area, number of rooms, etc. |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Controlling Overfitting: Pre-Pruning</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
**Pre-pruning** stops the tree from growing too large. |
|
|
|
Techniques: |
|
- **Max Depth**: Limits how deep the tree can go. |
|
- **Min Samples Split**: Minimum data points needed to split a node. |
|
- **Min Samples Leaf**: Minimum data points required in a leaf. |
|
- **Max Features**: Restricts number of features used per split. |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Post-Pruning: Simplifying After Training</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
**Post-pruning** trims the tree **after** full training to reduce complexity. |
|
|
|
Methods: |
|
- **Cost Complexity Pruning** |
|
- **Validation Set Pruning** |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Feature Selection with Trees</h2>", unsafe_allow_html=True) |
|
st.markdown(""" |
|
Decision Trees can rank features by how much they reduce impurity at each split. |
|
|
|
Here's the formula used: |
|
""") |
|
st.image("feature.png", width=500) |
|
st.markdown(""" |
|
The higher the score, the more important the feature. |
|
""", unsafe_allow_html=True) |
|
|
|
|
|
st.markdown("<h2>Try It Yourself</h2>", unsafe_allow_html=True) |
|
st.markdown( |
|
"<a href='https://colab.research.google.com/drive/1SqZ5I5h7ivS6SJDwlOZQ-V4IAOg90RE7?usp=sharing' target='_blank' style='font-size: 16px; color: #1a237e;'>Open Jupyter Notebook</a>", |
|
unsafe_allow_html=True |
|
) |
|
|