import streamlit as st
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image, ImageDraw, ImageFont
import torch


# Load the model and processor once; Streamlit caches them across reruns
@st.cache_resource
def load_model():
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
    return processor, model


def draw_boxes(image, results, labels):
    # Draw each detected box and its "label: score" caption onto the image
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i, 2) for i in box.tolist()]
        draw.rectangle(box, outline="red", width=2)
        label_text = f"{labels[label.item()]}: {score:.2f}"
        # Clamp the caption so it stays visible for boxes touching the top edge
        draw.text((box[0], max(box[1] - 10, 0)), label_text, fill="red", font=font)
    return image


def main():
    st.set_page_config(page_title="Object Detection Demo", layout="centered")
    st.markdown("**🎯 Object Detection using Transformers (DETR)**")
    st.write("Upload an image to detect objects using a pre-trained Transformer model: `facebook/detr-resnet-50`.")
    uploaded_file = st.file_uploader("Upload an Image", type=["jpg", "jpeg", "png"])

    if uploaded_file:
        image = Image.open(uploaded_file).convert("RGB")

        # Create two columns for displaying images
        col1, col2 = st.columns(2)
        with col1:
            st.image(image, caption="Original Image", width=200)

        processor, model = load_model()

        # Preprocess, then run inference without tracking gradients
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)

        # Post-process: target_sizes expects (height, width); PIL's size is (width, height)
        target_sizes = torch.tensor([image.size[::-1]])
        results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

        st.markdown("**📦 Detected Objects**")
        if results["boxes"].shape[0] == 0:
            st.warning("No objects detected with confidence > 90%")
        else:
            labeled_image = image.copy()
            labeled_image = draw_boxes(labeled_image, results, model.config.id2label)
            with col2:
                st.image(labeled_image, caption="Detected Objects", width=200)
            for score, label in zip(results["scores"], results["labels"]):
                st.write(f"- **{model.config.id2label[label.item()]}** - Confidence: `{score:.2f}`")

    with st.expander("ℹ️ What is Pretraining and Which Model Are We Using?"):
        st.markdown("""
**Pretraining** is like teaching a model some basic skills before asking it to do a specific task.
Just like a child first learns shapes, colors, and objects before learning to name or sort them,
a pre-trained model has already learned to recognize **general patterns** in thousands or millions of images.

In our app, we are using a pre-trained model called:

### `facebook/detr-resnet-50`

- **DETR** stands for *DEtection TRansformer*. It's a deep learning model developed by Facebook AI.
- It has been **trained on the COCO dataset**, which includes 80 common object types such as people, cars, dogs, and chairs.
- Because it's pre-trained, we **don't need to train it ourselves**: it already knows how to detect these objects!

🧠 So when you upload an image, the model just applies what it has already learned during pretraining to spot things in your image.
""")

    with st.expander("ℹ️ What is Object Detection?"):
        st.markdown("""
Object detection is like playing "I spy with my little eye", but using AI!
Instead of just saying "there's a dog", object detection can also say **where** the dog is in the image using a **bounding box**.
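
As a rough illustration of what a detection looks like in code, each one can be thought of as a label, a confidence score, and a bounding box. The sketch below uses made-up detections and a hypothetical helper, `keep_confident`, to mimic how this app keeps only results scoring above its 0.9 threshold; it is not the model's actual post-processing code.

```python
# Illustrative detections (invented values, not real model output).
# Each box is [x_min, y_min, x_max, y_max] in pixels.
detections = [
    {"label": "dog", "score": 0.98, "box": [34.0, 50.0, 210.0, 300.0]},
    {"label": "cat", "score": 0.42, "box": [220.0, 80.0, 330.0, 260.0]},
]


def keep_confident(detections, threshold=0.9):
    # Keep only detections whose confidence is above the threshold
    return [d for d in detections if d["score"] > threshold]


for d in keep_confident(detections):
    print(d["label"], d["score"], d["box"])
```
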

It helps in:

- 🛻 Self-driving cars (detecting pedestrians and vehicles)
- 📸 Security cameras (detecting intrusions)
- 📦 Inventory systems (detecting objects on shelves)
""")

    with st.expander("ℹ️ How Does the DETR Model Work?"):
        st.markdown("""
**DETR** stands for **DEtection TRansformer**, a model developed by Facebook AI Research. It combines **Convolutional Neural Networks (CNNs)** and **Transformers**, the architecture family behind ChatGPT and BERT, to detect objects in images.

###### How is it Different?

Most older object detection models work in stages:
- First, they **generate regions of interest (ROIs)** (candidate boxes where something might be).
- Then they **classify** what's inside each box (cat, dog, etc.).

DETR skips this multi-step process by using a **Transformer** to directly:
- Look at the image
- Predict **all objects and their locations at once** (end-to-end)

###### ⚙️ Key Components of DETR:
- **CNN Backbone (like ResNet):** Extracts visual features from the image (e.g., edges, textures)
- **Transformer Encoder-Decoder:** Models **global relationships** between features (e.g., where objects are in relation to each other)
- **Prediction Heads:** Predict bounding boxes and labels
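
The prediction heads output each box as normalized `(center_x, center_y, width, height)` values between 0 and 1, and post-processing rescales them to pixel corners for the uploaded image. Below is a minimal pure-Python sketch of that conversion for a single box. The helper name `to_pixel_corners` is hypothetical and this is an illustration, not the `transformers` implementation.

```python
def to_pixel_corners(box, image_width, image_height):
    # box is (center_x, center_y, width, height), each normalized to [0, 1]
    cx, cy, w, h = box
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    return [x_min, y_min, x_max, y_max]


# A box centered in a 640x480 image, half as wide and a quarter as tall
print(to_pixel_corners((0.5, 0.5, 0.5, 0.25), 640, 480))
```
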

###### ✨ Why is DETR Special?
- No need for hand-designed anchor boxes or region proposals
- Reasons about the whole image at once, which helps with overlapping or cluttered objects
- Learns in a more "human-like" way, understanding the **whole scene**, not just pieces

###### 📦 Pretrained Model in This App:
In this app, we're using **`facebook/detr-resnet-50`**, a model trained on the **COCO dataset** (Common Objects in Context) with:
- 80 object categories (like person, car, bottle, chair)
- Over 100,000 images for training

It can detect things like:
""")
        # Sample COCO categories, shown in four columns
        col1, col2, col3, col4 = st.columns(4)
        with col1:
            st.write("🐶 Dogs")
            st.write("🐱 Cats")
            st.write("🧍 People")
            st.write("🚗 Cars")
            st.write("🚌 Buses")
            st.write("🚴 Bicycles")
            st.write("🏍️ Motorcycles")
        with col2:
            st.write("✈️ Airplanes")
            st.write("🚤 Boats")
            st.write("🪑 Chairs")
            st.write("🛏️ Beds")
            st.write("🖥️ TVs / Monitors")
            st.write("📱 Cell Phones")
            st.write("💻 Laptops")
        with col3:
            st.write("🍎 Apples")
            st.write("🍌 Bananas")
            st.write("🍕 Pizzas")
            st.write("☕ Cups")
            st.write("🍽️ Dining Tables")
            st.write("🛋️ Couches")
            st.write("🧴 Bottles")
        with col4:
            st.write("👜 Handbags")
            st.write("🧳 Suitcases")
            st.write("🎒 Backpacks")
            st.write("🕒 Clocks")
            st.write("🚦 Traffic Lights")
            st.write("🛑 Stop Signs")
            st.write("🐄 Cows")

    with st.expander("Real-World Use Cases of Object Detection"):
        st.markdown("""
Object detection models like DETR are widely used in many industries. Here are some practical examples:

**Security & Surveillance**
- Detecting people in restricted zones
- Identifying abandoned objects in public places

🏥 **Healthcare**
- Analyzing X-rays and MRI scans to detect tumors or anomalies
- Assisting doctors in surgical planning

🚗 **Autonomous Vehicles**
- Identifying pedestrians, vehicles, traffic lights, and road signs in real time

🛍️ **Retail**
- Automated checkout systems (e.g., Amazon Go)
- Shelf inventory monitoring using cameras

🏗️ **Construction & Safety**
- Monitoring helmet usage and safety compliance on sites
- Tracking equipment and workers

📸 **Aerial & Drone Imagery**
- Detecting objects (cars, animals, buildings) from satellite or drone images

📱 **Mobile Applications**
- Real-time AR object tagging (e.g., identifying products in camera view)

🎮 **Gaming & Sports**
- Player and object tracking in sports analytics
- Enhanced real-time visuals in AR/VR environments
""")

    with st.expander("Categories vs Real-World Use Cases"):
        st.markdown("""
###### 🎯 What DETR Can Detect (Pretrained Model)
The base DETR model (`facebook/detr-resnet-50`) is trained on the **COCO dataset**, which includes 80 common object categories, such as:
- 🧍 Person 🚗 Car 🚌 Bus 🏍️ Motorcycle
- 🐶 Dog 🐱 Cat 🐄 Cow
- 🍎 Apple 🍌 Banana
- 🛋️ Couch 🪑 Chair 🛏️ Bed
- 📺 TV 💻 Laptop 📱 Cell phone
- 🐦 Bird 🐴 Horse

This is great for **general object detection**, but there are some gaps when it comes to real-world applications.

---

###### ❌ What It Misses for Real-World Use Cases
In specialized or industrial domains, we often need to detect:

###### 🏥 **Medical Imaging**
- Tumors, organs (lungs, liver), anomalies
> ⚠️ COCO doesn't have these.

###### 🛡️ **Security/Surveillance**
- Weapons, intrusions, suspicious behavior
> Not covered in COCO.

###### 🏭 **Manufacturing**
- Machine parts, tools, defects
> Also outside COCO's categories.

###### 🧪 **Scientific Research**
- Cells, molecules, lab equipment

###### 📦 **Retail**
- Brands, barcodes, product layouts

---

###### ✅ Solutions & Next Steps
- **Fine-tune DETR** on custom datasets for your domain.
- Use **domain-specific pretrained models** where they exist (for example, detectors already trained on medical or retail imagery).
- Try other models like **SAM** (promptable segmentation), **DINOv2** (general-purpose visual features), or **GroundingDINO** (open-vocabulary detection).
""")

    with st.expander("🛠️ Fine-Tuning an Object Detection Model: Dos, Don'ts & Real Effort"):
        st.markdown("""
Fine-tuning is the process of **adapting a pretrained model** (like DETR) to recognize **new or custom objects** from your own dataset.

---

###### ✅ What You *Should* Do
- **Start with a pretrained model**
  Saves time and works well with smaller datasets.
- **Prepare a clean, labeled dataset** ⚠️ *Labor-Intensive*
  You'll need hundreds or thousands of images **manually annotated with bounding boxes**.
  Tools: `LabelImg`, `Roboflow`, `CVAT`.
- **Use transfer learning wisely**
  Freeze the base layers and train the higher layers first to avoid overfitting.
- **Start with short runs on a small subset**
  Helps you catch issues early (e.g., wrong labels, overfitting).
- **Use data augmentation**
  Automatically increases variation using flips, crops, brightness changes, etc.
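
When an augmentation moves pixels, the bounding boxes have to move with them. Here is a minimal sketch for a horizontal flip, assuming boxes are stored as `[x_min, y_min, x_max, y_max]` in pixels. The helper `hflip_box` is hypothetical, not a library function.

```python
def hflip_box(box, image_width):
    # box is [x_min, y_min, x_max, y_max] in pixels
    x_min, y_min, x_max, y_max = box
    # Mirror the x-coordinates; the old right edge becomes the new left edge
    return [image_width - x_max, y_min, image_width - x_min, y_max]


flipped = hflip_box([10, 20, 110, 220], image_width=640)
print(flipped)
# Flipping twice should give back the original box
print(hflip_box(flipped, image_width=640))
```
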

---

###### ❌ What You *Shouldn't* Do
- **Don't use a high learning rate**
  It can overwrite what the model learned during pretraining.
- **Don't train from scratch** unless you have a huge dataset
  Pretrained models save compute and training time.
- **Don't skip validation**
  Always use a validation set to evaluate generalization.
- **Don't mismatch annotation formats**
  Your model might expect COCO-style annotations; make sure your dataset matches.
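
As one example of keeping formats consistent: COCO-style annotations store a box as `[x_min, y_min, width, height]`, while many tools export corner coordinates. Below is a small sketch of the conversion, assuming corner-format input. The helper `corners_to_coco` and the sample ID values are hypothetical.

```python
def corners_to_coco(box):
    # Convert [x_min, y_min, x_max, y_max] to COCO's [x_min, y_min, width, height]
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


bbox = corners_to_coco([10, 20, 110, 220])

# A COCO-style annotation record (the id values here are placeholders)
annotation = {
    "image_id": 1,
    "category_id": 3,
    "bbox": bbox,
    "area": bbox[2] * bbox[3],
    "iscrowd": 0,
}
print(annotation)
```
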

---

###### 🧪 When Should You Fine-Tune?
- You're detecting **custom objects** (e.g., tools, animals, X-ray anomalies).
- Your domain is **very different** (e.g., drones, medical imaging).
- You want **higher accuracy** for specific categories.

---

###### Most Labor-Intensive Sub-Activity
**✍️ Annotating the dataset** (images + bounding boxes + class labels)

This step takes **significant manual effort** and often requires **domain experts** (e.g., doctors for medical images, engineers for defect detection).

💡 *Tip:* Use small datasets to experiment, and crowdsource or semi-automate annotation for larger ones.
""")
        image_url = "https://raw.githubusercontent.com/gridflowai/gridflowAI-datasets-icons/7ec17a8e039d53a1dac09d22270251e318649457/AI-icons-images/image_bounding_boxes.png"
        st.image(image_url, caption="Objects manually annotated with bounding boxes", width=300)


if __name__ == "__main__":
    main()