import streamlit as st
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image, ImageDraw, ImageFont
import torch

# Load model and processor once; cache them across Streamlit reruns
@st.cache_resource
def load_model():
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
    return processor, model

def draw_boxes(image, results, labels):
    """Draw red bounding boxes and confidence labels onto the image."""
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i, 2) for i in box.tolist()]  # [x_min, y_min, x_max, y_max]
        draw.rectangle(box, outline="red", width=2)
        label_text = f"{labels[label.item()]}: {score:.2f}"
        draw.text((box[0], box[1] - 10), label_text, fill="red", font=font)
    return image

def main():
    st.set_page_config(page_title="🔍 Object Detection Demo", layout="centered")
    st.markdown("**🎯 Object Detection using Transformers (DETR)**")
    st.write("Upload an image to detect objects using a pre-trained Transformer model: `facebook/detr-resnet-50`.")

    uploaded_file = st.file_uploader("Upload an Image", type=["jpg", "jpeg", "png"])
    if uploaded_file:
        image = Image.open(uploaded_file).convert("RGB")
        # Create two columns for displaying images
        col1, col2 = st.columns(2)
        with col1:
            st.image(image, caption="Original Image", width=200)
        processor, model = load_model()
        # Preprocess: resize and normalize the image into model-ready tensors
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():  # inference only; no gradients needed
            outputs = model(**inputs)
        # Post-process: PIL's size is (width, height); the model wants (height, width)
        target_sizes = torch.tensor([image.size[::-1]])
        results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
        st.markdown("**📦 Detected Objects**")
        if results["boxes"].shape[0] == 0:
            st.warning("No objects detected with confidence > 90%")
        else:
            labeled_image = image.copy()
            labeled_image = draw_boxes(labeled_image, results, model.config.id2label)
            with col2:
                st.image(labeled_image, caption="Detected Objects", width=200)
            for score, label in zip(results["scores"], results["labels"]):
                st.write(f"- **{model.config.id2label[label.item()]}** → Confidence: `{score:.2f}`")
with st.expander("โน๏ธ What is Pretraining and Which Model Are We Using?"): | |
st.markdown(""" | |
**Pretraining** is like teaching a model some basic skills before asking it to do a specific task. | |
Just like a child first learns shapes, colors, and objects before learning to name or sort them, | |
a pre-trained model has already learned to recognize **general patterns** in thousands or millions of images. | |
๐ In our app, we are using a pre-trained model called: | |
### ๐ `facebook/detr-resnet-50` | |
- **DETR** stands for *DEtection TRansformer*. It's a special kind of deep learning model made by Facebook AI. | |
- It's been **trained on COCO dataset**, which includes 80 common object types like people, cars, dogs, chairs, etc. | |
- Because it's pre-trained, we **donโt need to train it ourselves** โ it already knows how to detect these objects! | |
๐ง So when you upload an image, the model just applies what it has already learned during pretraining to spot things in your image. | |
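
        As a quick sanity check, here is a minimal sketch (separate from this app's UI) of how you could list every category this pretrained checkpoint knows:

        ```python
        from transformers import DetrForObjectDetection

        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

        # id2label maps class indices to COCO category names; a few entries are
        # "N/A" placeholders left over from COCO's original 91-slot label space.
        for idx, name in model.config.id2label.items():
            print(idx, name)
        ```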
""") | |
with st.expander("โน๏ธ What is Object Detection?"): | |
st.markdown(""" | |
Object detection is like playing "I spy with my little eye" โ but using AI! | |
Instead of just saying "there's a dog", object detection can also say **where** the dog is in the image using a **bounding box**. | |
It helps in: | |
- ๐ป Self-driving cars (detecting pedestrians and vehicles) | |
- ๐ธ Security cameras (detecting intrusions) | |
- ๐ฆ Inventory systems (detecting objects on shelves) | |
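
        Concretely, each detection is just a label, a confidence score, and box coordinates. A purely illustrative example (the values below are made up):

        ```python
        # Boxes use the common [x_min, y_min, x_max, y_max] pixel format,
        # i.e. the top-left and bottom-right corners of the rectangle.
        detection = {
            "label": "dog",
            "score": 0.97,
            "box": [34.5, 120.0, 310.2, 480.8],
        }
        ```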
""") | |
with st.expander("โน๏ธ How Does the DETR Model Work?"): | |
st.markdown(""" | |
**DETR** stands for **DEtection TRansformer**, a cutting-edge model developed by Facebook AI Research. It combines **Convolutional Neural Networks (CNNs)** and **Transformers** โ the same architecture used in ChatGPT and BERT โ to detect objects in images. | |
###### ๐๏ธ How is it Different? | |
Most older object detection models work in stages: | |
- First, they **generate regions of interest (ROIs)** (like boxes where something might be). | |
- Then they **classify** what's inside each box (cat, dog, etc.). | |
But DETR skips this multi-step process by using a **Transformer** to directly: | |
- Look at the image | |
- Predict **all objects and their locations at once** (end-to-end) | |
###### โ๏ธ Key Components of DETR: | |
- **CNN Backbone (like ResNet):** Extracts visual features from the image (e.g., edges, textures) | |
- **Transformer Encoder-Decoder:** Understands **global relationships** between features (e.g., where objects are in relation to each other) | |
- **Prediction Heads:** Predicts bounding boxes and labels | |
###### โจ Why is DETR Special? | |
- No need for complicated anchor boxes or region proposals | |
- Handles overlapping or cluttered objects better | |
- Learns in a more "human-like" way โ understanding the **whole scene**, not just pieces | |
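
        To make this concrete, here is a minimal sketch of the raw outputs; the shapes assume this checkpoint's default of 100 object queries (100 candidate detections per image):

        ```python
        import torch
        from PIL import Image
        from transformers import DetrImageProcessor, DetrForObjectDetection

        processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

        image = Image.new("RGB", (640, 480))  # any RGB image works here
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)

        # 100 object queries per image, each with class logits
        # (91 COCO label slots + 1 "no object" class) and one box
        print(outputs.logits.shape)      # torch.Size([1, 100, 92])
        print(outputs.pred_boxes.shape)  # torch.Size([1, 100, 4])
        ```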

        ###### 📦 Pretrained Model in This App:
        In this app, we're using **`facebook/detr-resnet-50`**, a model trained on the **COCO dataset** (Common Objects in Context) with:
        - 80 object categories (like person, car, bottle, chair)
        - Over 100,000 training images

        It can detect things like:
        """)
        col1, col2, col3, col4 = st.columns(4)
        with col1:
            st.write("🐶 Dogs")
            st.write("🐱 Cats")
            st.write("🧍 People")
            st.write("🚗 Cars")
            st.write("🚌 Buses")
            st.write("🚴 Bicycles")
            st.write("🏍️ Motorcycles")
        with col2:
            st.write("✈️ Airplanes")
            st.write("🚤 Boats")
            st.write("🪑 Chairs")
            st.write("🛏️ Beds")
            st.write("📺 TVs")
            st.write("📱 Cell Phones")
            st.write("💻 Laptops")
        with col3:
            st.write("🍎 Apples")
            st.write("🍌 Bananas")
            st.write("🍕 Pizzas")
            st.write("☕ Cups")
            st.write("🍽️ Dining Tables")
            st.write("🛋️ Couches")
            st.write("🧴 Bottles")
        with col4:
            st.write("👜 Handbags")
            st.write("🧳 Suitcases")
            st.write("☂️ Umbrellas")
            st.write("🕰️ Clocks")
            st.write("🚦 Traffic Lights")
            st.write("🛑 Stop Signs")
            st.write("🐄 Cows")
with st.expander("๐ Real-World Use Cases of Object Detection"): | |
st.markdown(""" | |
Object detection models like DETR are widely used in many industries. Here are some practical examples: | |
๐ **Security & Surveillance** | |
- Detecting people in restricted zones | |
- Identifying abandoned objects in public places | |
๐ฅ **Healthcare** | |
- Analyzing X-rays and MRI scans to detect tumors or anomalies | |
- Assisting doctors in surgical planning | |
๐ **Autonomous Vehicles** | |
- Identifying pedestrians, vehicles, traffic lights, and road signs in real-time | |
๐๏ธ **Retail** | |
- Automated checkout systems (e.g., Amazon Go) | |
- Shelf inventory monitoring using cameras | |
๐๏ธ **Construction & Safety** | |
- Monitoring helmet usage and safety compliance on sites | |
- Tracking equipment and workers | |
๐ธ **Aerial & Drone Imagery** | |
- Detecting objects (cars, animals, buildings) from satellite or drone images | |
๐ฑ **Mobile Applications** | |
- Real-time AR object tagging (e.g., identifying products in camera view) | |
๐ฎ **Gaming & Sports** | |
- Player and object tracking in sports analytics | |
- Enhanced real-time visuals in AR/VR environments | |
""") | |
with st.expander("๐ Categories vs Real-World Use Cases"): | |
st.markdown(""" | |
###### ๐ฏ What DETR Can Detect (Pretrained Model) | |
The base DETR model (`facebook/detr-resnet-50`) is trained on the **COCO dataset**, which includes 91 common object categories, such as: | |
- ๐ง Person ๐ Car ๐ Bus ๐๏ธ Motorcycle | |
- ๐ถ Dog ๐ฑ Cat ๐ Cow | |
- ๐ Apple ๐ Banana | |
- ๐๏ธ Sofa ๐ช Chair ๐๏ธ Bed | |
- ๐บ TV ๐ฅ๏ธ Laptop ๐ท Camera | |
- ๐ฆ Bird ๐ Fish | |
This is great for **general object detection**, but there are some gaps when it comes to real-world applications. | |
--- | |
###### โ What It Misses for Real-World Use Cases | |
In specialized or industrial domains, we often need to detect: | |
###### ๐ฅ **Medical Imaging** | |
- Tumors, organs (lungs, liver), anomalies | |
> โ ๏ธ COCO doesnโt have these. | |
###### ๐ก๏ธ **Security/Surveillance** | |
- Weapons, intrusions, suspicious behavior | |
> Not covered in COCO. | |
###### ๐ญ **Manufacturing** | |
- Machine parts, tools, defects | |
> Also outside COCOโs categories. | |
###### ๐งช **Scientific Research** | |
- Cells, molecules, lab equipment | |
###### ๐ฆ **Retail** | |
- Brands, barcodes, product layouts | |
--- | |
###### โ Solutions & Next Steps | |
- **`Fine-tune DETR`** on custom datasets for your domain. | |
- Use **domain-specific pretrained models** (e.g., BioMed DETR, retail YOLOs). | |
- Try other models like **SAM**, **DINOv2**, or **GroundingDINO** for more advanced segmentation or open-vocabulary detection. | |
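
        As an illustration of the last option, here is a hedged sketch of open-vocabulary detection with GroundingDINO through the `transformers` library. The checkpoint name and prompt are just examples, and the post-processing helper's signature has changed across `transformers` versions, so check the current docs:

        ```python
        import torch
        from PIL import Image
        from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

        checkpoint = "IDEA-Research/grounding-dino-tiny"  # one public checkpoint
        processor = AutoProcessor.from_pretrained(checkpoint)
        model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)

        image = Image.open("shelf.jpg")  # hypothetical image path
        text = "a barcode. a shampoo bottle."  # free-text classes COCO never saw

        inputs = processor(images=image, text=text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Post-processing mirrors the DETR flow above; see the transformers docs
        # for post_process_grounded_object_detection and its current arguments.
        ```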
""") | |
with st.expander("๐ ๏ธ Fine-Tuning an Object Detection Model: Dos, Donโts & Real Effort"): | |
st.markdown(""" | |
Fine-tuning is the process of **adapting a pretrained model** (like DETR) to recognize **new or custom objects** from your own dataset. | |
--- | |
###### โ What You *Should* Do | |
- **Start with a pretrained model** | |
Saves time and works well with smaller datasets. | |
- **Prepare a clean, labeled dataset** โ ๏ธ *Labor-Intensive* | |
You'll need hundreds or thousands of images **manually annotated with bounding boxes**. | |
Tools: `LabelImg`, `Roboflow`, `CVAT`. | |
- **Use transfer learning wisely** | |
Freeze base layers and train higher layers first to avoid overfitting. | |
- **Train in small batches initially** | |
Helps you catch issues early (e.g., wrong labels, overfitting). | |
- **Use data augmentation** | |
Automatically increases variation using flips, crops, brightness, etc. | |
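
        A minimal augmentation sketch using `torchvision` (note these image-only transforms do not move bounding boxes; for detection, use a library such as `albumentations` that transforms images and boxes together):

        ```python
        import torchvision.transforms as T

        # Cheap extra variation with no additional labeling effort
        augment = T.Compose([
            T.RandomHorizontalFlip(p=0.5),
            T.ColorJitter(brightness=0.2, contrast=0.2),
        ])
        ```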

        ---

        ###### ❌ What You *Shouldn't* Do
        - **Don't use a high learning rate**
          It might "unlearn" everything from pretraining.
        - **Don't train from scratch** unless you have a huge dataset
          Pretrained models save compute and training time.
        - **Don't skip validation**
          Always use a validation set to evaluate generalization.
        - **Don't mismatch annotation formats**
          Your model might expect COCO-style annotations; ensure consistency.

        ---

        ###### 🧪 When Should You Fine-Tune?
        - You're detecting **custom objects** (e.g., tools, animals, X-ray anomalies).
        - Your domain is **very different** (e.g., drone footage, medical imaging).
        - You want **higher accuracy** for specific categories.

        ---

        ###### 🔍 Most Labor-Intensive Sub-Activity
        **✍️ Annotating the dataset** (images + bounding boxes + class labels)
        This step takes **significant manual effort** and often requires **domain experts** (e.g., doctors for medical images, engineers for defect detection).

        💡 *Tip:* Use small datasets to experiment, and crowdsource or semi-automate annotation for larger ones.
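
        To show what the setup looks like, here is a minimal sketch of re-heading DETR for a hypothetical three-class safety dataset (the class names are made up for illustration):

        ```python
        from transformers import DetrForObjectDetection

        # Hypothetical custom classes
        id2label = {0: "helmet", 1: "vest", 2: "person"}

        model = DetrForObjectDetection.from_pretrained(
            "facebook/detr-resnet-50",
            num_labels=len(id2label),
            id2label=id2label,
            label2id={v: k for k, v in id2label.items()},
            ignore_mismatched_sizes=True,  # swap the 91-slot COCO head for a fresh one
        )

        # Freeze the CNN backbone first (as suggested above), then train the
        # transformer and prediction heads on your annotated dataset.
        for p in model.model.backbone.parameters():
            p.requires_grad = False
        ```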
""") | |

        image_url = "https://raw.githubusercontent.com/gridflowai/gridflowAI-datasets-icons/7ec17a8e039d53a1dac09d22270251e318649457/AI-icons-images/image_bounding_boxes.png"
        st.image(image_url, caption="Objects manually annotated with bounding boxes", width=300)

if __name__ == "__main__":
    main()