Object Detection
Object detection goes beyond classification: it identifies what objects are in an image and where they are located, outputting bounding boxes with class labels.
Object Detection Overview
While image classification assigns a single label to an entire image, object detection identifies and localizes multiple objects within an image. Each detection consists of:
- Bounding box: Coordinates (x, y, width, height) defining the object's location
- Class label: What the object is (person, car, dog, etc.)
- Confidence score: How confident the model is in the detection
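The three components above can be captured in a simple record type. This is a minimal sketch (the `Detection` class and its field names are illustrative, not part of any library):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detection: a box in (x, y, width, height) pixel coordinates,
    plus a class label and a confidence score."""
    x: float
    y: float
    width: float
    height: float
    label: str         # class label, e.g. "person"
    confidence: float  # model confidence in [0, 1]

d = Detection(x=10, y=20, width=50, height=80, label="dog", confidence=0.92)
```

Detection libraries differ in box convention (corner `(x1, y1, x2, y2)` vs. center `(cx, cy, w, h)`), so always check which format a given API returns.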
Sliding Window Approach
The earliest approach to detection: slide a fixed-size window across the image at multiple scales, running a classifier on each window. This is extremely slow and has been replaced by modern methods, but understanding it helps appreciate why newer approaches were needed.
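The idea can be sketched in a few lines. This is a toy illustration, not a practical detector: `classify` is a stand-in for any patch classifier, and a real system would also repeat the scan over resized copies of the image to cover multiple scales:

```python
import numpy as np

def sliding_window(image, window=(64, 64), stride=32):
    """Yield (x, y, patch) for every window position in the image."""
    h, w = image.shape[:2]
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            yield x, y, image[y:y + window[1], x:x + window[0]]

def detect(image, classify, threshold=0.5):
    """Run the classifier on every window; keep confident hits."""
    hits = []
    for x, y, patch in sliding_window(image):
        label, score = classify(patch)  # classify is a stand-in function
        if score >= threshold:
            hits.append((x, y, label, score))
    return hits
```

Even this single-scale scan runs the classifier hundreds of times per image, which is exactly the cost that region proposals and single-shot detectors were designed to avoid.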
R-CNN Family
| Model | Year | Approach | Speed |
|---|---|---|---|
| R-CNN | 2014 | Selective Search generates ~2000 region proposals, each processed by a CNN separately | ~47 seconds/image |
| Fast R-CNN | 2015 | Single CNN pass for entire image, ROI pooling extracts features for each proposal | ~2 seconds/image |
| Faster R-CNN | 2015 | Region Proposal Network (RPN) replaces Selective Search, fully neural end-to-end | ~0.2 seconds/image |
YOLO (You Only Look Once)
YOLO revolutionized object detection by treating it as a single regression problem. Instead of proposing regions and classifying them separately, YOLO divides the image into a grid and predicts bounding boxes and class probabilities for each cell in a single forward pass.
| Version | Key Innovation |
|---|---|
| YOLOv5 | PyTorch implementation, easy to use, strong community, multiple model sizes (n/s/m/l/x) |
| YOLOv8 | By Ultralytics. State-of-the-art accuracy, anchor-free detection, unified API for detection/segmentation/classification |
```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 model
model = YOLO("yolov8n.pt")  # nano model (fastest)

# Run detection on an image
results = model("street.jpg")

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0]
        print(f"{model.names[cls]}: {conf:.2f} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")

# Save annotated image
results[0].save("output.jpg")
```
SSD (Single Shot Detector)
SSD detects objects at multiple scales by using feature maps from different layers of a CNN. It is faster than Faster R-CNN while maintaining competitive accuracy, making it suitable for real-time applications.
Key Concepts
Anchor Boxes
Predefined bounding box shapes at each grid cell. The model predicts offsets from these anchors rather than absolute coordinates, making training more stable. Different anchors capture objects of different aspect ratios (tall, wide, square).
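The offset parameterization can be made concrete. The sketch below assumes the widely used Faster R-CNN / SSD encoding, where center offsets are fractions of the anchor size and width/height offsets are log-scale factors (the function name and tuple formats are illustrative):

```python
import math

def decode(anchor, pred):
    """Decode predicted offsets (tx, ty, tw, th) relative to an anchor
    (cx, cy, w, h) into an absolute box in center format."""
    cx_a, cy_a, w_a, h_a = anchor
    tx, ty, tw, th = pred
    cx = cx_a + tx * w_a    # shift center by a fraction of anchor size
    cy = cy_a + ty * h_a
    w = w_a * math.exp(tw)  # scale width/height multiplicatively
    h = h_a * math.exp(th)
    return cx, cy, w, h
```

Because the network only predicts small corrections, zero output already yields the anchor itself, which is what makes training with anchors more stable than regressing absolute pixel coordinates.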
Non-Maximum Suppression (NMS)
When a model produces multiple overlapping detections for the same object, NMS keeps only the highest-confidence box and removes boxes that overlap significantly (above an IoU threshold).
IoU (Intersection over Union)
The standard metric for measuring bounding box overlap: IoU = Area of Overlap / Area of Union. A predicted box is typically counted as a correct detection when its IoU with the ground-truth box is at least 0.5.
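The two concepts fit together naturally: greedy NMS repeatedly keeps the highest-scoring box and uses IoU to decide which remaining boxes overlap it enough to discard. A minimal sketch, using corner-format `(x1, y1, x2, y2)` boxes:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop overlapping lower-scoring ones.
    Returns the indices of the boxes that survive."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Production detectors use batched, vectorized implementations (e.g. `torchvision.ops.nms`), but the logic is the same.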
Transfer Learning for Detection
Instead of training from scratch, start with a model pretrained on a large dataset (like COCO with 80 object classes) and fine-tune it on your specific objects:
```python
from ultralytics import YOLO

# Load pretrained model
model = YOLO("yolov8n.pt")

# Fine-tune on custom dataset
results = model.train(
    data="custom_dataset.yaml",  # Dataset config
    epochs=50,
    imgsz=640,
    batch=16,
)

# Evaluate on validation set
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
```