Object Detection
Object detection goes beyond classification: it identifies what objects are in an image and where they are located, outputting bounding boxes with class labels.
Object Detection Overview
While image classification assigns a single label to an entire image, object detection identifies and localizes multiple objects within an image. Each detection consists of:
- Bounding box: Coordinates (x, y, width, height) defining the object's location
- Class label: What the object is (person, car, dog, etc.)
- Confidence score: How confident the model is in the detection
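The three components above can be captured in a simple record type. This is a minimal sketch (the `Detection` class and its field names are illustrative, not part of any library):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detection: a box in (x, y, width, height) pixel coordinates,
    plus a class label and a confidence score."""
    x: float
    y: float
    width: float
    height: float
    label: str         # class label, e.g. "person"
    confidence: float  # model confidence in [0, 1]

d = Detection(x=10, y=20, width=50, height=80, label="dog", confidence=0.92)
```

Detection libraries differ in box convention (corner `(x1, y1, x2, y2)` vs. center `(cx, cy, w, h)`), so always check which format a given API returns.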
Sliding Window Approach
The earliest approach to detection: slide a fixed-size window across the image at multiple scales, running a classifier on each window. This is extremely slow and has been replaced by modern methods, but understanding it helps appreciate why newer approaches were needed.
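The idea can be sketched in a few lines. This is a toy illustration, not a practical detector: `classify` is a stand-in for any patch classifier, and a real system would also repeat the scan over resized copies of the image to cover multiple scales:

```python
import numpy as np

def sliding_window(image, window=(64, 64), stride=32):
    """Yield (x, y, patch) for every window position in the image."""
    h, w = image.shape[:2]
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            yield x, y, image[y:y + window[1], x:x + window[0]]

def detect(image, classify, threshold=0.5):
    """Run the classifier on every window; keep confident hits."""
    hits = []
    for x, y, patch in sliding_window(image):
        label, score = classify(patch)  # classify is a stand-in function
        if score >= threshold:
            hits.append((x, y, label, score))
    return hits
```

Even this single-scale scan runs the classifier hundreds of times per image, which is exactly the cost that region proposals and single-shot detectors were designed to avoid.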
R-CNN Family
| Model | Year | Approach | Speed |
|---|---|---|---|
| R-CNN | 2014 | Selective Search generates ~2000 region proposals, each processed by a CNN separately | ~47 seconds/image |
| Fast R-CNN | 2015 | Single CNN pass for entire image, ROI pooling extracts features for each proposal | ~2 seconds/image |
| Faster R-CNN | 2015 | Region Proposal Network (RPN) replaces Selective Search, fully neural end-to-end | ~0.2 seconds/image |
YOLO (You Only Look Once)
YOLO revolutionized object detection by treating it as a single regression problem. Instead of proposing regions and classifying them separately, YOLO divides the image into a grid and predicts bounding boxes and class probabilities for each cell in a single forward pass.
| Version | Key Innovation |
|---|---|
| YOLOv5 | PyTorch implementation, easy to use, strong community, multiple model sizes (n/s/m/l/x) |
| YOLOv8 | By Ultralytics. State-of-the-art accuracy, anchor-free detection, unified API for detection/segmentation/classification |
```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 model
model = YOLO("yolov8n.pt")  # nano model (fastest)

# Run detection on an image
results = model("street.jpg")

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0]
        print(f"{model.names[cls]}: {conf:.2f} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")

# Save annotated image
results[0].save("output.jpg")
```
SSD (Single Shot Detector)
SSD detects objects at multiple scales by using feature maps from different layers of a CNN. It is faster than Faster R-CNN while maintaining competitive accuracy, making it suitable for real-time applications.
Key Concepts
Anchor Boxes
Predefined bounding box shapes at each grid cell. The model predicts offsets from these anchors rather than absolute coordinates, making training more stable. Different anchors capture objects of different aspect ratios (tall, wide, square).
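The offset parameterization can be made concrete. The sketch below assumes the widely used Faster R-CNN / SSD encoding, where center offsets are fractions of the anchor size and width/height offsets are log-scale factors (the function name and tuple formats are illustrative):

```python
import math

def decode(anchor, pred):
    """Decode predicted offsets (tx, ty, tw, th) relative to an anchor
    (cx, cy, w, h) into an absolute box in center format."""
    cx_a, cy_a, w_a, h_a = anchor
    tx, ty, tw, th = pred
    cx = cx_a + tx * w_a    # shift center by a fraction of anchor size
    cy = cy_a + ty * h_a
    w = w_a * math.exp(tw)  # scale width/height multiplicatively
    h = h_a * math.exp(th)
    return cx, cy, w, h
```

Because the network only predicts small corrections, zero output already yields the anchor itself, which is what makes training with anchors more stable than regressing absolute pixel coordinates.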
Non-Maximum Suppression (NMS)
When a model produces multiple overlapping detections for the same object, NMS keeps only the highest-confidence box and removes boxes that overlap significantly (above an IoU threshold).
IoU (Intersection over Union)
The standard metric for measuring bounding box overlap: IoU = Area of Overlap / Area of Union. A predicted box is typically counted as a correct detection when its IoU with the ground-truth box is at least 0.5.
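The two concepts fit together naturally: greedy NMS repeatedly keeps the highest-scoring box and uses IoU to decide which remaining boxes overlap it enough to discard. A minimal sketch, using corner-format `(x1, y1, x2, y2)` boxes:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop overlapping lower-scoring ones.
    Returns the indices of the boxes that survive."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Production detectors use batched, vectorized implementations (e.g. `torchvision.ops.nms`), but the logic is the same.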
Transfer Learning for Detection
Instead of training from scratch, start with a model pretrained on a large dataset (like COCO with 80 object classes) and fine-tune it on your specific objects:
```python
from ultralytics import YOLO

# Load pretrained model
model = YOLO("yolov8n.pt")

# Fine-tune on custom dataset
results = model.train(
    data="custom_dataset.yaml",  # Dataset config
    epochs=50,
    imgsz=640,
    batch=16,
)

# Evaluate on validation set
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
```