Intermediate

Object Detection Questions

These 12 questions cover object detection concepts tested in CV interviews at companies building autonomous vehicles, robotics, surveillance, and content understanding systems. Expect deep questions on architecture differences, evaluation metrics, and real-time constraints.

Q1: What is IoU (Intersection over Union) and how is it used in object detection?

💡
Model Answer:

IoU measures the overlap between a predicted bounding box and a ground truth box: IoU = Area(Intersection) / Area(Union). It ranges from 0 (no overlap) to 1 (perfect overlap).

Uses in object detection:

  • Matching predictions to ground truth: A prediction is a True Positive if IoU ≥ threshold (typically 0.5 for PASCAL VOC, 0.5:0.95 for COCO)
  • Non-Maximum Suppression: Removes duplicate detections by suppressing any box whose IoU with a higher-scoring box exceeds a threshold
  • Anchor assignment: During training, anchors with IoU ≥ 0.5 with a ground truth box are assigned as positive examples
import torch

def compute_iou(box1, box2):
    """Compute IoU between two sets of boxes.
    Args:
        box1: (N, 4) tensor [x1, y1, x2, y2]
        box2: (M, 4) tensor [x1, y1, x2, y2]
    Returns:
        (N, M) IoU matrix
    """
    # Intersection coordinates
    x1 = torch.max(box1[:, None, 0], box2[None, :, 0])
    y1 = torch.max(box1[:, None, 1], box2[None, :, 1])
    x2 = torch.min(box1[:, None, 2], box2[None, :, 2])
    y2 = torch.min(box1[:, None, 3], box2[None, :, 3])

    intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])

    union = area1[:, None] + area2[None, :] - intersection
    return intersection / (union + 1e-6)

Q2: Compare two-stage vs one-stage object detectors.

💡
Model Answer:

| Aspect | Two-Stage (Faster R-CNN) | One-Stage (YOLO, SSD) |
|---|---|---|
| Architecture | Stage 1: Region Proposal Network (RPN) generates ~2000 proposals. Stage 2: classifies and refines each proposal | Single network predicts class and box directly from feature maps in one pass |
| Speed | Slower (typically 5–15 FPS); two sequential stages | Faster (30–200+ FPS); single forward pass |
| Accuracy | Generally higher, especially for small objects | Slightly lower historically, but the gap has narrowed significantly (YOLOv8 matches Faster R-CNN) |
| Training | More complex (multi-task loss, proposal sampling) | Simpler end-to-end training |
| Use case | When accuracy is paramount and the latency budget is generous (medical imaging, satellite analysis) | Real-time applications: autonomous driving, video surveillance, robotics |

Modern trend: The distinction has blurred. DETR (Detection Transformer) uses a set-based approach with no anchors or NMS. RT-DETR achieves real-time performance with transformer-based detection. YOLOv8/v9 match or exceed two-stage detectors on COCO while running at 100+ FPS.

Q3: Explain how Faster R-CNN works end to end.

💡
Model Answer:

Faster R-CNN consists of three main components:

  1. Backbone (Feature Extractor): A CNN (typically ResNet-50 with FPN) extracts feature maps from the input image. The backbone is shared between both stages.
  2. Region Proposal Network (RPN):
    • Slides a 3x3 convolution over the feature map
    • At each position, predicts objectness score (object vs background) and bounding box offsets for K anchor boxes (typically K=9: 3 scales x 3 aspect ratios)
    • Anchors with IoU ≥ 0.7 with any ground truth are positive; IoU < 0.3 are negative
    • After NMS, top ~300 proposals are passed to the second stage
  3. Detection Head (RoI Head):
    • RoI Pooling (or RoI Align) extracts fixed-size feature maps for each proposal from the backbone features
    • Two FC layers predict: (a) class probabilities for C+1 classes (including background), (b) bounding box refinement offsets
    • Final NMS removes duplicate detections per class

Loss function: Multi-task loss = L_cls(RPN) + L_reg(RPN) + L_cls(Head) + L_reg(Head). Classification uses cross-entropy; regression uses smooth L1 loss.

RoI Align vs RoI Pooling: RoI Pooling quantizes coordinates to integer positions, causing misalignment. RoI Align uses bilinear interpolation to sample at exact positions, improving accuracy by ~1–2 mAP points. Always use RoI Align.

Q4: How does YOLO work? Trace the evolution from YOLOv1 to YOLOv8.

💡
Model Answer:

YOLOv1 (2016): Divides image into S×S grid. Each cell predicts B bounding boxes and C class probabilities. Single forward pass through the entire image. Fast but poor on small objects and multiple objects per grid cell.
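The grid-cell output layout can be sketched for the original VOC configuration (S=7, B=2, C=20); the indexing below is illustrative, not the exact tensor layout of any released implementation:

```python
import torch

S, B, C = 7, 2, 20                   # grid size, boxes per cell, classes (VOC)
pred = torch.rand(S, S, B * 5 + C)   # raw network output: 7 x 7 x 30

cell = pred[3, 4]                    # predictions for one grid cell
boxes = cell[:B * 5].reshape(B, 5)   # B boxes: (x, y, w, h, confidence)
class_probs = cell[B * 5:]           # C conditional class probabilities

# class-specific confidence = box confidence * conditional class probability
scores = boxes[:, 4:5] * class_probs  # (B, C) matrix of final scores
print(scores.shape)  # torch.Size([2, 20])
```

The per-cell class probabilities being shared across both boxes is exactly why YOLOv1 struggles with multiple objects in one cell.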

Key improvements across versions:

| Version | Key Innovation | COCO mAP |
|---|---|---|
| YOLOv2 | Batch norm, anchor boxes, multi-scale training, passthrough layer | ~22 |
| YOLOv3 | Multi-scale predictions at 3 scales (FPN-like), Darknet-53 backbone, binary cross-entropy per class | ~33 |
| YOLOv4 | CSPDarknet backbone, SPP, PANet neck, Mish activation, mosaic augmentation, CIoU loss | ~43 |
| YOLOv5 | PyTorch implementation (Ultralytics), auto-anchor computation, extensive augmentation pipeline, model scaling (S/M/L/X) | ~49 (L) |
| YOLOv8 | Anchor-free design, decoupled detection head, C2f blocks, task-aligned assigner, DFL loss | ~53 (L/X) |

YOLOv8 architecture in detail:

  • Backbone: CSPDarknet with C2f modules (a faster CSP bottleneck variant with two convolutions) that concatenate the outputs of multiple bottleneck blocks
  • Neck: PANet (Path Aggregation Network) with top-down and bottom-up feature fusion
  • Head: Decoupled head (separate branches for classification and regression). Anchor-free: predicts center offset and width/height directly
  • Loss: CIoU loss for boxes + Distribution Focal Loss (DFL) + binary cross-entropy for classification

Q5: What are anchor boxes and why are they used?

💡
Model Answer:

Anchor boxes are predefined bounding boxes of various sizes and aspect ratios placed at each spatial location in the feature map. Instead of predicting absolute box coordinates, the model predicts offsets relative to these anchors.

Why anchors help:

  • The model only needs to learn small adjustments rather than absolute coordinates, which is easier to optimize
  • Multiple anchors per location handle objects of different scales and aspect ratios at the same position
  • They provide a well-defined matching scheme between predictions and ground truth during training

Typical anchor configurations:

  • Faster R-CNN: 3 scales (128, 256, 512) x 3 aspect ratios (1:1, 1:2, 2:1) = 9 anchors per position
  • SSD: Different anchors at different feature map scales (from 4 to 6 per location)
  • YOLOv5: 3 anchors per scale, learned via k-means clustering on training data

Anchor-free alternatives: Modern detectors (YOLOv8, FCOS, CenterNet) predict objects without predefined anchors. FCOS predicts distances from each point to the four sides of the bounding box. CenterNet predicts object centers as heatmap peaks. Anchor-free designs are simpler and avoid anchor hyperparameter tuning.
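Anchor generation for the Faster R-CNN configuration above can be sketched in a few lines (the `make_anchors` helper is illustrative, not a library function; real implementations then shift these base anchors to every feature-map position):

```python
import torch

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate Faster R-CNN style base anchors centered at the origin.
    Each anchor has area scale**2 and height/width ratio r."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5   # width shrinks as the ratio (h/w) grows
            h = s * r ** 0.5   # so that w * h == s**2 for every ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return torch.tensor(anchors)

anchors = make_anchors()
print(anchors.shape)  # torch.Size([9, 4])

# the area is preserved across aspect ratios
areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
```

The 3 scales x 3 ratios give the 9 anchors per position quoted above.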

Q6: Implement Non-Maximum Suppression (NMS) from scratch.

💡
Model Answer:

NMS removes duplicate detections by keeping only the highest-confidence box among overlapping predictions.

import torch

def nms(boxes, scores, iou_threshold=0.5):
    """Non-Maximum Suppression.
    Args:
        boxes: (N, 4) tensor [x1, y1, x2, y2]
        scores: (N,) confidence scores
        iou_threshold: suppress boxes with IoU above this
    Returns:
        keep: indices of boxes to keep
    """
    # Sort by confidence (descending)
    order = scores.argsort(descending=True)
    keep = []

    while order.numel() > 0:
        # Pick the highest-scoring box
        i = order[0].item()
        keep.append(i)

        if order.numel() == 1:
            break

        # Compute IoU of this box with all remaining boxes
        remaining = order[1:]
        xx1 = torch.max(boxes[i, 0], boxes[remaining, 0])
        yy1 = torch.max(boxes[i, 1], boxes[remaining, 1])
        xx2 = torch.min(boxes[i, 2], boxes[remaining, 2])
        yy2 = torch.min(boxes[i, 3], boxes[remaining, 3])

        intersection = (xx2 - xx1).clamp(min=0) * (yy2 - yy1).clamp(min=0)

        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rem = ((boxes[remaining, 2] - boxes[remaining, 0]) *
                    (boxes[remaining, 3] - boxes[remaining, 1]))
        union = area_i + area_rem - intersection
        iou = intersection / (union + 1e-6)

        # Keep boxes with IoU below threshold
        mask = iou <= iou_threshold
        order = remaining[mask]

    return torch.tensor(keep)


# Example usage
boxes = torch.tensor([
    [100, 100, 210, 210],  # box 0
    [105, 105, 215, 215],  # box 1 (overlaps with 0)
    [300, 300, 400, 400],  # box 2 (separate)
], dtype=torch.float)
scores = torch.tensor([0.9, 0.75, 0.8])
kept = nms(boxes, scores, iou_threshold=0.5)
print(f"Kept indices: {kept}")  # [0, 2] - box 1 suppressed

Variants:

  • Soft-NMS: Instead of hard removal, decays scores of overlapping boxes: score *= exp(-IoU^2 / sigma). Better for crowded scenes where objects occlude each other.
  • Batched NMS: Applies NMS independently per class by adding a class offset to coordinates.
  • Matrix NMS: Parallelizable version used in SOLO/SOLOv2 for instance segmentation.
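The Soft-NMS score-decay formula can be sketched as a variant of the loop above (the `soft_nms` helper and its default `sigma`/threshold values are illustrative):

```python
import torch

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS sketch: decay overlapping scores instead of removing."""
    boxes, scores = boxes.clone().float(), scores.clone().float()
    kept_boxes, kept_scores = [], []

    def area(b):
        return (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])

    while scores.numel() > 0:
        i = scores.argmax()
        best = boxes[i]
        kept_boxes.append(best)
        kept_scores.append(scores[i].item())

        # drop the picked box from the working set
        mask = torch.ones(scores.numel(), dtype=torch.bool)
        mask[i] = False
        boxes, scores = boxes[mask], scores[mask]
        if scores.numel() == 0:
            break

        # IoU of the picked box with all remaining boxes
        x1 = torch.max(best[0], boxes[:, 0])
        y1 = torch.max(best[1], boxes[:, 1])
        x2 = torch.min(best[2], boxes[:, 2])
        y2 = torch.min(best[3], boxes[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        iou = inter / (area(best) + area(boxes) - inter + 1e-6)

        # Gaussian decay instead of hard suppression: score *= exp(-IoU^2/sigma)
        scores = scores * torch.exp(-iou ** 2 / sigma)
        keep = scores > score_threshold
        boxes, scores = boxes[keep], scores[keep]

    return torch.stack(kept_boxes), torch.tensor(kept_scores)

boxes = torch.tensor([[100, 100, 210, 210],
                      [105, 105, 215, 215],
                      [300, 300, 400, 400]], dtype=torch.float)
scores = torch.tensor([0.9, 0.75, 0.8])
kept_boxes, kept_scores = soft_nms(boxes, scores)
# all 3 boxes survive, but the overlapping box's score is heavily decayed
```

On the same boxes hard NMS would remove the overlapping box entirely; Soft-NMS keeps it with a reduced score, which helps in crowded scenes.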

Q7: What is mAP (mean Average Precision) and how is it computed?

💡
Model Answer:

mAP is the primary metric for evaluating object detection models. The computation involves multiple steps:

  1. Match detections to ground truth: For each detection, check if IoU with a ground truth box exceeds the threshold. Each ground truth can only be matched once (highest confidence first).
  2. Compute precision and recall at each detection: Sort detections by confidence. At each detection, precision = TP / (TP + FP), recall = TP / total_GT.
  3. Compute AP per class: Plot the precision-recall curve. AP is the area under this curve (using 11-point or all-point interpolation).
  4. Compute mAP: Average AP across all classes.
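The steps above can be sketched for a single class with all-point interpolation (the `average_precision` helper and the toy detections are illustrative):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class.
    scores: confidence per detection; is_tp: 1 if matched to a GT box, else 0."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp

    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)

    # make precision monotonically decreasing (all-point interpolation)
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])

    # area under the step-shaped precision-recall curve
    r = np.concatenate(([0.0], recall))
    return float(np.sum((r[1:] - r[:-1]) * precision))

# 5 detections sorted by confidence: TP, TP, FP, TP, FP; 4 GT boxes total
ap = average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0], num_gt=4)
print(ap)  # 0.6875
```

COCO AP repeats this per IoU threshold (0.5 to 0.95) and averages; the matching step that produces `is_tp` is assumed to have happened already.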

PASCAL VOC vs COCO metrics:

| Metric | PASCAL VOC | COCO |
|---|---|---|
| IoU threshold | Single: 0.5 | Multiple: 0.5, 0.55, ..., 0.95 (10 thresholds) |
| Notation | mAP@0.5 (or just mAP) | AP (primary), AP@50, AP@75 (strict) |
| Size splits | None | AP_small (<32x32 px), AP_medium (32–96), AP_large (>96) |
| Difficulty | Lenient (0.5 IoU is easy) | Strict (averaging up to 0.95 requires precise localization) |

Interview tip: When someone says "mAP of 53" without specifying the protocol, it is almost certainly COCO AP (averaged over IoU 0.5:0.95). Always clarify the evaluation protocol.

Q8: What is a Feature Pyramid Network (FPN) and why is it important?

💡
Model Answer:

FPN creates a multi-scale feature pyramid with strong semantics at all scales by combining a bottom-up pathway (backbone) with a top-down pathway and lateral connections.

How it works:

  1. Bottom-up pathway: Standard CNN backbone (ResNet) produces feature maps at decreasing spatial resolutions: C2, C3, C4, C5 (1/4, 1/8, 1/16, 1/32 of input)
  2. Top-down pathway: Starting from C5, upsample by 2x using nearest-neighbor interpolation
  3. Lateral connections: 1x1 convolution on each Ci to match channel dimensions, then element-wise addition with the upsampled feature from above
  4. Output: P2, P3, P4, P5 feature maps, each with the same channel depth (typically 256) but different spatial resolutions
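The top-down path and lateral connections can be sketched as a small module (the `SimpleFPN` class is illustrative; channel counts assume a ResNet-50 backbone):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN top-down path over backbone features C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs unify channel depth; 3x3 output convs reduce aliasing
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                     for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3,
                                              padding=1) for _ in in_channels)

    def forward(self, feats):  # feats = [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: upsample the coarser level and add it to the lateral below
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], scale_factor=2, mode="nearest")
        return [conv(p) for conv, p in zip(self.output, laterals)]  # [P2..P5]

fpn = SimpleFPN()
feats = [torch.rand(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
pyramid = fpn(feats)
print([tuple(p.shape) for p in pyramid])
# every level has 256 channels; spatial sizes 64, 32, 16, 8
```

Note every output level shares the same channel depth, so one detection head can run on all of them.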

Why it matters:

  • Small objects are detected on high-resolution features (P2, P3) which have fine spatial detail but enriched with high-level semantics from the top-down path
  • Large objects use lower-resolution features (P4, P5) with larger receptive fields
  • Without FPN, small object detection suffers because high-level features have lost spatial detail through downsampling

Variants: PANet adds a bottom-up path after FPN for better low-level feature propagation. BiFPN (EfficientDet) uses weighted bidirectional fusion. NAS-FPN uses neural architecture search to find optimal feature fusion topology.

Q9: What is SSD (Single Shot MultiBox Detector) and how does it differ from YOLO?

💡
Model Answer:

SSD predicts detections from multiple feature maps at different scales in a single forward pass. Unlike YOLOv1 which only used one feature map, SSD makes predictions from 6 different resolutions (from 38x38 down to 1x1).

Key differences from YOLO:

| Aspect | SSD | YOLO (v1–v3) |
|---|---|---|
| Multi-scale detection | Predictions from 6 feature maps of different sizes | YOLOv1: single scale. YOLOv3+: 3 scales via FPN-like structure |
| Anchor boxes | Predefined anchors with hand-picked aspect ratios at each scale | YOLOv2+: learned anchors via k-means. YOLOv8: anchor-free |
| Backbone | VGG-16 (original) with extra conv layers; SSD512 for higher resolution | Custom Darknet architectures optimized for speed |
| Small objects | Better due to early feature map predictions (38x38) | Historically weaker, improved in v3+ with FPN |
| Loss | Hard negative mining (3:1 neg:pos ratio) + smooth L1 | Custom per version, focal loss variants in later versions |

Modern relevance: SSD's multi-scale prediction idea has been adopted by nearly all modern detectors. MobileNet-SSD (SSD with a MobileNet backbone, often in the SSDLite variant) remains popular for edge deployment due to its simple architecture and good speed-accuracy trade-off.

Q10: What loss functions are used in object detection?

💡
Model Answer:

Object detection uses a multi-task loss combining classification and localization:

Classification losses:

  • Cross-entropy: Standard for multi-class classification in two-stage detectors
  • Binary cross-entropy: Per-class independent prediction (multi-label, used in YOLOv3+)
  • Focal Loss: FL(p_t) = -alpha * (1-p_t)^gamma * log(p_t). Down-weights easy negatives (background), focusing training on hard examples. Introduced in RetinaNet to handle extreme foreground-background imbalance in one-stage detectors. Gamma=2, alpha=0.25 are standard.
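The focal loss formula can be sketched directly (a minimal binary version; the `focal_loss` helper is illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([3.0, -3.0, 0.1])   # confident pos, confident neg, hard
targets = torch.tensor([1.0, 0.0, 1.0])

# the (1 - p_t)^gamma factor makes the two easy examples contribute almost
# nothing, so the hard example dominates the mean loss
print(focal_loss(logits, targets))
```

With `gamma=0` and `alpha=0.5` this reduces to plain (halved) binary cross-entropy, which is a handy sanity check.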

Localization losses:

  • Smooth L1 (Huber loss): Less sensitive to outliers than L2. Used in Faster R-CNN. L = 0.5*x^2 if |x|<1 else |x|-0.5
  • IoU Loss: Directly optimizes IoU. L = 1 - IoU. Better aligns with the evaluation metric.
  • GIoU Loss: Handles non-overlapping boxes, where plain IoU loss yields zero gradient (IoU = 0), by adding a penalty based on the smallest enclosing box.
  • CIoU Loss: Adds aspect ratio penalty and center distance penalty to GIoU. Used in YOLOv5+. L = 1 - IoU + distance_penalty + aspect_ratio_penalty
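A minimal GIoU loss sketch showing why non-overlapping boxes still receive a gradient (the `giou_loss` helper is illustrative; recent torchvision versions ship a tested `torchvision.ops.generalized_box_iou_loss`):

```python
import torch

def giou_loss(pred, target):
    """GIoU loss for [x1, y1, x2, y2] boxes: L = 1 - GIoU."""
    # intersection
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + 1e-6)

    # smallest enclosing box: this term is what penalizes distant boxes
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)

    giou = iou - (enclose - union) / (enclose + 1e-6)
    return (1 - giou).mean()

# two boxes that do not overlap at all: IoU = 0, yet the loss is informative
pred = torch.tensor([[0., 0., 10., 10.]], requires_grad=True)
target = torch.tensor([[20., 20., 30., 30.]])
loss = giou_loss(pred, target)
loss.backward()  # pred.grad is nonzero even though IoU is 0
```

A plain `1 - IoU` loss would be flat (gradient zero) in this case, which is exactly the failure mode GIoU was designed to fix.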

Modern trend: Most state-of-the-art detectors use CIoU or DIoU for regression, focal loss or quality focal loss for classification, and task-aligned label assignment instead of IoU-based anchor matching.

Q11: How does DETR (Detection Transformer) work? What makes it different?

💡
Model Answer:

DETR reformulates object detection as a set prediction problem using a transformer architecture, eliminating the need for anchors, NMS, and hand-crafted components.

Architecture:

  1. CNN backbone: Extracts feature maps (ResNet-50 typically)
  2. Transformer encoder: Self-attention over flattened feature map positions with positional encodings. Captures global context.
  3. Transformer decoder: N learned object queries (typically N=100) attend to encoder outputs. Each query specializes in detecting objects in certain positions/scales.
  4. Prediction heads: Each query output is fed to FFN heads that predict class + bounding box (or "no object").

Training: Uses Hungarian matching (bipartite matching) to find the optimal one-to-one assignment between N predictions and ground truth objects. This eliminates the need for NMS.
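The bipartite matching step can be sketched with scipy's Hungarian solver (the cost weights and toy numbers are illustrative of a DETR-style matcher, not DETR's exact cost function):

```python
import torch
from scipy.optimize import linear_sum_assignment

# 3 object queries vs 2 ground-truth objects: the matcher builds a cost matrix
# combining a classification cost (negative prob of the GT class) and a box cost
prob = torch.tensor([[0.9, 0.05],   # query 0 is confident about GT class 0
                     [0.1, 0.8],    # query 1 is confident about GT class 1
                     [0.4, 0.4]])   # query 2 is uncertain
pred_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.2],
                           [0.1, 0.1, 0.1, 0.1],
                           [0.8, 0.8, 0.3, 0.3]])
gt_boxes = torch.tensor([[0.5, 0.5, 0.25, 0.25],
                         [0.1, 0.12, 0.1, 0.1]])

cost_class = -prob                                  # (3, 2)
cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (3, 2) L1 box distances
cost = cost_class + 5.0 * cost_box                  # illustrative weighting

# Hungarian algorithm: minimum-cost one-to-one assignment; unmatched queries
# are trained to predict "no object"
query_idx, gt_idx = linear_sum_assignment(cost.numpy())
print(query_idx, gt_idx)  # query 0 -> GT 0, query 1 -> GT 1
```

Because each ground truth is matched to exactly one query, duplicate predictions are penalized during training and NMS becomes unnecessary at inference.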

Advantages: No anchors, no NMS, no hand-crafted components. Simpler code. Global reasoning via self-attention. Good at avoiding duplicate detections.

Limitations: Slow convergence (500 epochs vs 36 for Faster R-CNN). Poor performance on small objects. High memory cost of attention on high-resolution features.

Follow-ups: Deformable DETR fixes convergence and small object issues by using deformable attention (only attends to a small set of sampling points). RT-DETR achieves real-time performance. Co-DETR uses collaborative learning to improve performance further.

Q12: Write code to run inference with a pretrained object detection model.

💡
Model Answer:
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.transforms import functional as F
from PIL import Image

def detect_objects(image_path, score_threshold=0.5):
    """Run object detection on an image using Faster R-CNN."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load pretrained model (COCO: 91 classes)
    model = fasterrcnn_resnet50_fpn_v2(
        weights=torchvision.models.detection
        .FasterRCNN_ResNet50_FPN_V2_Weights.COCO_V1
    )
    model = model.to(device).eval()

    # Load and preprocess image
    image = Image.open(image_path).convert("RGB")
    image_tensor = F.to_tensor(image).unsqueeze(0).to(device)

    # Run inference
    with torch.no_grad():
        predictions = model(image_tensor)[0]

    # Filter by confidence
    mask = predictions["scores"] >= score_threshold
    boxes = predictions["boxes"][mask]
    labels = predictions["labels"][mask]
    scores = predictions["scores"][mask]

    # COCO class names (subset)
    COCO_NAMES = [
        "__background__", "person", "bicycle", "car", "motorcycle",
        "airplane", "bus", "train", "truck", "boat", "traffic light",
        "fire hydrant", "N/A", "stop sign", "parking meter", "bench",
        # ... 91 total classes
    ]

    for box, label, score in zip(boxes, labels, scores):
        x1, y1, x2, y2 = box.int().tolist()
        name = COCO_NAMES[label] if label < len(COCO_NAMES) else f"class_{label}"
        print(f"{name}: {score:.2f} at [{x1}, {y1}, {x2}, {y2}]")

    return boxes, labels, scores


# Using Ultralytics YOLOv8 (simpler API)
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano model for speed
results = model("image.jpg", conf=0.5)

for r in results:
    for box in r.boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].int().tolist()
        print(f"{r.names[cls]}: {conf:.2f} at [{x1},{y1},{x2},{y2}]")