Advanced

Practical CV Challenges

These 10 questions test production-level computer vision knowledge. Interviewers at companies like Tesla, Waymo, Amazon, and Meta want to see that you can deploy and maintain CV systems at scale, not just train models in notebooks.

Q1: How do you optimize a CV model for real-time inference?

💡
Model Answer:

Real-time CV typically means ≤33ms per frame (30 FPS) or ≤16ms (60 FPS). Here is the optimization hierarchy, ordered by impact:

  1. Model architecture selection: Choose an efficient backbone from the start. MobileNetV3 is 10x faster than ResNet-50 with ~5% accuracy drop. EfficientNet provides the best accuracy-speed trade-off. For detection, YOLOv8-nano (3.2M params) vs YOLOv8-xlarge (68M params).
  2. Input resolution: Reducing from 640x640 to 320x320 gives ~4x speedup. Often the single biggest lever. Find the minimum resolution that meets your accuracy requirements.
  3. Quantization:
    • FP16 (half precision): ~2x speedup on GPUs with tensor cores. Negligible accuracy loss. Always do this.
    • INT8: ~4x speedup. Requires calibration dataset (representative samples). Typically <1% accuracy loss with proper calibration.
    • INT4: Aggressive. Useful for edge devices. May need quantization-aware training (QAT) to maintain quality.
  4. TensorRT / ONNX Runtime: Graph optimization (operator fusion, kernel auto-tuning). TensorRT typically gives 2–5x speedup over vanilla PyTorch.
  5. Knowledge distillation: Train a small student model to mimic a large teacher. Preserves more accuracy than simply using a smaller architecture.
  6. Pruning: Remove low-magnitude weights or entire channels. Structured pruning (removing entire filters) is hardware-friendly. Can remove 50–70% of parameters with <2% accuracy loss.
# Export PyTorch model to ONNX, then optimize with TensorRT
import torch

model = torch.load("model.pth", weights_only=False).eval().cuda()  # full serialized model
dummy_input = torch.randn(1, 3, 640, 640).cuda()

# Step 1: Export to ONNX
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17
)

# Step 2: Convert ONNX to TensorRT (via trtexec CLI)
# trtexec --onnx=model.onnx --saveEngine=model.engine \
#         --fp16 --workspace=4096

# Step 3: INT8 quantization with a calibration cache
# (generate the cache from representative images first; --calib expects
# a cache file, not a directory)
# trtexec --onnx=model.onnx --saveEngine=model_int8.engine \
#         --int8 --calib=calibration.cache

Q2: Compare edge deployment platforms for CV models.

💡
Model Answer:
| Platform | Hardware | Performance | Best For |
|---|---|---|---|
| NVIDIA Jetson Orin | ARM + GPU (up to 275 TOPS INT8) | High. Runs YOLO at 60+ FPS | Autonomous vehicles, robotics, industrial inspection. Full CUDA support |
| Google Coral / Edge TPU | Custom ASIC (4 TOPS INT8) | Moderate. Limited to TFLite models | Low-power IoT devices, smart cameras. Very power efficient (2W) |
| Intel OpenVINO | Intel CPUs, iGPUs, VPUs (Movidius) | Good on Intel hardware | Enterprise deployments with Intel infrastructure. Wide model support |
| Qualcomm Snapdragon | Hexagon DSP + Adreno GPU | Good for mobile | Mobile phones, AR glasses. Optimized for on-device inference |
| Apple Neural Engine | Custom NPU (up to 38 TOPS) | Excellent for CoreML models | iOS/macOS apps. Best-in-class efficiency for Apple devices |

Deployment workflow:

  1. Train in PyTorch on GPU cluster
  2. Export to ONNX (universal intermediate format)
  3. Convert to target-specific format: TensorRT (NVIDIA), TFLite (Coral), OpenVINO IR (Intel), CoreML (Apple)
  4. Quantize to INT8 with calibration data from target domain
  5. Benchmark on target hardware with realistic input pipeline

Common pitfall: GPU benchmarks do not reflect edge performance. A model that runs at 100 FPS on an A100 may run at 5 FPS on a Jetson Nano. Always benchmark on the actual target device early in the project.
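The benchmarking step can be sketched framework-agnostically. Here `infer` and `make_input` are placeholders for your model's forward pass and input pipeline; on a GPU, remember to synchronize (e.g. `torch.cuda.synchronize()`) before reading the clock.

```python
import time
import statistics

def benchmark(infer, make_input, warmup=10, iters=100):
    """Measure inference latency percentiles for a callable.
    `infer` and `make_input` stand in for the model forward pass
    and the realistic input pipeline."""
    for _ in range(warmup):              # warm up caches, JIT, clocks
        infer(make_input())
    latencies = []
    for _ in range(iters):
        x = make_input()
        start = time.perf_counter()
        infer(x)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
    latencies.sort()
    return {
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[int(len(latencies) * 0.95)],
        "mean_ms": statistics.mean(latencies),
    }

# Stand-in workload; replace with your model + preprocessing
stats = benchmark(lambda x: sum(x), lambda: list(range(1000)))
print(stats)
```

Report P95/P99 rather than the mean: tail latency is what breaks real-time budgets.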

Q3: How do you design a data labeling pipeline for a CV project?

💡
Model Answer:

Data labeling is typically the bottleneck and largest cost in CV projects. A well-designed pipeline directly impacts model quality.

Pipeline design:

  1. Define labeling guidelines: Create a detailed document with examples of positive/negative cases, edge cases, and ambiguous situations. Include visual examples. Iterate guidelines with a small pilot before full-scale labeling.
  2. Choose labeling tool: CVAT (open source, self-hosted), Label Studio (flexible, multi-task), Labelbox/Scale AI (managed, expensive but high quality). For segmentation, use SAM-assisted labeling to pre-generate masks that annotators refine.
  3. Quality control:
    • Multi-annotator overlap: Have 2–3 annotators label the same images. Measure inter-annotator agreement (Cohen's kappa for classification, IoU for boxes/masks).
    • Gold standard examples: Mix in pre-labeled "test" images to monitor annotator accuracy continuously.
    • Review queue: Senior annotators review a random sample (10–20%) of all labels.
  4. Active learning: Use the current model to identify images where it is most uncertain (highest entropy, closest to decision boundary). Prioritize labeling those images. This provides 3–5x more value per labeled image than random sampling.
  5. Versioning: Version your datasets (DVC, LakeFS, or simple git-tracked manifests). Track which data version trained which model. Essential for reproducibility and debugging.
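The active-learning step above can be sketched as entropy-based ranking. This is a toy version; the function and image names are illustrative, and production systems would batch this over model outputs.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Rank unlabeled images by predictive entropy and return the
    `budget` most uncertain ids. `predictions` maps image id to the
    model's softmax output."""
    ranked = sorted(predictions.items(),
                    key=lambda kv: entropy(kv[1]), reverse=True)
    return [image_id for image_id, _ in ranked[:budget]]

preds = {
    "img_a": [0.98, 0.01, 0.01],   # confident: low labeling value
    "img_b": [0.34, 0.33, 0.33],   # near-uniform: most uncertain
    "img_c": [0.70, 0.20, 0.10],
}
print(select_for_labeling(preds, budget=2))  # -> ['img_b', 'img_c']
```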

Cost benchmarks (2024): Image classification: $0.02–0.05/image. Bounding boxes: $0.10–0.30/box. Instance segmentation masks: $0.50–2.00/mask. Medical image annotation by experts: $5–50/image.

Q4: What is model quantization? Explain PTQ vs QAT.

💡
Model Answer:

Quantization reduces numerical precision of model weights and activations from FP32 to lower bit-widths (FP16, INT8, INT4), reducing memory, computation, and power consumption.

| Aspect | PTQ (Post-Training Quantization) | QAT (Quantization-Aware Training) |
|---|---|---|
| Process | Quantize a pretrained model using a small calibration dataset (100–1000 images) | Insert fake quantization nodes during training. Model learns to be robust to quantization noise |
| Accuracy loss | Typically <1% for INT8. Can be significant for INT4 | Minimal even for INT4. Recovers most accuracy lost by PTQ |
| Effort | Minutes to hours. No retraining needed | Full retraining required (20–50% of original training time) |
| When to use | First attempt. Works well for most models at INT8 | When PTQ accuracy drop is unacceptable, or for aggressive quantization (INT4/binary) |

Quantization schemes:

  • Symmetric: Maps [-max, max] to [-127, 127]. Simpler, used for weights.
  • Asymmetric: Maps [min, max] to [0, 255]. Better for activations (often non-negative after ReLU).
  • Per-tensor: Single scale/zero-point for entire tensor. Faster but less precise.
  • Per-channel: Separate scale/zero-point per output channel. More accurate, standard for weights.
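The asymmetric scheme above reduces to simple affine arithmetic. A minimal sketch with an INT8 range of [0, 255] (function names are illustrative):

```python
def quant_params_asymmetric(xmin, xmax, qmin=0, qmax=255):
    """Affine (asymmetric) quantization parameters mapping the
    observed float range [xmin, xmax] onto integers [qmin, qmax]."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize one value, clamping to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

# Suppose calibration observed activations in [-1.0, 3.0]
scale, zp = quant_params_asymmetric(-1.0, 3.0)
print(zp)                         # 64: the real value 0.0 maps here
print(quantize(-1.0, scale, zp))  # 0   (range minimum)
print(quantize(3.0, scale, zp))   # 255 (range maximum)
```

Per-channel quantization simply computes one such (scale, zero_point) pair per output channel instead of per tensor.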
import torch
from torch.quantization import quantize_dynamic, prepare, convert

# Dynamic quantization (simplest, CPU only)
quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Static PTQ (better accuracy, needs calibration).
# Note: eager-mode static quantization also expects QuantStub/DeQuantStub
# wrappers at the model's inputs/outputs.
model.eval()
model.qconfig = torch.quantization.get_default_qconfig("x86")
model_prepared = prepare(model)

# Run calibration data through the model
with torch.no_grad():
    for images, _ in calibration_loader:
        model_prepared(images)

model_quantized = convert(model_prepared)

# Compare sizes
import os
torch.save(model.state_dict(), "fp32.pth")
torch.save(model_quantized.state_dict(), "int8.pth")
print(f"FP32: {os.path.getsize('fp32.pth') / 1e6:.1f} MB")
print(f"INT8: {os.path.getsize('int8.pth') / 1e6:.1f} MB")
# Typically ~4x smaller

Q5: How do you handle domain shift in production CV systems?

💡
Model Answer:

Domain shift occurs when the distribution of production data differs from training data. In CV, this is extremely common: lighting changes, new camera models, weather conditions, new product types, seasonal variations.

Detection strategies:

  • Confidence monitoring: Track average prediction confidence over time. A drop in confidence indicates the model is seeing unfamiliar data.
  • Feature drift detection: Monitor the distribution of intermediate features (penultimate layer activations). Use statistical tests (KL divergence, MMD) or embedding visualization to detect drift.
  • Error rate monitoring: If ground truth is available (even delayed), track accuracy/mAP trends. Set alerts for statistically significant drops.
  • Out-of-distribution (OOD) detection: Use energy scores, Mahalanobis distance, or a separate OOD detector to flag inputs that are far from the training distribution.
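Confidence-distribution drift can be sketched by comparing histograms with KL divergence. The 0.5 alert threshold below is an illustrative starting point, not a standard; tune it on your own reference windows.

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two normalized histograms."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def histogram(values, bins=10):
    """Normalized histogram of confidence scores in [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]

# Reference window (validation-time confidences) vs. a production window
reference = histogram([0.9, 0.92, 0.88, 0.95, 0.91, 0.85, 0.93, 0.9])
production = histogram([0.6, 0.55, 0.7, 0.5, 0.65, 0.58, 0.62, 0.52])
drift = kl_divergence(production, reference)
if drift > 0.5:   # illustrative threshold
    print(f"Possible domain shift: KL={drift:.2f}")
```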

Mitigation strategies:

  • Diverse training data: Include multiple domains, conditions, and edge cases during initial training. Use synthetic data (rendered scenes, style transfer) to increase diversity.
  • Domain adaptation: Fine-tune on unlabeled target domain data using techniques like DANN (Domain-Adversarial Neural Networks) or pseudo-labeling.
  • Test-time augmentation (TTA): Run inference on multiple augmented versions of each input and aggregate predictions. Improves robustness to minor distribution shifts at the cost of latency.
  • Continuous learning: Periodically retrain on recent production data (with quality-checked labels). Use a data flywheel: model makes predictions → humans verify edge cases → verified data improves the model.
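TTA reduces to running the model on several views and averaging. A minimal sketch with a single flip view; `predict` is a stand-in for your model, and images are plain lists of rows for illustration:

```python
def hflip(image):
    """Horizontal flip of an image stored as a list of rows."""
    return [list(reversed(row)) for row in image]

def tta_predict(predict, image):
    """Average class probabilities over the original and flipped views.
    Add more views (scales, crops) at the cost of extra latency."""
    views = [image, hflip(image)]
    probs = [predict(v) for v in views]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]

# Stand-in "model" whose output depends on the view, to show averaging
def fake_predict(image):
    return [0.8, 0.2] if image[0][0] == 1 else [0.6, 0.4]

image = [[1, 0], [0, 0]]
print(tta_predict(fake_predict, image))  # -> approximately [0.7, 0.3]
```

Only use views the task is actually invariant to; a horizontal flip is safe for most natural images but not, say, for text.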

Q6: Design a real-time video analytics pipeline for a security camera system.

💡
Model Answer:

This is a system design question. Structure your answer with clear components:

Requirements:

  • 100 cameras, 1080p at 30 FPS each
  • Person detection, tracking, and activity recognition
  • Alerts within 5 seconds of anomalous activity
  • Store events for 30 days, raw video for 7 days

Architecture:

  1. Edge processing (per camera or per cluster): NVIDIA Jetson Orin handles 4–8 cameras. Runs YOLOv8-small for detection + ByteTrack for tracking. Reduces bandwidth by only sending events (detected persons, anomalies), not raw video, to the cloud.
  2. Video ingestion: RTSP streams → FFmpeg for decoding → frame queue. Process every Nth frame (skip frames during low-activity periods). Adaptive frame rate based on detected activity.
  3. Detection pipeline: Batch frames from multiple cameras. Run detection on GPU with batch inference for throughput. NMS + tracking per camera stream.
  4. Activity recognition: For tracked persons, extract short clips (16 frames). Run lightweight action recognition model (MoViNet or X3D-S) to classify activities.
  5. Alert system: Rule engine + ML-based anomaly detector. Rules: person in restricted zone, loitering (>5 min same area), running. ML: autoencoder trained on normal behavior, flags high reconstruction error.
  6. Storage: Raw video to object storage (S3) with lifecycle policies. Events and metadata to database (PostgreSQL + pgvector for embedding search). Thumbnails and clips for quick review.
  7. Monitoring: Dashboard showing camera health, model latency P50/P95/P99, detection counts, false positive rate. Alert on camera failure, model degradation, or unusual patterns.

Scaling: Each Jetson Orin handles ~8 cameras. For 100 cameras, need ~13 edge nodes. Central GPU server (1x A100) for batch activity recognition and model updates.
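The adaptive frame-rate idea from step 2 can be sketched as a skip-interval policy. The thresholds and intervals here are illustrative; tune them against your alert-latency budget.

```python
def frames_to_process(activity_history, base_skip=5, active_skip=1):
    """Pick a frame-skip interval from recent detection counts.
    Process every frame while activity is present, every 5th otherwise
    (illustrative values)."""
    recent = sum(activity_history[-10:])   # detections in recent analyzed frames
    return active_skip if recent > 0 else base_skip

stream = [0] * 20 + [3, 2, 1] + [0] * 20   # per-frame person counts
history, processed = [], 0
for i, detections in enumerate(stream):
    skip = frames_to_process(history)
    if i % skip == 0:
        processed += 1
        history.append(detections)
    # skipped frames still go to storage, just not through the detector
print(f"analyzed {processed} of {len(stream)} frames")
```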

Q7: How do you evaluate and improve a CV model in production?

💡
Model Answer:

Production evaluation goes beyond offline metrics. A model that scores well on a test set may fail in production due to domain shift, edge cases, or integration issues.

Evaluation framework:

  1. Offline evaluation (pre-deployment): Standard metrics (mAP, mIoU, accuracy) on held-out test set. Slice analysis: break down performance by subgroups (day vs night, close vs far, small vs large objects). Robustness testing: evaluate on corrupted images (noise, blur, weather).
  2. Shadow deployment: Run the new model alongside the current production model. Compare predictions on live data without serving the new model's outputs to users. Measure agreement rate and investigate disagreements.
  3. A/B testing: Serve the new model to a subset of traffic (5–10%). Measure business metrics (detection rate, false alarm rate, user feedback). Run for 1–2 weeks for statistical significance.
  4. Continuous monitoring: Track inference latency (P50, P95, P99), prediction distribution, confidence scores, error rates. Set up automated alerts for anomalies.
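The slice analysis in step 1 can be sketched as a group-by over prediction records; the metadata field names below are illustrative.

```python
from collections import defaultdict

def slice_accuracy(records, slice_key):
    """Per-slice accuracy from prediction records. Each record holds a
    'correct' flag plus metadata fields to slice on."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        group = r[slice_key]
        totals[group] += 1
        hits[group] += int(r["correct"])
    return {g: hits[g] / totals[g] for g in totals}

records = [
    {"correct": True,  "lighting": "day",   "size": "large"},
    {"correct": True,  "lighting": "day",   "size": "small"},
    {"correct": False, "lighting": "night", "size": "small"},
    {"correct": True,  "lighting": "night", "size": "large"},
]
print(slice_accuracy(records, "lighting"))  # {'day': 1.0, 'night': 0.5}
```

A large gap between slices (here, night underperforming day) tells you where to target data collection.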

Improvement cycle:

  • Analyze failure modes: collect and categorize false positives and false negatives
  • Targeted data collection: gather more data for failing categories or conditions
  • Error analysis tools: build a UI where domain experts can review predictions, flag errors, and provide corrections
  • Retrain on expanded dataset, fine-tune on hard examples
  • Repeat: this data flywheel is the primary driver of improvement in production CV

Q8: What is knowledge distillation and how do you apply it to CV models?

💡
Model Answer:

Knowledge distillation trains a smaller "student" model to mimic the outputs of a larger "teacher" model, transferring the teacher's learned knowledge into a more efficient architecture.

How it works:

  • Soft labels: Instead of training the student on hard ground truth labels (one-hot), use the teacher's softmax output (probability distribution). Soft labels contain richer information — they encode inter-class similarities (e.g., "this is probably a cat, but it looks slightly like a lynx").
  • Temperature scaling: Divide logits by temperature T before softmax: softmax(logits/T). Higher T (e.g., 4–20) produces softer distributions that reveal more of this "dark knowledge".
  • Loss: L = alpha * KL(student_soft, teacher_soft) + (1-alpha) * CE(student, ground_truth). Alpha is typically 0.5–0.9. The KL divergence term transfers knowledge; the CE term keeps the student grounded in true labels.

CV-specific distillation techniques:

  • Feature distillation: Match intermediate features, not just outputs. The student learns to produce similar feature maps at each layer. Used in detection (mimicking FPN features) and segmentation.
  • Attention transfer: Match the spatial attention maps (where the model looks) between teacher and student.
  • Relational distillation: Match the relationships between sample embeddings, not individual outputs. More robust to architecture differences.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                       temperature=4.0, alpha=0.7):
    """Combined distillation + classification loss."""
    # Soft target loss (KL divergence on softened predictions)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd_loss *= temperature ** 2  # scale gradient magnitude

    # Hard target loss (standard classification)
    ce_loss = F.cross_entropy(student_logits, labels)

    return alpha * kd_loss + (1 - alpha) * ce_loss

Q9: How do you build a robust data augmentation pipeline for a specific CV task?

💡
Model Answer:

Data augmentation must be tailored to the task and domain. Blindly applying random transforms can hurt performance if they create unrealistic samples or destroy task-relevant information.

Design principles:

  1. Match real-world variation: If your cameras face lighting changes, augment with brightness/contrast. If objects appear at various angles, use rotation. If scale varies, use random resize/crop.
  2. Preserve task-relevant information: Do not horizontally flip chest X-rays (laterality matters). Do not color-jitter pathology slides where stain color is diagnostic. Do not rotate text if orientation matters.
  3. Match augmentation to architecture: Heavier augmentation for larger models (they can absorb more). Light augmentation for small models (too much noise hurts them).

Task-specific recommendations:

| Task | Recommended Augmentations | Avoid |
|---|---|---|
| Medical imaging | Elastic deformation, rotation, scaling, intensity shifts, patch-based training | Color jitter (stain colors are meaningful), heavy crops (might remove lesion) |
| Autonomous driving | Rain/fog/snow synthesis, day-night style transfer, camera intrinsics variation | Vertical flip (sky should be up), extreme rotation |
| Satellite imagery | Rotation (0/90/180/270), random crop, atmospheric haze simulation | Horizontal flip OK. Avoid perspective transforms (satellites are nadir-looking) |
| Industrial inspection | Lighting variation, defect copy-paste, rotation, background replacement | Heavy color changes if defect color is diagnostic |

Advanced technique — Copy-Paste augmentation: Cut objects from one image and paste them onto another image's background. Especially effective for instance segmentation. Used in YOLO and Mask R-CNN training. Significant improvement for rare object classes.
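The core of copy-paste augmentation is a masked pixel transplant. A minimal sketch on plain 2D grids (a real pipeline works on arrays, also transplants the object's box/mask labels, and handles occlusion):

```python
def copy_paste(src_img, src_mask, dst_img, top, left):
    """Paste the masked object from src_img onto dst_img at (top, left).
    Images are 2D grids of pixel values; src_mask is 1 inside the object."""
    out = [row[:] for row in dst_img]          # don't mutate the destination
    for i, row in enumerate(src_mask):
        for j, inside in enumerate(row):
            if inside and 0 <= top + i < len(out) and 0 <= left + j < len(out[0]):
                out[top + i][left + j] = src_img[i][j]
    return out

src = [[7, 7], [7, 7]]
mask = [[1, 0], [1, 1]]                        # object covers 3 of 4 pixels
dst = [[0] * 4 for _ in range(4)]
print(copy_paste(src, mask, dst, top=1, left=1))
# object pixels land at (1, 1), (2, 1), (2, 2)
```

Randomizing the paste position, scale, and source object per epoch is what makes this effective for rare classes.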

Q10: How do you handle multi-GPU and distributed training for large CV models?

💡
Model Answer:

Large CV models (ViT-Large, Swin-L) and big datasets require multi-GPU training. The two main strategies are data parallelism and model parallelism.

Data Parallel (most common for CV):

  • Each GPU holds a full copy of the model
  • Each GPU processes a different mini-batch
  • Gradients are synchronized (all-reduce) across GPUs after each step
  • PyTorch DDP: DistributedDataParallel — the standard. Overlaps gradient communication with backward pass for efficiency. Use torchrun to launch.

Key considerations:

  • Effective batch size: If each GPU processes 32 images and you have 8 GPUs, effective batch size is 256. Scale learning rate linearly: lr = base_lr * num_gpus (with warmup).
  • Batch normalization: Standard BN uses per-GPU statistics. With small per-GPU batches, use SyncBatchNorm to compute statistics across all GPUs.
  • Mixed precision: Always use AMP (Automatic Mixed Precision) with GradScaler. Free 2x speedup and 50% memory reduction.
  • Gradient accumulation: If each GPU cannot fit the desired batch size in memory, accumulate gradients over multiple forward passes before updating.
# Launch: torchrun --nproc_per_node=4 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def train_ddp():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    device = torch.device(f"cuda:{local_rank}")

    model = MyModel().to(device)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[local_rank])

    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3 * 4)
    scaler = torch.amp.GradScaler("cuda")

    for epoch in range(100):
        sampler.set_epoch(epoch)  # ensure different shuffling each epoch
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()

            with torch.amp.autocast("cuda"):
                loss = model(images, labels)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    dist.destroy_process_group()
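The gradient accumulation mentioned above can be sketched standalone. The toy model and batch sizes are illustrative; inside the DDP loop the same pattern wraps the `scaler.scale(...).backward()` / `scaler.step(...)` calls.

```python
import torch

model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4        # effective batch = per-step batch x accum_steps
updates = 0

opt.zero_grad()
for step in range(8):
    x = torch.randn(4, 8)
    y = torch.randint(0, 2, (4,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        opt.step()                    # update once per accum_steps micro-batches
        opt.zero_grad()
        updates += 1

print(f"optimizer steps: {updates}")  # 2 updates over 8 micro-batches
```

Note that with DDP, gradients all-reduce on every `backward()`; use `model.no_sync()` on the non-final micro-batches to avoid redundant communication.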