Advanced CV Topics

These 10 questions cover cutting-edge computer vision topics that senior and research-oriented roles test heavily. Vision transformers, self-supervised learning, GANs, 3D vision, and video understanding are increasingly expected knowledge in 2024–2026 interviews.

Q1: Explain the Vision Transformer (ViT). How does it differ from CNNs?

💡
Model Answer:

ViT (Dosovitskiy et al., 2020) applies the standard transformer encoder directly to image patches, treating image recognition as a sequence-to-one problem.

Architecture:

  1. Patch embedding: Split image into fixed-size patches (typically 16x16). Flatten each patch and project through a linear layer to get patch tokens. A 224x224 image yields 196 tokens (14x14 grid).
  2. Position embeddings: Learnable 1D positional embeddings added to each patch token (the model learns spatial relationships).
  3. [CLS] token: A learnable classification token prepended to the sequence. Its output serves as the image representation for classification.
  4. Transformer encoder: Standard multi-head self-attention + FFN blocks (12 layers for ViT-B, 24 for ViT-L).
  5. Classification head: MLP head on the [CLS] token output.
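The patch-embedding front end (steps 1–3) can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the real model: the projection matrix, [CLS] token, and position embeddings below are random placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

img = rng.standard_normal((224, 224, 3))   # one RGB image
P, D = 16, 768                             # patch size, embedding dim (ViT-B)

# Step 1: split into non-overlapping 16x16 patches and flatten each one.
patches = img.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)   # (196, 768): 14x14 grid of tokens

# Linear projection of flattened patches (random stand-in for learned weights).
W = rng.standard_normal((P * P * 3, D)) / np.sqrt(P * P * 3)
tokens = patches @ W                       # (196, 768)

# Steps 2-3: prepend a learnable [CLS] token, then add position embeddings.
cls = rng.standard_normal((1, D))
pos = rng.standard_normal((197, D)) * 0.02
seq = np.concatenate([cls, tokens]) + pos  # (197, 768) -> transformer encoder

print(seq.shape)
```

Note how the 224x224 image becomes exactly 196 patch tokens plus one [CLS] token, matching the sequence length quoted above.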

Key differences from CNNs:

| Aspect | CNN | ViT |
| --- | --- | --- |
| Inductive bias | Strong: locality (convolution) + translation equivariance | Weak: only sequence order. Must learn spatial relationships from data |
| Receptive field | Grows gradually through depth | Global from layer 1 (self-attention sees all patches) |
| Data efficiency | Better with limited data due to inductive biases | Needs large datasets (ImageNet-21K or JFT-300M) or strong augmentation |
| Compute scaling | Plateaus at large scale | Scales better: performance improves consistently with more data and compute |

Modern variants: DeiT (data-efficient training with distillation), Swin Transformer (shifted window attention for efficiency), BEiT (BERT-style pretraining for vision).

Q2: What is self-supervised learning in computer vision? Compare MAE and DINO.

💡
Model Answer:

Self-supervised learning (SSL) learns visual representations from unlabeled images by creating pretext tasks. The model learns features that transfer well to downstream tasks without requiring manual labels.

MAE (Masked Autoencoder, He et al., 2022):

  • Approach: Mask 75% of image patches randomly, encode only visible patches with ViT, then decode to reconstruct masked patches
  • Why 75% masking works: Images have high spatial redundancy (unlike text). Masking most patches forces the model to learn semantic features, not just interpolate from neighbors
  • Efficiency: Only 25% of patches go through the encoder, making pretraining 3x faster than standard ViT training
  • Strength: Excellent for fine-tuning on downstream tasks. Simple and scalable
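MAE's random masking is simple enough to sketch directly; the patch count follows ViT-B on a 224x224 image (196 patches), and the token values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, mask_ratio = 196, 0.75
num_keep = int(num_patches * (1 - mask_ratio))   # 49 visible patches

perm = rng.permutation(num_patches)
visible_idx = np.sort(perm[:num_keep])           # fed to the encoder
masked_idx = np.sort(perm[num_keep:])            # reconstructed by the decoder

# Only the visible 25% of tokens goes through the (expensive) ViT encoder;
# this is the source of MAE's pretraining speedup.
patch_tokens = rng.standard_normal((num_patches, 768))
encoder_input = patch_tokens[visible_idx]        # (49, 768)

print(encoder_input.shape, len(masked_idx))
```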

DINO / DINOv2 (Meta, 2021/2023):

  • Approach: Self-distillation with no labels. Student network sees local crops, teacher sees global crops. Student learns to match teacher outputs. Teacher is an exponential moving average (EMA) of the student
  • Key property: Learns features that contain explicit object boundaries. Attention maps from DINO naturally segment objects without any segmentation training
  • DINOv2: Scaled up with curated data (LVD-142M dataset), combining self-distillation with masked image modeling. Produces universal features that work across classification, segmentation, depth estimation, and retrieval without fine-tuning

When to use which: MAE is best when you will fine-tune on your specific task. DINOv2 is best for frozen feature extraction or when you need universal features across multiple tasks.

Q3: Explain how GANs work. What are common failure modes?

💡
Model Answer:

GANs consist of two networks trained adversarially:

  • Generator (G): Maps random noise z ~ N(0,1) to synthetic images. Goal: produce images indistinguishable from real data.
  • Discriminator (D): Binary classifier that distinguishes real images from generated ones. Goal: correctly identify fakes.

Training objective: min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]. The generator and discriminator play a minimax game. At equilibrium, G produces perfect samples and D outputs 0.5 for everything.
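The equilibrium claim can be checked numerically. The sketch below evaluates the minimax objective for batches of discriminator outputs; the batch values are made up for illustration:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """E[log D(x)] + E[log(1 - D(G(z)))] for batches of D outputs."""
    return np.mean(np.log(d_real)) + np.mean(np.log1p(-d_fake))

# At the theoretical equilibrium D(x) = D(G(z)) = 0.5, the objective
# equals log(0.5) + log(0.5) = -log(4).
v_eq = gan_value(np.full(8, 0.5), np.full(8, 0.5))
print(np.isclose(v_eq, -np.log(4)))

# In practice the generator maximizes log D(G(z)) ("non-saturating" loss)
# instead of minimizing log(1 - D(G(z))): when D easily rejects fakes
# (D(G(z)) near 0), the non-saturating loss still gives a large signal.
g_loss_ns = -np.mean(np.log(np.full(8, 0.1)))
```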

Common failure modes:

| Problem | Symptoms | Solutions |
| --- | --- | --- |
| Mode collapse | Generator produces only a few distinct samples regardless of input noise | Minibatch discrimination, unrolled GAN, progressive growing, Wasserstein loss |
| Training instability | Loss oscillates wildly, D wins too easily, or both losses plateau | Spectral normalization, gradient penalty (WGAN-GP), two-timescale update rule (TTUR) |
| Vanishing gradients | D is too good → G gets zero gradient signal. No learning occurs | Wasserstein distance (WGAN), non-saturating GAN loss, label smoothing |
| Evaluation difficulty | No single loss value indicates quality. Low G loss does not mean good images | FID (Fréchet Inception Distance), IS (Inception Score), visual inspection |

Key GAN variants for CV: StyleGAN (high-quality face generation with style control), Pix2Pix (paired image translation), CycleGAN (unpaired translation), GauGAN (semantic layout to photo). Note that diffusion models have largely superseded GANs for image generation since 2022.

Q4: How do diffusion models work? Why have they surpassed GANs?

💡
Model Answer:

Diffusion models learn to generate images by reversing a gradual noising process:

  1. Forward process (fixed): Gradually add Gaussian noise to a clean image over T timesteps until it becomes pure noise. Each step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1-alpha_t) * noise
  2. Reverse process (learned): A neural network (typically U-Net or Transformer) learns to predict and remove the noise at each step, recovering the original image from noise
  3. Training: Sample a random timestep t, add noise to a clean image to get x_t, and train the network to predict the added noise. Simple MSE loss: L = ||noise - noise_pred(x_t, t)||^2
  4. Inference: Start from pure noise and iteratively denoise for T steps (typically 20–50 with DDIM sampling)
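The forward process in step 1 can be simulated directly. The sketch below uses a made-up linear beta schedule (the values are illustrative, not from any specific paper) and shows that after T steps a unit-variance signal is reduced to approximately standard Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear noise schedule
alphas = 1.0 - betas

x = rng.standard_normal(10_000)      # "clean image" as unit-variance data
for t in range(T):
    eps = rng.standard_normal(10_000)
    # x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise
    x = np.sqrt(alphas[t]) * x + np.sqrt(1.0 - alphas[t]) * eps

# After T steps the signal is destroyed: x_T is approximately N(0, 1).
print(x.mean(), x.std())
```

Each step preserves unit variance in expectation (alpha_t + (1 - alpha_t) = 1), which is why the end state is pure standard noise rather than a blown-up signal.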

Why they surpassed GANs:

  • Training stability: Simple MSE loss. No adversarial training, no mode collapse, no training tricks needed
  • Mode coverage: Diffusion models cover the full data distribution, generating diverse outputs. GANs tend toward mode collapse
  • Quality: Achieve lower FID scores than GANs on most benchmarks
  • Controllability: Classifier-free guidance enables precise control over generation. Text conditioning (Stable Diffusion, DALL-E) is natural with cross-attention

Key models: DDPM (foundational), Stable Diffusion / SDXL (latent diffusion in compressed space for efficiency), DALL-E 3, Imagen, DiT (Diffusion Transformer — replaces U-Net with transformer, used in Sora).

Limitation: Slower than GANs at inference (multiple denoising steps vs single forward pass). Consistency models and distillation (LCM) address this, achieving 1–4 step generation.

Q5: What is contrastive learning in vision? Explain CLIP.

💡
Model Answer:

Contrastive learning trains models by pulling together representations of similar pairs and pushing apart dissimilar pairs in embedding space.

CLIP (Contrastive Language-Image Pretraining, OpenAI 2021):

  • Architecture: Two encoders — an image encoder (ViT or ResNet) and a text encoder (Transformer). Both produce embeddings in a shared space.
  • Training: Given a batch of N (image, text) pairs, CLIP maximizes cosine similarity between matched pairs and minimizes it for unmatched pairs. InfoNCE loss with temperature scaling.
  • Data: 400M image-text pairs scraped from the internet (WIT dataset).
  • Zero-shot classification: To classify an image, compute cosine similarity between the image embedding and text embeddings of class descriptions (e.g., "a photo of a cat", "a photo of a dog"). No training on target classes needed.
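The zero-shot procedure reduces to an argmax over cosine similarities. In this sketch, real CLIP encoders are replaced by made-up unit vectors, and the image embedding is deliberately constructed near the "dog" prompt:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_emb = normalize(rng.standard_normal((3, 512)))   # stand-in text encoder

# Pretend the image encoder returned something close to the "dog" prompt.
image_emb = normalize(text_emb[1] + 0.1 * rng.standard_normal(512))

# Zero-shot prediction = argmax cosine similarity (temperature omitted).
sims = image_emb @ text_emb.T
print(labels[int(np.argmax(sims))])   # "a photo of a dog"
```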

Why CLIP is significant:

  • Zero-shot transfer: Competitive with supervised ResNet-50 on ImageNet without seeing any ImageNet training data
  • Open vocabulary: Can recognize any concept describable in text, not limited to a fixed label set
  • Foundation for multimodal AI: CLIP embeddings are used in Stable Diffusion (image generation), LLaVA (vision-language models), and many retrieval systems

Limitations: Struggles with compositional reasoning ("a red cube on a blue sphere"), counting, spatial relationships, and fine-grained recognition without additional training.

Q6: What is 3D computer vision? Name the main tasks and approaches.

💡
Model Answer:

3D computer vision processes and understands three-dimensional structure from visual data. It is critical for autonomous driving, robotics, AR/VR, and manufacturing.

Main tasks:

TaskInputOutputKey Methods
Monocular Depth EstimationSingle RGB imageDense depth mapMiDaS, DPT, Depth Anything. Use relative depth cues (texture gradients, occlusion, perspective)
Stereo MatchingStereo image pairDisparity map → depthRAFT-Stereo, LEAStereo. Computes per-pixel correspondence between left and right views
3D Object DetectionLiDAR point cloud and/or images3D bounding boxes (x,y,z,w,h,l,yaw)PointPillars, CenterPoint, BEVFormer. LiDAR-camera fusion is state-of-the-art
Point Cloud Processing3D point cloudClassification, segmentationPointNet, PointNet++, Point Transformer. Key challenge: unordered, irregular data
Neural Radiance Fields (NeRF)Multiple posed imagesNovel view synthesisNeRF, Instant-NGP, 3D Gaussian Splatting. Represent scenes as continuous volumetric functions
SLAMVideo streamCamera trajectory + 3D mapORB-SLAM3, DROID-SLAM. Simultaneous localization and mapping for robotics/AR
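A small worked example behind stereo matching: once a disparity map is computed, metric depth follows from Z = f · B / d (focal length times baseline over disparity). The camera numbers below are made up for illustration:

```python
import numpy as np

f = 700.0    # focal length in pixels (assumed)
B = 0.12     # stereo baseline in meters (assumed)

# Per-pixel disparities, e.g. from RAFT-Stereo. Larger disparity = closer.
disparity = np.array([70.0, 35.0, 7.0])
depth = f * B / disparity                 # meters

print(depth)   # 1.2 m, 2.4 m, 12.0 m
```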

3D Gaussian Splatting (2023): Represents scenes as millions of 3D Gaussians with learned positions, covariances, colors, and opacities. Renders via splatting (projecting Gaussians to 2D). 100x faster than NeRF for rendering, enabling real-time novel view synthesis.

Q7: How do video understanding models work? Compare frame-level vs temporal approaches.

💡
Model Answer:

Video understanding requires processing spatial (per-frame) and temporal (across-frame) information. The key challenge is computational cost: a 10-second video at 30fps has 300 frames.

| Approach | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Frame-level (2D CNN + pooling) | Process each frame independently with a 2D CNN, aggregate predictions (average, max, or attention pooling) | Simple, can use pretrained ImageNet models, fast | Ignores temporal relationships. Cannot understand motion or actions that span frames |
| 3D CNNs (C3D, I3D, SlowFast) | Use 3D convolutions (spatial + temporal) to process video clips jointly | Captures spatiotemporal patterns. SlowFast uses dual pathways at different frame rates | Computationally expensive. 3D convolutions are memory-intensive. Limited temporal range |
| Video transformers (TimeSformer, ViViT) | Apply self-attention across space and time. Factored attention (spatial-then-temporal) for efficiency | Long-range temporal modeling. State-of-the-art accuracy | Very expensive. Quadratic complexity in sequence length. Requires significant compute for pretraining |
| Video-language models (VideoCLIP, InternVideo) | Align video and text representations in a shared space using contrastive learning | Zero-shot video understanding, open-vocabulary recognition | Requires massive pretraining data. Still struggles with temporal reasoning |

Practical approach for most tasks: Sample sparse frames (e.g., 8–16 frames uniformly), process with a pretrained ViT backbone, add a lightweight temporal head (temporal attention or MLP). This balances accuracy and compute for action recognition, video classification, and temporal localization.
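The sparse sampling step can be sketched as a one-liner: split the clip into equal segments and take each segment's center frame. Segment-center sampling is one common choice among several (random-per-segment sampling is another):

```python
import numpy as np

def sample_frames(num_frames: int, num_samples: int) -> np.ndarray:
    """Return the center frame index of each of `num_samples` equal segments."""
    edges = np.linspace(0, num_frames, num_samples + 1)
    return ((edges[:-1] + edges[1:]) / 2).astype(int)

# 8 frames from a 300-frame clip (10 s at 30 fps).
idx = sample_frames(300, 8)
print(idx)   # [ 18  56  93 131 168 206 243 281]
```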

Q8: What are vision-language models (VLMs)? Give examples and applications.

💡
Model Answer:

VLMs jointly understand images and text, enabling tasks that require both visual and linguistic reasoning.

Key models:

  • CLIP (OpenAI, 2021): Contrastive image-text alignment. Zero-shot classification, image retrieval, as backbone for generation models.
  • LLaVA (2023): Connects a vision encoder (CLIP ViT) to an LLM (LLaMA/Vicuna) with a simple linear projection. Enables visual question answering, image description, and reasoning about images.
  • GPT-4V / GPT-4o (OpenAI): Multimodal LLM that natively processes images alongside text. State-of-the-art on visual reasoning benchmarks.
  • Florence-2 (Microsoft, 2024): Unified model for captioning, detection, segmentation, and OCR using a sequence-to-sequence architecture with special tokens for spatial outputs.

Applications:

  • Visual question answering (VQA): "What color is the car in the background?"
  • Image captioning and description
  • Visual grounding: "Point to the person wearing a red hat"
  • Document understanding: parsing invoices, receipts, charts
  • Robotic instruction following: understanding visual scenes from natural language commands

Interview insight: The trend is toward unified architectures that handle all vision tasks through language. Instead of separate models for detection, segmentation, and captioning, a single VLM can do all of them by formulating outputs as text sequences.

Q9: What is optical flow and how is it used in video analysis?

💡
Model Answer:

Optical flow estimates the 2D motion field between consecutive video frames. For each pixel in frame t, it predicts where that pixel moves to in frame t+1, producing a dense displacement map (u, v) per pixel.

Classical methods:

  • Lucas-Kanade: Assumes constant motion in a local window. Solves a least-squares problem. Fast but only works for small motions.
  • Horn-Schunck: Global method with smoothness regularization. Handles larger motions but slower.
  • Farneback: Polynomial expansion method. Good balance of speed and quality. Default in OpenCV.
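Lucas-Kanade's least-squares step can be shown end to end on a toy example. The sketch stacks the brightness-constancy constraints Ix·u + Iy·v = -It over all pixels and solves for (u, v); the synthetic Gaussian blob and its 0.5 px shift are made up for the demo:

```python
import numpy as np

y, x = np.mgrid[0:64, 0:64].astype(float)

def blob(cx, cy, sigma=8.0):
    """Synthetic smooth image: a Gaussian blob centered at (cx, cy)."""
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma**2))

frame1 = blob(32.0, 32.0)
frame2 = blob(32.5, 32.0)            # true motion: u = 0.5 px, v = 0

Iy_g, Ix_g = np.gradient(frame1)     # spatial gradients
It = frame2 - frame1                 # temporal gradient

# One constraint Ix*u + Iy*v = -It per pixel; solve in least squares.
A = np.stack([Ix_g.ravel(), Iy_g.ravel()], axis=1)
b = -It.ravel()
(u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
print(u, v)                          # close to (0.5, 0.0)
```

The estimate is accurate here because the motion is small and the image is smooth, which is exactly the regime where Lucas-Kanade's linearization holds.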

Deep learning methods:

  • FlowNet / FlowNet2: First end-to-end CNN for optical flow. Correlation layer computes feature matching.
  • RAFT (2020): State-of-the-art. Builds a 4D correlation volume between all pairs of pixels, then iteratively refines flow using a GRU-based update operator. Key insight: iterative refinement allows handling large displacements.
  • GMFlow (2022): Uses transformer-based global matching for better long-range correspondence.

Applications: Action recognition (two-stream networks use optical flow as motion input), video stabilization, frame interpolation, video object tracking, autonomous driving (ego-motion estimation), and video compression.

Q10: What is multi-object tracking (MOT) and what are the main approaches?

💡
Model Answer:

MOT assigns consistent identity labels to multiple objects across video frames. The challenge is maintaining identities through occlusions, appearance changes, and crowded scenes.

Tracking-by-detection paradigm (dominant approach):

  1. Detection: Run an object detector (YOLOv8, Faster R-CNN) on each frame independently
  2. Association: Match detections across frames using appearance features, motion prediction, and spatial proximity
  3. Track management: Initialize new tracks for unmatched detections, terminate tracks for missing objects, handle re-identification
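Step 2 (association) can be sketched with greedy IoU matching; this is a simplified, hypothetical stand-in for the Hungarian assignment that SORT-style trackers use:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily match track boxes to detection boxes by descending IoU."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score >= iou_thresh and ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches   # unmatched detections would start new tracks

tracks = [(10, 10, 50, 50), (100, 100, 140, 140)]
dets = [(102, 98, 142, 138), (12, 12, 52, 52), (200, 200, 240, 240)]
print(sorted(associate(tracks, dets)))   # [(0, 1), (1, 0)]; det 2 is unmatched
```

Real trackers add a Kalman-predicted box for each track before matching, so the IoU is computed against where the object is expected to be, not where it last was.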

Key methods:

| Method | Approach | When to Use |
| --- | --- | --- |
| SORT | Kalman filter for motion prediction + Hungarian algorithm for assignment using IoU | Simple, fast, works when objects rarely occlude. Good baseline |
| DeepSORT | SORT + deep appearance features (ReID network) for better re-identification after occlusion | Standard choice for most applications. Balances speed and accuracy |
| ByteTrack | Uses low-confidence detections (which SORT discards) in a second association step to recover occluded objects | Crowded scenes where objects are frequently partially occluded |
| BoT-SORT | Combines camera motion compensation, improved Kalman filter, and strong ReID features | State-of-the-art accuracy when compute is available |

Evaluation metrics: MOTA (Multiple Object Tracking Accuracy), IDF1 (ID F1 score — measures identity preservation), HOTA (Higher Order Tracking Accuracy — combines detection and association quality).