Advanced CV Topics
These 10 questions cover cutting-edge computer vision topics that senior and research-oriented roles test heavily. Vision transformers, self-supervised learning, GANs, 3D vision, and video understanding are increasingly expected knowledge in 2024–2026 interviews.
Q1: Explain the Vision Transformer (ViT). How does it differ from CNNs?
ViT (Dosovitskiy et al., 2020) applies the standard transformer encoder directly to image patches, treating an image as a sequence of patch tokens and classification as a prediction from that sequence.
Architecture:
- Patch embedding: Split image into fixed-size patches (typically 16x16). Flatten each patch and project through a linear layer to get patch tokens. A 224x224 image yields 196 tokens (14x14 grid).
- Position embeddings: Learnable 1D positional embeddings added to each patch token (the model learns spatial relationships).
- [CLS] token: A learnable classification token prepended to the sequence. Its output serves as the image representation for classification.
- Transformer encoder: Standard multi-head self-attention + FFN blocks (12 layers for ViT-B, 24 for ViT-L).
- Classification head: MLP head on the [CLS] token output.
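The patch-embedding step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real ViT: the projection matrix, [CLS] token, and position embeddings are random stand-ins for what would be learned parameters.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened patch vectors."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    patches = img.reshape(gh, patch, gw, patch, C)        # split rows and columns
    patches = patches.transpose(0, 2, 1, 3, 4)            # (gh, gw, patch, patch, C)
    return patches.reshape(gh * gw, patch * patch * C)    # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3)).astype(np.float32)
tokens = patchify(img)                                    # (196, 768): the 14x14 grid

d_model = 768
W_proj = rng.normal(0, 0.02, (tokens.shape[1], d_model))  # stand-in for the learned linear projection
x = tokens @ W_proj                                       # (196, d_model) patch embeddings

cls = np.zeros((1, d_model))                              # [CLS] token (learnable in a real ViT)
pos = rng.normal(0, 0.02, (197, d_model))                 # learnable 1D position embeddings
seq = np.concatenate([cls, x], axis=0) + pos              # transformer input: (197, d_model)
```

Note that a 16x16x3 patch flattens to exactly 768 values, which is why ViT-B's hidden size feels natural here.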
Key differences from CNNs:
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong: locality (convolution) + translation equivariance | Weak: only sequence order. Must learn spatial relationships from data |
| Receptive field | Grows gradually through depth | Global from layer 1 (self-attention sees all patches) |
| Data efficiency | Better with limited data due to inductive biases | Needs large datasets (ImageNet-21K or JFT-300M) or strong augmentation |
| Compute scaling | Plateaus at large scale | Scales better: performance improves consistently with more data and compute |
Modern variants: DeiT (data-efficient training with distillation), Swin Transformer (shifted window attention for efficiency), BEiT (BERT-style pretraining for vision).
Q2: What is self-supervised learning in computer vision? Compare MAE and DINO.
Self-supervised learning (SSL) learns visual representations from unlabeled images by creating pretext tasks. The model learns features that transfer well to downstream tasks without requiring manual labels.
MAE (Masked Autoencoder, He et al., 2022):
- Approach: Mask 75% of image patches randomly, encode only visible patches with ViT, then decode to reconstruct masked patches
- Why 75% masking works: Images have high spatial redundancy (unlike text). Masking most patches forces the model to learn semantic features, not just interpolate from neighbors
- Efficiency: Only 25% of patches go through the encoder, making pretraining 3x faster than standard ViT training
- Strength: Excellent for fine-tuning on downstream tasks. Simple and scalable
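MAE's random masking step can be sketched as follows. This is an illustrative NumPy version of the idea (keep a random 25% of tokens, mark the rest for reconstruction), not the authors' implementation.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens, as in MAE-style 75% masking."""
    rng = rng or np.random.default_rng(0)
    N = tokens.shape[0]
    n_keep = int(N * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(N)[:n_keep])  # indices of visible patches
    mask = np.ones(N, dtype=bool)
    mask[keep_idx] = False                           # True = masked (to be reconstructed)
    return tokens[keep_idx], keep_idx, mask

tokens = np.random.default_rng(1).random((196, 768))  # 196 patch tokens from a 224x224 image
visible, keep_idx, mask = random_masking(tokens)
# Only the 49 visible tokens go through the encoder; 147 are reconstructed by the decoder
```

The efficiency claim follows directly: the expensive encoder sees only `visible`, a quarter of the sequence.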
DINO / DINOv2 (Meta, 2021/2023):
- Approach: Self-distillation with no labels. Student network sees local crops, teacher sees global crops. Student learns to match teacher outputs. Teacher is an exponential moving average (EMA) of the student
- Key property: Learns features that contain explicit object boundaries. Attention maps from DINO naturally segment objects without any segmentation training
- DINOv2: Scaled up with curated data (LVD-142M dataset), combining self-distillation with masked image modeling. Produces universal features that work across classification, segmentation, depth estimation, and retrieval without fine-tuning
When to use which: MAE is best when you will fine-tune on your specific task. DINOv2 is best for frozen feature extraction or when you need universal features across multiple tasks.
Q3: Explain how GANs work. What are common failure modes?
GANs consist of two networks trained adversarially:
- Generator (G): Maps random noise z ~ N(0, I) to synthetic images. Goal: produce images indistinguishable from real data.
- Discriminator (D): Binary classifier that distinguishes real images from generated ones. Goal: correctly identify fakes.
Training objective: min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]. The generator and discriminator play a minimax game. At the theoretical equilibrium, G matches the data distribution and D outputs 0.5 for every input.
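The objective can be made concrete by evaluating both losses on toy discriminator logits (the numbers below are made up for illustration). The snippet also shows the non-saturating generator loss, a common fix for the vanishing-gradient failure mode discussed below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy discriminator logits on a batch of real and generated images
d_real = np.array([2.0, 1.5, 3.0])
d_fake = np.array([-1.0, -2.0, 0.5])

# Discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))]; we minimize the negative
d_loss = -(np.log(sigmoid(d_real)).mean() + np.log(1 - sigmoid(d_fake)).mean())

# Minimax generator loss: minimize E[log(1 - D(G(z)))] -- saturates when D is confident
g_loss_minimax = np.log(1 - sigmoid(d_fake)).mean()

# Non-saturating variant: maximize E[log D(G(z))] -- stronger gradients early in training
g_loss_ns = -np.log(sigmoid(d_fake)).mean()
```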
Common failure modes:
| Problem | Symptoms | Solutions |
|---|---|---|
| Mode collapse | Generator produces only a few distinct samples regardless of input noise | Minibatch discrimination, unrolled GAN, progressive growing, Wasserstein loss |
| Training instability | Loss oscillates wildly, D wins too easily, or both losses plateau | Spectral normalization, gradient penalty (WGAN-GP), two-timescale update rule (TTUR) |
| Vanishing gradients | D is too good → G gets zero gradient signal. No learning occurs | Use Wasserstein distance (WGAN), non-saturating GAN loss, label smoothing |
| Evaluation difficulty | No single loss value indicates quality. Low G loss does not mean good images | Use FID (Frechet Inception Distance), IS (Inception Score), visual inspection |
Key GAN variants for CV: StyleGAN (high-quality face generation with style control), Pix2Pix (paired image translation), CycleGAN (unpaired translation), GauGAN (semantic layout to photo). Note that diffusion models have largely superseded GANs for image generation since 2022.
Q4: How do diffusion models work? Why have they surpassed GANs?
Diffusion models learn to generate images by reversing a gradual noising process:
- Forward process (fixed): Gradually add Gaussian noise to a clean image over T timesteps until it becomes pure noise. Each step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise
- Reverse process (learned): A neural network (typically a U-Net or Transformer) learns to predict and remove the noise at each step, recovering the original image from noise
- Training: Sample a random timestep t, add noise to a clean image to get x_t, and train the network to predict the added noise. Simple MSE loss: L = ||noise - noise_pred(x_t, t)||^2
- Inference: Start from pure noise and iteratively denoise. Full DDPM sampling takes all T steps; DDIM and similar samplers typically need only 20–50
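The forward process above can be simulated directly in NumPy. The linear beta schedule is an assumption borrowed from DDPM; the closed form via the cumulative product of alphas is what makes training efficient (you can jump to any timestep without looping).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (DDPM-style assumption)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative product used in the closed form

x0 = rng.random(64)                  # toy "clean image" (flattened)

# Stepwise forward process: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(1 - alphas[t]) * rng.normal(size=x.shape)

# Equivalent closed form jumps straight to any timestep t in one shot:
t = 500
eps = rng.normal(size=x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
# Training would regress eps from (x_t, t): loss = ||eps - eps_pred(x_t, t)||^2
```

By the final step alpha_bar is vanishingly small, so x_T is essentially pure Gaussian noise regardless of x0.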
Why they surpassed GANs:
- Training stability: Simple MSE loss. No adversarial training, no mode collapse, no training tricks needed
- Mode coverage: Diffusion models cover the full data distribution, generating diverse outputs. GANs tend toward mode collapse
- Quality: Achieve lower FID scores than GANs on most benchmarks
- Controllability: Classifier-free guidance enables precise control over generation. Text conditioning (Stable Diffusion, DALL-E) is natural with cross-attention
Key models: DDPM (foundational), Stable Diffusion / SDXL (latent diffusion in compressed space for efficiency), DALL-E 3, Imagen, DiT (Diffusion Transformer — replaces U-Net with transformer, used in Sora).
Limitation: Slower than GANs at inference (multiple denoising steps vs single forward pass). Consistency models and distillation (LCM) address this, achieving 1–4 step generation.
Q5: What is contrastive learning in vision? Explain CLIP.
Contrastive learning trains models by pulling together representations of similar pairs and pushing apart dissimilar pairs in embedding space.
CLIP (Contrastive Language-Image Pretraining, OpenAI 2021):
- Architecture: Two encoders — an image encoder (ViT or ResNet) and a text encoder (Transformer). Both produce embeddings in a shared space.
- Training: Given a batch of N (image, text) pairs, CLIP maximizes cosine similarity between matched pairs and minimizes it for unmatched pairs. InfoNCE loss with temperature scaling.
- Data: 400M image-text pairs scraped from the internet (WIT dataset).
- Zero-shot classification: To classify an image, compute cosine similarity between the image embedding and text embeddings of class descriptions (e.g., "a photo of a cat", "a photo of a dog"). No training on target classes needed.
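The zero-shot procedure reduces to a cosine-similarity softmax. The sketch below uses random unit vectors as stand-ins for real encoder outputs (a trained CLIP would produce meaningful embeddings); the temperature value of 100 mirrors CLIP's learned logit scale but is an assumption here.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 512
# Stand-ins for the image encoder and text encoder outputs
img_emb = normalize(rng.normal(size=(1, d)))
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
txt_emb = normalize(rng.normal(size=(len(prompts), d)))

# Zero-shot classification: temperature-scaled cosine similarity, then softmax
logits = 100.0 * (img_emb @ txt_emb.T)
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
pred = prompts[int(np.argmax(probs))]
```

No gradient step touches the target classes; swapping in new prompt strings changes the label set for free.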
Why CLIP is significant:
- Zero-shot transfer: Competitive with supervised ResNet-50 on ImageNet without seeing any ImageNet training data
- Open vocabulary: Can recognize any concept describable in text, not limited to a fixed label set
- Foundation for multimodal AI: CLIP embeddings are used in Stable Diffusion (image generation), LLaVA (vision-language models), and many retrieval systems
Limitations: Struggles with compositional reasoning ("a red cube on a blue sphere"), counting, spatial relationships, and fine-grained recognition without additional training.
Q6: What is 3D computer vision? Name the main tasks and approaches.
3D computer vision processes and understands three-dimensional structure from visual data. It is critical for autonomous driving, robotics, AR/VR, and manufacturing.
Main tasks:
| Task | Input | Output | Key Methods |
|---|---|---|---|
| Monocular Depth Estimation | Single RGB image | Dense depth map | MiDaS, DPT, Depth Anything. Use relative depth cues (texture gradients, occlusion, perspective) |
| Stereo Matching | Stereo image pair | Disparity map → depth | RAFT-Stereo, LEAStereo. Computes per-pixel correspondence between left and right views |
| 3D Object Detection | LiDAR point cloud and/or images | 3D bounding boxes (x,y,z,w,h,l,yaw) | PointPillars, CenterPoint, BEVFormer. LiDAR-camera fusion is state-of-the-art |
| Point Cloud Processing | 3D point cloud | Classification, segmentation | PointNet, PointNet++, Point Transformer. Key challenge: unordered, irregular data |
| Neural Radiance Fields (NeRF) | Multiple posed images | Novel view synthesis | NeRF, Instant-NGP, 3D Gaussian Splatting. Represent scenes as continuous volumetric functions |
| SLAM | Video stream | Camera trajectory + 3D map | ORB-SLAM3, DROID-SLAM. Simultaneous localization and mapping for robotics/AR |
3D Gaussian Splatting (2023): Represents scenes as millions of 3D Gaussians with learned positions, covariances, colors, and opacities. Renders via splatting (projecting Gaussians to 2D). 100x faster than NeRF for rendering, enabling real-time novel view synthesis.
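The "disparity map → depth" step in the stereo row of the table is worth knowing in closed form: for a rectified pair, depth = focal_length * baseline / disparity. The focal length and baseline below are hypothetical values for illustration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Rectified-stereo relation: depth = f * B / d (f in pixels, B in meters)."""
    return focal_px * baseline_m / np.maximum(disparity, eps)

disp = np.array([[64.0, 32.0],
                 [16.0,  8.0]])                  # toy disparity map in pixels
depth = disparity_to_depth(disp, focal_px=800.0, baseline_m=0.12)
# f * B = 96, so depths are [[1.5, 3.0], [6.0, 12.0]] meters
```

The inverse relationship explains why stereo depth error grows quadratically with distance: at large depths a one-pixel disparity error shifts the estimate by many meters.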
Q7: How do video understanding models work? Compare frame-level vs temporal approaches.
Video understanding requires processing spatial (per-frame) and temporal (across-frame) information. The key challenge is computational cost: a 10-second video at 30fps has 300 frames.
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Frame-level (2D CNN + pooling) | Process each frame independently with a 2D CNN, aggregate predictions (average, max, or attention pooling) | Simple, can use pretrained ImageNet models, fast | Ignores temporal relationships. Cannot understand motion or actions that span frames |
| 3D CNNs (C3D, I3D, SlowFast) | Use 3D convolutions (spatial + temporal) to process video clips jointly | Captures spatiotemporal patterns. SlowFast uses dual pathways at different frame rates | Computationally expensive. 3D convolutions are memory-intensive. Limited temporal range |
| Video Transformers (TimeSformer, ViViT) | Apply self-attention across space and time. Factored attention (spatial-then-temporal) for efficiency | Long-range temporal modeling. State-of-the-art accuracy | Very expensive. Quadratic complexity in sequence length. Requires significant compute for pretraining |
| Video-language models (VideoCLIP, InternVideo) | Align video and text representations in a shared space using contrastive learning | Zero-shot video understanding, open vocabulary recognition | Requires massive pretraining data. Still struggles with temporal reasoning |
Practical approach for most tasks: Sample sparse frames (e.g., 8–16 frames uniformly), process with a pretrained ViT backbone, add a lightweight temporal head (temporal attention or MLP). This balances accuracy and compute for action recognition, video classification, and temporal localization.
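The sparse-sampling recipe above can be sketched as follows; the per-frame features are random stand-ins for a pretrained ViT backbone's outputs, and average pooling stands in for the lightweight temporal head.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=8):
    """Uniformly sample frame indices (centers of equal-length segments)."""
    seg = num_frames / num_samples
    return (np.arange(num_samples) * seg + seg / 2).astype(int)

idx = sample_frame_indices(300, 8)      # 10 s at 30 fps -> 300 frames, keep 8
rng = np.random.default_rng(0)
feats = rng.random((300, 768))          # stand-in for per-frame ViT features
clip_feat = feats[idx].mean(axis=0)     # simplest temporal head: average pooling
```

Replacing the mean with a small temporal-attention layer over the 8 sampled tokens is the usual next step when motion matters.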
Q8: What are vision-language models (VLMs)? Give examples and applications.
VLMs jointly understand images and text, enabling tasks that require both visual and linguistic reasoning.
Key models:
- CLIP (OpenAI, 2021): Contrastive image-text alignment. Zero-shot classification, image retrieval, as backbone for generation models.
- LLaVA (2023): Connects a vision encoder (CLIP ViT) to an LLM (LLaMA/Vicuna) with a simple linear projection. Enables visual question answering, image description, and reasoning about images.
- GPT-4V / GPT-4o (OpenAI): Multimodal LLM that natively processes images alongside text. State-of-the-art on visual reasoning benchmarks.
- Florence-2 (Microsoft, 2024): Unified model for captioning, detection, segmentation, and OCR using a sequence-to-sequence architecture with special tokens for spatial outputs.
Applications:
- Visual question answering (VQA): "What color is the car in the background?"
- Image captioning and description
- Visual grounding: "Point to the person wearing a red hat"
- Document understanding: parsing invoices, receipts, charts
- Robotic instruction following: understanding visual scenes from natural language commands
Interview insight: The trend is toward unified architectures that handle all vision tasks through language. Instead of separate models for detection, segmentation, and captioning, a single VLM can do all of them by formulating outputs as text sequences.
Q9: What is optical flow and how is it used in video analysis?
Optical flow estimates the 2D motion field between consecutive video frames. For each pixel in frame t, it predicts where that pixel moves to in frame t+1, producing a dense displacement map (u, v) per pixel.
Classical methods:
- Lucas-Kanade: Assumes constant motion in a local window. Solves a least-squares problem. Fast but only works for small motions.
- Horn-Schunck: Global method with smoothness regularization. Handles larger motions but slower.
- Farneback: Polynomial expansion method. Good balance of speed and quality. Default in OpenCV.
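The Lucas-Kanade solve can be written out for a single window. Under brightness constancy, each pixel contributes one equation Ix*u + Iy*v = -It, giving an overdetermined least-squares system; the gradients below are synthetic so the true displacement is recoverable exactly.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve the LK least-squares system: rows [Ix, Iy] per pixel, target -It."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (num_pixels, 2)
    b = It.ravel()
    v, *_ = np.linalg.lstsq(A, -b, rcond=None)
    return v                                          # (u, v) for the whole window

rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))                          # synthetic spatial gradients
Iy = rng.normal(size=(5, 5))
true_uv = np.array([1.5, -0.5])
# Brightness constancy: Ix*u + Iy*v + It = 0  =>  It = -(Ix*u + Iy*v)
It = -(Ix * true_uv[0] + Iy * true_uv[1])
uv = lucas_kanade_window(Ix, Iy, It)                  # recovers (1.5, -0.5)
```

The aperture problem appears here as a rank-deficient A (e.g., all gradients pointing one way), which is why LK is run only at textured corners in practice.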
Deep learning methods:
- FlowNet / FlowNet2: First end-to-end CNN for optical flow. Correlation layer computes feature matching.
- RAFT (2020): State-of-the-art. Builds a 4D correlation volume between all pairs of pixels, then iteratively refines flow using a GRU-based update operator. Key insight: iterative refinement allows handling large displacements.
- GMFlow (2022): Uses transformer-based global matching for better long-range correspondence.
Applications: Action recognition (two-stream networks use optical flow as motion input), video stabilization, frame interpolation, video object tracking, autonomous driving (ego-motion estimation), and video compression.
Q10: What is multi-object tracking (MOT) and what are the main approaches?
MOT assigns consistent identity labels to multiple objects across video frames. The challenge is maintaining identities through occlusions, appearance changes, and crowded scenes.
Tracking-by-detection paradigm (dominant approach):
- Detection: Run an object detector (YOLOv8, Faster R-CNN) on each frame independently
- Association: Match detections across frames using appearance features, motion prediction, and spatial proximity
- Track management: Initialize new tracks for unmatched detections, terminate tracks for missing objects, handle re-identification
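The association step can be sketched with plain IoU matching. For clarity this uses greedy matching rather than the Hungarian algorithm that SORT actually uses, and the boxes are made-up coordinates.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedy IoU matching (SORT proper solves this with the Hungarian algorithm)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score >= iou_thresh and ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_dets = [di for di in range(len(detections)) if di not in used_d]
    return matches, unmatched_dets   # unmatched detections seed new tracks

tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]
dets = [(1, 1, 11, 11), (100, 100, 110, 110)]
m, new = associate(tracks, dets)     # m == [(0, 0)]; detection 1 starts a new track
```

Replacing raw IoU with a combined IoU-plus-appearance cost is essentially the jump from SORT to DeepSORT.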
Key methods:
| Method | Approach | When to Use |
|---|---|---|
| SORT | Kalman filter for motion prediction + Hungarian algorithm for assignment using IoU | Simple, fast, works when objects rarely occlude. Good baseline |
| DeepSORT | SORT + deep appearance features (ReID network) for better re-identification after occlusion | Standard choice for most applications. Balances speed and accuracy |
| ByteTrack | Uses low-confidence detections (which SORT discards) in a second association step to recover occluded objects | Crowded scenes where objects are frequently partially occluded |
| BoT-SORT | Combines camera motion compensation, improved Kalman filter, and strong ReID features | State-of-the-art accuracy when compute is available |
Evaluation metrics: MOTA (Multiple Object Tracking Accuracy), IDF1 (ID F1 score — measures identity preservation), HOTA (Higher Order Tracking Accuracy — combines detection and association quality).
Lilly Tech Systems