Advanced CV Topics
These 10 questions cover cutting-edge computer vision topics that senior and research-oriented roles test heavily. Vision transformers, self-supervised learning, GANs, 3D vision, and video understanding are increasingly expected knowledge in 2024–2026 interviews.
Q1: Explain the Vision Transformer (ViT). How does it differ from CNNs?
ViT (Dosovitskiy et al., 2020) applies the standard transformer encoder directly to image patches, treating an image as a sequence of patch tokens and classification as a prediction from that sequence.
Architecture:
- Patch embedding: Split image into fixed-size patches (typically 16x16). Flatten each patch and project through a linear layer to get patch tokens. A 224x224 image yields 196 tokens (14x14 grid).
- Position embeddings: Learnable 1D positional embeddings added to each patch token (the model learns spatial relationships).
- [CLS] token: A learnable classification token prepended to the sequence. Its output serves as the image representation for classification.
- Transformer encoder: Standard multi-head self-attention + FFN blocks (12 layers for ViT-B, 24 for ViT-L).
- Classification head: MLP head on the [CLS] token output.
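The patch-embedding step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real ViT: the projection matrix, [CLS] token, and position embeddings are random stand-ins for what would be learned parameters.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened patch vectors."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    patches = img.reshape(gh, patch, gw, patch, C)        # split rows and columns
    patches = patches.transpose(0, 2, 1, 3, 4)            # (gh, gw, patch, patch, C)
    return patches.reshape(gh * gw, patch * patch * C)    # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3)).astype(np.float32)
tokens = patchify(img)                                    # (196, 768): the 14x14 grid

d_model = 768
W_proj = rng.normal(0, 0.02, (tokens.shape[1], d_model))  # stand-in for the learned linear projection
x = tokens @ W_proj                                       # (196, d_model) patch embeddings

cls = np.zeros((1, d_model))                              # [CLS] token (learnable in a real ViT)
pos = rng.normal(0, 0.02, (197, d_model))                 # learnable 1D position embeddings
seq = np.concatenate([cls, x], axis=0) + pos              # transformer input: (197, d_model)
```

Note that a 16x16x3 patch flattens to exactly 768 values, which is why ViT-B's hidden size feels natural here.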
Key differences from CNNs:
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong: locality (convolution) + translation equivariance | Weak: only sequence order. Must learn spatial relationships from data |
| Receptive field | Grows gradually through depth | Global from layer 1 (self-attention sees all patches) |
| Data efficiency | Better with limited data due to inductive biases | Needs large datasets (ImageNet-21K or JFT-300M) or strong augmentation |
| Compute scaling | Plateaus at large scale | Scales better: performance improves consistently with more data and compute |
Modern variants: DeiT (data-efficient training with distillation), Swin Transformer (shifted window attention for efficiency), BEiT (BERT-style pretraining for vision).
Q2: What is self-supervised learning in computer vision? Compare MAE and DINO.
Self-supervised learning (SSL) learns visual representations from unlabeled images by creating pretext tasks. The model learns features that transfer well to downstream tasks without requiring manual labels.
MAE (Masked Autoencoder, He et al., 2022):
- Approach: Mask 75% of image patches randomly, encode only visible patches with ViT, then decode to reconstruct masked patches
- Why 75% masking works: Images have high spatial redundancy (unlike text). Masking most patches forces the model to learn semantic features, not just interpolate from neighbors
- Efficiency: Only 25% of patches go through the encoder, making pretraining 3x faster than standard ViT training
- Strength: Excellent for fine-tuning on downstream tasks. Simple and scalable
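MAE's random masking step can be sketched as follows. This is an illustrative NumPy version of the idea (keep a random 25% of tokens, mark the rest for reconstruction), not the authors' implementation.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens, as in MAE-style 75% masking."""
    rng = rng or np.random.default_rng(0)
    N = tokens.shape[0]
    n_keep = int(N * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(N)[:n_keep])  # indices of visible patches
    mask = np.ones(N, dtype=bool)
    mask[keep_idx] = False                           # True = masked (to be reconstructed)
    return tokens[keep_idx], keep_idx, mask

tokens = np.random.default_rng(1).random((196, 768))  # 196 patch tokens from a 224x224 image
visible, keep_idx, mask = random_masking(tokens)
# Only the 49 visible tokens go through the encoder; 147 are reconstructed by the decoder
```

The efficiency claim follows directly: the expensive encoder sees only `visible`, a quarter of the sequence.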
DINO / DINOv2 (Meta, 2021/2023):
- Approach: Self-distillation with no labels. Student network sees local crops, teacher sees global crops. Student learns to match teacher outputs. Teacher is an exponential moving average (EMA) of the student
- Key property: Learns features that contain explicit object boundaries. Attention maps from DINO naturally segment objects without any segmentation training
- DINOv2: Scaled up with curated data (LVD-142M dataset), combining self-distillation with masked image modeling. Produces universal features that work across classification, segmentation, depth estimation, and retrieval without fine-tuning
When to use which: MAE is best when you will fine-tune on your specific task. DINOv2 is best for frozen feature extraction or when you need universal features across multiple tasks.
Q3: Explain how GANs work. What are common failure modes?
GANs consist of two networks trained adversarially:
- Generator (G): Maps random noise z ~ N(0, I) to synthetic images. Goal: produce images indistinguishable from real data.
- Discriminator (D): Binary classifier that distinguishes real images from generated ones. Goal: correctly identify fakes.
Training objective: min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]. The generator and discriminator play a minimax game. At the theoretical equilibrium, G matches the data distribution and D outputs 0.5 for every input.
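The objective can be made concrete by evaluating both losses on toy discriminator logits (the numbers below are made up for illustration). The snippet also shows the non-saturating generator loss, a common fix for the vanishing-gradient failure mode discussed below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy discriminator logits on a batch of real and generated images
d_real = np.array([2.0, 1.5, 3.0])
d_fake = np.array([-1.0, -2.0, 0.5])

# Discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))]; we minimize the negative
d_loss = -(np.log(sigmoid(d_real)).mean() + np.log(1 - sigmoid(d_fake)).mean())

# Minimax generator loss: minimize E[log(1 - D(G(z)))] -- saturates when D is confident
g_loss_minimax = np.log(1 - sigmoid(d_fake)).mean()

# Non-saturating variant: maximize E[log D(G(z))] -- stronger gradients early in training
g_loss_ns = -np.log(sigmoid(d_fake)).mean()
```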
Common failure modes:
| Problem | Symptoms | Solutions |
|---|---|---|
| Mode collapse | Generator produces only a few distinct samples regardless of input noise | Minibatch discrimination, unrolled GAN, progressive growing, Wasserstein loss |
| Training instability | Loss oscillates wildly, D wins too easily, or both losses plateau | Spectral normalization, gradient penalty (WGAN-GP), two-timescale update rule (TTUR) |
| Vanishing gradients | D is too good → G gets zero gradient signal. No learning occurs | Use Wasserstein distance (WGAN), non-saturating GAN loss, label smoothing |
| Evaluation difficulty | No single loss value indicates quality. Low G loss does not mean good images | Use FID (Frechet Inception Distance), IS (Inception Score), visual inspection |
Key GAN variants for CV: StyleGAN (high-quality face generation with style control), Pix2Pix (paired image translation), CycleGAN (unpaired translation), GauGAN (semantic layout to photo). Note that diffusion models have largely superseded GANs for image generation since 2022.
Q4: How do diffusion models work? Why have they surpassed GANs?
Diffusion models learn to generate images by reversing a gradual noising process:
- Forward process (fixed): Gradually add Gaussian noise to a clean image over T timesteps until it becomes pure noise. Each step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise
- Reverse process (learned): A neural network (typically a U-Net or Transformer) learns to predict and remove the noise at each step, recovering the original image from noise
- Training: Sample a random timestep t, add noise to a clean image to get x_t, and train the network to predict the added noise. Simple MSE loss: L = ||noise - noise_pred(x_t, t)||^2
- Inference: Start from pure noise and iteratively denoise. Full DDPM sampling takes all T steps; DDIM and similar samplers typically need only 20–50
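The forward process above can be simulated directly in NumPy. The linear beta schedule is an assumption borrowed from DDPM; the closed form via the cumulative product of alphas is what makes training efficient (you can jump to any timestep without looping).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (DDPM-style assumption)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative product used in the closed form

x0 = rng.random(64)                  # toy "clean image" (flattened)

# Stepwise forward process: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * noise
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(1 - alphas[t]) * rng.normal(size=x.shape)

# Equivalent closed form jumps straight to any timestep t in one shot:
t = 500
eps = rng.normal(size=x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
# Training would regress eps from (x_t, t): loss = ||eps - eps_pred(x_t, t)||^2
```

By the final step alpha_bar is vanishingly small, so x_T is essentially pure Gaussian noise regardless of x0.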
Why they surpassed GANs:
- Training stability: Simple MSE loss. No adversarial training, no mode collapse, no training tricks needed
- Mode coverage: Diffusion models cover the full data distribution, generating diverse outputs. GANs tend toward mode collapse
- Quality: Achieve lower FID scores than GANs on most benchmarks
- Controllability: Classifier-free guidance enables precise control over generation. Text conditioning (Stable Diffusion, DALL-E) is natural with cross-attention
Key models: DDPM (foundational), Stable Diffusion / SDXL (latent diffusion in compressed space for efficiency), DALL-E 3, Imagen, DiT (Diffusion Transformer — replaces U-Net with transformer, used in Sora).
Limitation: Slower than GANs at inference (multiple denoising steps vs single forward pass). Consistency models and distillation (LCM) address this, achieving 1–4 step generation.
Q5: What is contrastive learning in vision? Explain CLIP.
Contrastive learning trains models by pulling together representations of similar pairs and pushing apart dissimilar pairs in embedding space.
CLIP (Contrastive Language-Image Pretraining, OpenAI 2021):
- Architecture: Two encoders — an image encoder (ViT or ResNet) and a text encoder (Transformer). Both produce embeddings in a shared space.
- Training: Given a batch of N (image, text) pairs, CLIP maximizes cosine similarity between matched pairs and minimizes it for unmatched pairs. InfoNCE loss with temperature scaling.
- Data: 400M image-text pairs scraped from the internet (WIT dataset).
- Zero-shot classification: To classify an image, compute cosine similarity between the image embedding and text embeddings of class descriptions (e.g., "a photo of a cat", "a photo of a dog"). No training on target classes needed.
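The zero-shot procedure reduces to a cosine-similarity softmax. The sketch below uses random unit vectors as stand-ins for real encoder outputs (a trained CLIP would produce meaningful embeddings); the temperature value of 100 mirrors CLIP's learned logit scale but is an assumption here.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 512
# Stand-ins for the image encoder and text encoder outputs
img_emb = normalize(rng.normal(size=(1, d)))
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
txt_emb = normalize(rng.normal(size=(len(prompts), d)))

# Zero-shot classification: temperature-scaled cosine similarity, then softmax
logits = 100.0 * (img_emb @ txt_emb.T)
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
pred = prompts[int(np.argmax(probs))]
```

No gradient step touches the target classes; swapping in new prompt strings changes the label set for free.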
Why CLIP is significant:
- Zero-shot transfer: Competitive with supervised ResNet-50 on ImageNet without seeing any ImageNet training data
- Open vocabulary: Can recognize any concept describable in text, not limited to a fixed label set
- Foundation for multimodal AI: CLIP embeddings are used in Stable Diffusion (image generation), LLaVA (vision-language models), and many retrieval systems
Limitations: Struggles with compositional reasoning ("a red cube on a blue sphere"), counting, spatial relationships, and fine-grained recognition without additional training.
Q6: What is 3D computer vision? Name the main tasks and approaches.
3D computer vision processes and understands three-dimensional structure from visual data. It is critical for autonomous driving, robotics, AR/VR, and manufacturing.
Main tasks:
| Task | Input | Output | Key Methods |
|---|---|---|---|
| Monocular Depth Estimation | Single RGB image | Dense depth map | MiDaS, DPT, Depth Anything. Use relative depth cues (texture gradients, occlusion, perspective) |
| Stereo Matching | Stereo image pair | Disparity map → depth | RAFT-Stereo, LEAStereo. Computes per-pixel correspondence between left and right views |
| 3D Object Detection | LiDAR point cloud and/or images | 3D bounding boxes (x,y,z,w,h,l,yaw) | PointPillars, CenterPoint, BEVFormer. LiDAR-camera fusion is state-of-the-art |
| Point Cloud Processing | 3D point cloud | Classification, segmentation | PointNet, PointNet++, Point Transformer. Key challenge: unordered, irregular data |
| Neural Radiance Fields (NeRF) | Multiple posed images | Novel view synthesis | NeRF, Instant-NGP, 3D Gaussian Splatting. Represent scenes as continuous volumetric functions |
| SLAM | Video stream | Camera trajectory + 3D map | ORB-SLAM3, DROID-SLAM. Simultaneous localization and mapping for robotics/AR |
3D Gaussian Splatting (2023): Represents scenes as millions of 3D Gaussians with learned positions, covariances, colors, and opacities. Renders via splatting (projecting Gaussians to 2D). 100x faster than NeRF for rendering, enabling real-time novel view synthesis.
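The "disparity map → depth" step in the stereo row of the table is worth knowing in closed form: for a rectified pair, depth = focal_length * baseline / disparity. The focal length and baseline below are hypothetical values for illustration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Rectified-stereo relation: depth = f * B / d (f in pixels, B in meters)."""
    return focal_px * baseline_m / np.maximum(disparity, eps)

disp = np.array([[64.0, 32.0],
                 [16.0,  8.0]])                  # toy disparity map in pixels
depth = disparity_to_depth(disp, focal_px=800.0, baseline_m=0.12)
# f * B = 96, so depths are [[1.5, 3.0], [6.0, 12.0]] meters
```

The inverse relationship explains why stereo depth error grows quadratically with distance: at large depths a one-pixel disparity error shifts the estimate by many meters.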
Q7: How do video understanding models work? Compare frame-level vs temporal approaches.
Video understanding requires processing spatial (per-frame) and temporal (across-frame) information. The key challenge is computational cost: a 10-second video at 30fps has 300 frames.
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Frame-level (2D CNN + pooling) | Process each frame independently with a 2D CNN, aggregate predictions (average, max, or attention pooling) | Simple, can use pretrained ImageNet models, fast | Ignores temporal relationships. Cannot understand motion or actions that span frames |
| 3D CNNs (C3D, I3D, SlowFast) | Use 3D convolutions (spatial + temporal) to process video clips jointly | Captures spatiotemporal patterns. SlowFast uses dual pathways at different frame rates | Computationally expensive. 3D convolutions are memory-intensive. Limited temporal range |
| Video Transformers (TimeSformer, ViViT) | Apply self-attention across space and time. Factored attention (spatial-then-temporal) for efficiency | Long-range temporal modeling. State-of-the-art accuracy | Very expensive. Quadratic complexity in sequence length. Requires significant compute for pretraining |
| Video-language models (VideoCLIP, InternVideo) | Align video and text representations in a shared space using contrastive learning | Zero-shot video understanding, open vocabulary recognition | Requires massive pretraining data. Still struggles with temporal reasoning |
Practical approach for most tasks: Sample sparse frames (e.g., 8–16 frames uniformly), process with a pretrained ViT backbone, add a lightweight temporal head (temporal attention or MLP). This balances accuracy and compute for action recognition, video classification, and temporal localization.
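The sparse-sampling recipe above can be sketched as follows; the per-frame features are random stand-ins for a pretrained ViT backbone's outputs, and average pooling stands in for the lightweight temporal head.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=8):
    """Uniformly sample frame indices (centers of equal-length segments)."""
    seg = num_frames / num_samples
    return (np.arange(num_samples) * seg + seg / 2).astype(int)

idx = sample_frame_indices(300, 8)      # 10 s at 30 fps -> 300 frames, keep 8
rng = np.random.default_rng(0)
feats = rng.random((300, 768))          # stand-in for per-frame ViT features
clip_feat = feats[idx].mean(axis=0)     # simplest temporal head: average pooling
```

Replacing the mean with a small temporal-attention layer over the 8 sampled tokens is the usual next step when motion matters.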
Q8: What are vision-language models (VLMs)? Give examples and applications.
VLMs jointly understand images and text, enabling tasks that require both visual and linguistic reasoning.
Key models:
- CLIP (OpenAI, 2021): Contrastive image-text alignment. Zero-shot classification, image retrieval, as backbone for generation models.
- LLaVA (2023): Connects a vision encoder (CLIP ViT) to an LLM (LLaMA/Vicuna) with a simple linear projection. Enables visual question answering, image description, and reasoning about images.
- GPT-4V / GPT-4o (OpenAI): Multimodal LLM that natively processes images alongside text. State-of-the-art on visual reasoning benchmarks.
- Florence-2 (Microsoft, 2024): Unified model for captioning, detection, segmentation, and OCR using a sequence-to-sequence architecture with special tokens for spatial outputs.
Applications:
- Visual question answering (VQA): "What color is the car in the background?"
- Image captioning and description
- Visual grounding: "Point to the person wearing a red hat"
- Document understanding: parsing invoices, receipts, charts
- Robotic instruction following: understanding visual scenes from natural language commands
Interview insight: The trend is toward unified architectures that handle all vision tasks through language. Instead of separate models for detection, segmentation, and captioning, a single VLM can do all of them by formulating outputs as text sequences.
Q9: What is optical flow and how is it used in video analysis?
Optical flow estimates the 2D motion field between consecutive video frames. For each pixel in frame t, it predicts where that pixel moves to in frame t+1, producing a dense displacement map (u, v) per pixel.
Classical methods:
- Lucas-Kanade: Assumes constant motion in a local window. Solves a least-squares problem. Fast but only works for small motions.
- Horn-Schunck: Global method with smoothness regularization. Handles larger motions but slower.
- Farneback: Polynomial expansion method. Good balance of speed and quality. Default in OpenCV.
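The Lucas-Kanade solve can be written out for a single window. Under brightness constancy, each pixel contributes one equation Ix*u + Iy*v = -It, giving an overdetermined least-squares system; the gradients below are synthetic so the true displacement is recoverable exactly.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve the LK least-squares system: rows [Ix, Iy] per pixel, target -It."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (num_pixels, 2)
    b = It.ravel()
    v, *_ = np.linalg.lstsq(A, -b, rcond=None)
    return v                                          # (u, v) for the whole window

rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))                          # synthetic spatial gradients
Iy = rng.normal(size=(5, 5))
true_uv = np.array([1.5, -0.5])
# Brightness constancy: Ix*u + Iy*v + It = 0  =>  It = -(Ix*u + Iy*v)
It = -(Ix * true_uv[0] + Iy * true_uv[1])
uv = lucas_kanade_window(Ix, Iy, It)                  # recovers (1.5, -0.5)
```

The aperture problem appears here as a rank-deficient A (e.g., all gradients pointing one way), which is why LK is run only at textured corners in practice.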
Deep learning methods:
- FlowNet / FlowNet2: First end-to-end CNN for optical flow. Correlation layer computes feature matching.
- RAFT (2020): State-of-the-art. Builds a 4D correlation volume between all pairs of pixels, then iteratively refines flow using a GRU-based update operator. Key insight: iterative refinement allows handling large displacements.
- GMFlow (2022): Uses transformer-based global matching for better long-range correspondence.
Applications: Action recognition (two-stream networks use optical flow as motion input), video stabilization, frame interpolation, video object tracking, autonomous driving (ego-motion estimation), and video compression.
Q10: What is multi-object tracking (MOT) and what are the main approaches?
MOT assigns consistent identity labels to multiple objects across video frames. The challenge is maintaining identities through occlusions, appearance changes, and crowded scenes.
Tracking-by-detection paradigm (dominant approach):
- Detection: Run an object detector (YOLOv8, Faster R-CNN) on each frame independently
- Association: Match detections across frames using appearance features, motion prediction, and spatial proximity
- Track management: Initialize new tracks for unmatched detections, terminate tracks for missing objects, handle re-identification
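The association step can be sketched with plain IoU matching. For clarity this uses greedy matching rather than the Hungarian algorithm that SORT actually uses, and the boxes are made-up coordinates.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedy IoU matching (SORT proper solves this with the Hungarian algorithm)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score >= iou_thresh and ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_dets = [di for di in range(len(detections)) if di not in used_d]
    return matches, unmatched_dets   # unmatched detections seed new tracks

tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]
dets = [(1, 1, 11, 11), (100, 100, 110, 110)]
m, new = associate(tracks, dets)     # m == [(0, 0)]; detection 1 starts a new track
```

Replacing raw IoU with a combined IoU-plus-appearance cost is essentially the jump from SORT to DeepSORT.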
Key methods:
| Method | Approach | When to Use |
|---|---|---|
| SORT | Kalman filter for motion prediction + Hungarian algorithm for assignment using IoU | Simple, fast, works when objects rarely occlude. Good baseline |
| DeepSORT | SORT + deep appearance features (ReID network) for better re-identification after occlusion | Standard choice for most applications. Balances speed and accuracy |
| ByteTrack | Uses low-confidence detections (which SORT discards) in a second association step to recover occluded objects | Crowded scenes where objects are frequently partially occluded |
| BoT-SORT | Combines camera motion compensation, improved Kalman filter, and strong ReID features | State-of-the-art accuracy when compute is available |
Evaluation metrics: MOTA (Multiple Object Tracking Accuracy), IDF1 (ID F1 score — measures identity preservation), HOTA (Higher Order Tracking Accuracy — combines detection and association quality).
Lilly Tech Systems