Practice Questions & Tips
This final lesson brings everything together with rapid-fire questions to test your knowledge, infrastructure design challenges that simulate real interview situations, and strategic tips from successful AI infrastructure interview candidates.
Rapid-Fire Questions
Time yourself: try to answer each in under 60 seconds. These test breadth of knowledge and quick recall — both critical for phone screens and early interview rounds.
| # | Question | Expected Answer (1–2 sentences) |
|---|---|---|
| 1 | What is the difference between HBM and GDDR? | HBM (High Bandwidth Memory) is 3D-stacked memory on the GPU package with massive bandwidth (3+ TB/s on H100). GDDR is traditional memory with lower bandwidth (~1 TB/s) but lower cost. HBM is used in data center GPUs; GDDR in consumer GPUs. |
| 2 | What is a CUDA core vs a Tensor Core? | CUDA cores perform general-purpose floating-point operations (one multiply-add per cycle). Tensor Cores perform matrix multiply-accumulate on small matrix tiles (4x4x4 in the first generation) per cycle, delivering 10–20x higher throughput for the matrix operations used in deep learning. |
| 3 | What is NVLink and what bandwidth does it provide on H100? | NVLink is NVIDIA's GPU-to-GPU interconnect that bypasses PCIe. On H100, NVLink 4.0 provides 900 GB/s total bidirectional bandwidth via 18 links, roughly 7x the bandwidth of PCIe Gen5. |
| 4 | What is the difference between data parallelism and model parallelism? | Data parallelism: replicate model on all GPUs, split data. Model parallelism: split model across GPUs, each GPU holds a portion of parameters. Data parallelism is simpler but requires model to fit on one GPU. |
| 5 | What is AllReduce? | A collective operation where each participant starts with a local value, and all participants end with the global sum/average. In training, it averages gradients across GPUs. Ring AllReduce is the most common implementation. |
| 6 | What is NCCL? | NVIDIA Collective Communications Library. Optimized implementation of AllReduce, AllGather, Broadcast, and other collective operations. Topology-aware: automatically selects optimal algorithms based on NVLink, PCIe, and InfiniBand connectivity. |
| 7 | What is DeepSpeed ZeRO-3? | Partitions model parameters, gradients, and optimizer states across GPUs. Each GPU stores 1/N of everything. AllGathers parameters on-demand during forward/backward pass. Enables training models too large for single-GPU memory. |
| 8 | Why use BF16 instead of FP16 for training? | BF16 has the same exponent range as FP32 (8 exponent bits) so it avoids overflow/underflow issues that require loss scaling in FP16. Slightly lower precision (7 vs 10 mantissa bits) but safer and simpler for training. |
| 9 | What is MIG? | Multi-Instance GPU. Hardware partitioning on A100/H100 that creates up to 7 isolated GPU instances, each with dedicated compute, memory, and memory bandwidth. Used for multi-tenant inference and development environments. |
| 10 | What is GPUDirect RDMA? | Enables network adapters to directly read/write GPU memory without CPU involvement or memory copies. Reduces inter-node communication latency by 3–5x compared to traditional CPU-mediated transfers. |
| 11 | What is InfiniBand NDR bandwidth? | 400 Gbps (50 GB/s) per port. The current standard for AI training clusters. Provides RDMA with sub-microsecond latency, compared to 50–100 microsecond latency for TCP/IP over Ethernet. |
| 12 | What is gradient checkpointing? | Trades compute for memory by recomputing activations during backward pass instead of storing them. Reduces activation memory by ~60% at the cost of ~30% extra compute. Essential for training large models on limited GPU memory. |
| 13 | Name three Kubernetes components needed for GPU workloads. | NVIDIA Device Plugin (exposes GPUs to scheduler), GPU Operator (manages driver/toolkit stack), and a job scheduling system like Kueue or Volcano (gang scheduling and fair queuing). |
| 14 | What is the pipeline bubble in pipeline parallelism? | Idle time when GPUs wait for their pipeline stage's input. With P stages and M micro-batches, bubble ratio = (P-1)/(M+P-1). Mitigated by using many micro-batches (M >> P). |
| 15 | How do you estimate the cost of training a 7B model on 1T tokens? | ~6 x 7B x 1T = 42 ZettaFLOPs. H100 sustains ~500 TFLOPS. ~23,300 GPU-hours. At $4.50/hr on-demand: ~$105K. At $1.50/hr spot: ~$35K. On 64 GPUs: ~15 days wall-clock. |
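The estimation method behind question 15 is worth internalizing as a reusable formula. The sketch below is illustrative (the function name and defaults are ours, not from any library), applying the ~6·N·D FLOPs rule for training N parameters on D tokens:

```python
def training_estimate(params, tokens, sustained_flops=500e12,
                      price_per_hour=4.50, n_gpus=64):
    """Back-of-envelope training estimate using the ~6*N*D FLOPs rule
    (N params, D tokens, forward + backward pass)."""
    total_flops = 6 * params * tokens
    gpu_hours = total_flops / sustained_flops / 3600
    return {
        "gpu_hours": gpu_hours,
        "cost_usd": gpu_hours * price_per_hour,
        "wall_clock_days": gpu_hours / n_gpus / 24,
    }

# 7B params on 1T tokens: ~23,300 GPU-hours, ~$105K on-demand,
# ~15 days of wall-clock time on 64 GPUs
est = training_estimate(7e9, 1e12)
```

Adjusting `sustained_flops` is where judgment shows: quoting peak datasheet TFLOPS instead of a realistic sustained figure (~50% MFU) understates cost by 2x.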
Infrastructure Design Questions
These simulate real interview scenarios where you must design infrastructure end-to-end. Practice explaining your reasoning process, not just the answer.
Design 1: GPU Cluster for LLM Training
Prompt: "Design the infrastructure to train a 30B-parameter model on 2T tokens."
Model Answer:
- Compute estimate: 6 x 30B x 2T = 360 ZettaFLOPs. H100 at ~500 TFLOPS sustained = 7.2 x 10^8 GPU-seconds, or ~200,000 GPU-hours. On 256 GPUs (32 DGX H100 nodes): ~33 days. On 512 GPUs: ~16 days.
- Hardware: 64 DGX H100 nodes (512 GPUs). Each node: 8x H100 80GB, NVSwitch, 8x 400G InfiniBand NDR. Total: ~41 TB HBM, 204,800 Gbps aggregate network bandwidth (512 ports x 400 Gbps).
- Parallelism strategy: TP=4 (within NVLink), PP=2 (across node pairs), DP=64 (across all groups). 30B model fits in 4-way TP with mixed precision.
- Network: Fat-tree InfiniBand with NDR switches. Rail-optimized topology. Non-blocking within the training partition. Dual-rail for redundancy.
- Storage: 2 PB Lustre for training data (sequential read at 500+ GB/s aggregate). S3 for checkpoints (420 GB per checkpoint, every 30 min). NVMe on each node for local caching.
- Fault tolerance: Async checkpointing every 30 min. TorchElastic for automatic restart. Pre-flight GPU health checks (ECC, NVLink). Spare nodes (4 standby) for hot replacement. Dual-rail InfiniBand for network redundancy.
- Cost estimate: ~200,000 GPU-hours x $4.50/hr = ~$900K on-demand. With 1-year reserved (~$2/hr): ~$400K. With spot at ~$1.50/hr: ~$300K (plus checkpoint/restart overhead).
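The claim that a 30B model fits in 4-way TP deserves a quick memory check. The helper below is our own sketch, assuming BF16 weights and gradients plus FP32 master weights and Adam moments (~16 bytes/param total), with optimizer states additionally sharded across data-parallel ranks ZeRO-1-style; without that extra sharding, the full training state would not fit an 80 GB GPU at TP=4:

```python
def per_gpu_training_memory_gb(params, tp=4, dp=64, shard_optimizer=True):
    """Per-GPU memory for weights + grads + Adam states (activations
    excluded). Mixed precision: BF16 weights and grads (2 bytes each),
    FP32 master weights + Adam m/v (12 bytes/param)."""
    weights = 2 * params / tp
    grads = 2 * params / tp
    optimizer = 12 * params / tp
    if shard_optimizer:            # ZeRO-1: split states across DP ranks
        optimizer /= dp
    return (weights + grads + optimizer) / 1024**3

# 30B, TP=4, DP=64: ~29 GB/GPU with ZeRO-1 vs ~112 GB without --
# the former leaves headroom on an 80 GB H100 for activations and
# communication buffers; the latter does not fit at all
```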
Design 2: Multi-Tenant GPU Platform
Prompt: "Design a shared GPU platform for five teams with a 200-GPU budget."
Model Answer:
- Orchestration: Kubernetes with GPU Operator and Kueue for job scheduling. ClusterQueues per team with guaranteed quotas and borrowing.
- Resource allocation: Team quotas based on budget: Team A (40 GPUs guaranteed), Team B (30), Team C (50), Team D (40), Team E (40). Borrowing enabled between teams in same cohort. Priority classes: production > time-sensitive > best-effort.
- GPU types: Label nodes with GPU type (A100, H100). Teams request specific types via nodeSelector. H100s reserved for large-scale training; A100s for development and inference.
- Self-service: Internal portal where researchers submit training jobs by specifying model config, data path, and GPU count. Portal generates Kubernetes manifests from templates. JupyterHub for interactive development with GPU allocation.
- Cost tracking: DCGM metrics tagged with team/project labels. Grafana dashboards showing per-team GPU-hours consumed, utilization rates, and cost. Monthly chargeback reports.
- Efficiency: MIG or time-slicing for development workloads (do not waste full A100s on Jupyter notebooks). Auto-shutdown idle notebooks after 2 hours. Preemptible priority class for experiments.
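The chargeback mechanics above can be sketched in a few lines. The usage records and the $2.10/hr internal rate here are hypothetical stand-ins for what a DCGM/Prometheus export might aggregate to over a month:

```python
def chargeback(records, rate_per_gpu_hour=2.10):
    """Monthly per-team cost and utilization from labeled GPU-hour totals.
    Each record: (team label, GPU-hours consumed, GPU-hours allocated)."""
    return {
        team: {
            "cost_usd": round(used * rate_per_gpu_hour, 2),
            "utilization": round(used / allocated, 2),
        }
        for team, used, allocated in records
    }

# Hypothetical monthly totals per team label
usage = [
    ("team-a", 18_000, 30_000),
    ("team-b", 9_500, 21_600),
    ("team-c", 31_000, 36_000),
]
report = chargeback(usage)
# team-a lands at $37,800 for 60% utilization of its allocation --
# exactly the kind of signal that drives quota right-sizing
```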
Interview Strategy Tips
Know Your Numbers
Memorize key specs: H100 HBM (80 GB), H100 bandwidth (3.35 TB/s), NVLink 4.0 (900 GB/s), InfiniBand NDR (400 Gbps), PCIe Gen5 (128 GB/s). Being able to do back-of-envelope calculations with these numbers shows deep understanding, not just memorization.
Draw Before You Talk
For system design questions, sketch the architecture first: GPU nodes, NVLink within nodes, InfiniBand between nodes, storage systems, monitoring. Label each component with specific technologies. This demonstrates systems thinking and helps structure your answer.
Think About Failure Modes
For every component you mention, ask yourself: "What happens when this fails?" GPU dies mid-training? InfiniBand link goes down? Storage becomes slow? Showing that you design for failure is what distinguishes senior from junior candidates.
Cost-Aware Architecture
Always include cost analysis. "We need 512 H100s for 60 days at $4.50/hr — about 737,000 GPU-hours, or ~$3.3M. We can reduce that to ~$1.5M with 3-year reserved instances, or ~$1.1M using spot with checkpointing." This shows you think like an engineering leader, not just an engineer.
Show Debugging Methodology
When asked about troubleshooting, show a systematic approach: (1) What is the symptom? (2) What metrics do I check? (3) What tools do I use? (4) What are the likely causes? (5) How do I verify the root cause? (6) How do I prevent recurrence? This methodology matters more than specific answers.
Discuss Trade-offs Explicitly
"I would use InfiniBand for inter-node training communication because it provides RDMA with sub-microsecond latency. The alternative is RoCE v2, which is 30% cheaper but has higher tail latency due to Ethernet congestion control. For a training run costing $10M in GPU time, the 10% efficiency improvement from IB easily justifies the network cost."
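The trade-off in that answer is easy to quantify on the spot. A sketch with assumed numbers — the $500K network premium is our hypothetical figure, not from the text:

```python
def interconnect_tradeoff(gpu_spend, efficiency_gain, network_premium):
    """Net savings from a faster fabric: GPU time recovered by the
    efficiency gain, minus the extra cost of the network itself."""
    return gpu_spend * efficiency_gain - network_premium

# $10M in GPU time, 10% efficiency gain from IB over RoCE,
# assumed $500K premium for the IB fabric
savings = interconnect_tradeoff(10e6, 0.10, 0.5e6)
# positive net savings: the faster interconnect pays for itself
```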
Frequently Asked Questions
Do I need to know CUDA programming for an AI infrastructure interview?
You need to understand CUDA concepts (kernels, thread blocks, memory hierarchy, streams) but you rarely need to write CUDA code. You should be able to explain why a training job is slow by referencing GPU profiler output, understand CUDA error codes and what they mean, and debug GPU memory issues. Think of it like understanding SQL for a database engineer — you need to read and reason about it, even if you are not writing complex queries daily.
How much Linux systems knowledge is expected?
Strong Linux fundamentals are expected: process management, memory management, networking (TCP/IP, routing, firewalls), file systems, performance tools (perf, strace, iostat, sar). You should know how to debug a slow training job at the OS level: is it CPU-bound? Memory pressure? I/O wait? Network latency? Familiarity with NUMA architecture is important because GPU placement relative to CPU and memory nodes affects performance. Know how to use numactl, lscpu, and nvidia-smi topo to understand your system's topology.
Should I focus on one cloud provider or learn all three?
Know one cloud deeply (the one your target company uses) and understand the others at a high level. The concepts are portable: instance types map to equivalent configurations across clouds, networking abstractions are similar, and Kubernetes works the same everywhere. If unsure, focus on AWS (largest market share, most GPU instance variety, Trainium/Inferentia differentiation). But frame your answers in terms of concepts first: "I would use a high-bandwidth GPU interconnect like InfiniBand or EFA" rather than "I would use p5.48xlarge".
What if I come from a traditional infrastructure/DevOps background?
Your infrastructure skills are 50% of what is needed. Focus your preparation on the AI-specific 50%: GPU architecture and memory, distributed training patterns, NCCL and collective operations, InfiniBand/RDMA, and ML-specific Kubernetes patterns (GPU scheduling, gang scheduling, job queuing). Build a project: set up a multi-GPU training job on Kubernetes, profile it, optimize data loading, and implement checkpointing. Even a 2-GPU setup demonstrates the concepts at a smaller scale.
How important is distributed systems theory (CAP theorem, consensus, etc.)?
Moderately important. You should understand fundamental concepts: consistency vs availability trade-offs, consensus protocols (used in checkpointing coordination), fault tolerance patterns (retry, failover, replication). You do not need to implement Raft or Paxos. More practically, understand how distributed training achieves consistency (synchronous AllReduce = strong consistency of model weights), fault tolerance (checkpointing + elastic training), and partition tolerance (training stalls on network partition — it chooses consistency over availability).
What are the most common reasons AI infrastructure candidates fail interviews?
Based on interviewer feedback: (1) Surface-level GPU knowledge — can say "H100 has 80 GB HBM" but cannot explain why memory bandwidth matters more than TFLOPS for large model training. (2) No hands-on experience — can discuss distributed training theoretically but has never debugged an NCCL timeout or profiled GPU utilization. (3) Cannot estimate costs — proposes a 1,000-GPU cluster without estimating the $5M+ price tag. (4) Ignores failure modes — designs infrastructure for the happy path without discussing what happens when GPUs fail, networks partition, or storage fills up. (5) Cloud-only thinking — cannot reason about bare-metal, InfiniBand, or hardware-level optimizations.
How do I prepare for the system design round specifically?
Practice designing these systems from scratch: (1) GPU cluster for training a 100B model. (2) Model serving platform with autoscaling and A/B testing. (3) Multi-tenant ML platform for 10 teams sharing 500 GPUs. (4) Data pipeline for a 10 PB training dataset. For each, cover: hardware selection with justification, networking topology, storage architecture, scheduling and resource management, fault tolerance, monitoring, and cost estimate. Practice whiteboarding these designs in 35 minutes (the actual time you will have after clarifying questions).
Are certifications like CKA (Certified Kubernetes Administrator) useful?
CKA is moderately useful — it demonstrates Kubernetes operational competence and helps get past resume screening. However, it covers zero GPU-specific content. More impactful: contribute to open-source projects in the AI infrastructure space (NVIDIA GPU Operator, Kueue, vLLM, DeepSpeed), publish benchmarks comparing GPU configurations, or write blog posts about production GPU cluster operations. These demonstrate hands-on experience that certifications cannot.
Final Checklist
- Explain GPU architecture: SMs, CUDA cores, Tensor Cores, memory hierarchy, NVLink, NVSwitch
- Calculate GPU memory requirements for training a given model size (params + optimizer + gradients + activations)
- Describe data parallelism, tensor parallelism, pipeline parallelism, and when to use each
- Explain AllReduce (ring algorithm), NCCL, and how to debug NCCL timeouts
- Compare DeepSpeed ZeRO stages (1, 2, 3) and FSDP with trade-offs
- Configure GPU scheduling on Kubernetes: device plugin, GPU operator, gang scheduling
- Compare Volcano vs Kueue for ML job scheduling with specific use cases
- Design a model serving architecture with autoscaling, load balancing, and versioning
- Compare SageMaker, Vertex AI, and Azure ML with strengths and weaknesses
- Explain InfiniBand, RDMA, GPUDirect, and why they matter for distributed training
- Estimate training costs for a given model size, dataset, and GPU configuration
- Design checkpoint storage strategy with hot/warm/cold tiers
- Tell 3 incident stories about GPU cluster failures with detection, mitigation, and prevention
- Whiteboard a complete GPU cluster design for LLM training in under 35 minutes
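Since the checklist asks you to explain the ring algorithm, here is a minimal pure-Python simulation of it (illustrative only — the real implementation lives in NCCL). With n ranks, each rank sends one chunk per step: n-1 reduce-scatter steps, then n-1 all-gather steps, so per-rank traffic is 2(n-1)·(size/n) regardless of n:

```python
def ring_allreduce(data):
    """Simulate ring AllReduce (sum). data[i] is rank i's vector; its
    length must be divisible by the number of ranks n."""
    n = len(data)
    chunk = len(data[0]) // n
    buf = [list(v) for v in data]

    def payload(rank, c):
        return buf[rank][c * chunk:(c + 1) * chunk]   # copy of chunk c

    def deliver(rank, c, values, combine):
        for k, v in enumerate(values):
            i = c * chunk + k
            buf[rank][i] = combine(buf[rank][i], v)

    # Reduce-scatter: at step s, rank i sends chunk (i - s) % n to its
    # right neighbor; after n-1 steps, rank i holds the complete sum
    # for chunk (i + 1) % n. Snapshot all sends before delivering, as
    # the real steps happen simultaneously.
    for s in range(n - 1):
        sends = [((i - s) % n, payload(i, (i - s) % n)) for i in range(n)]
        for i, (c, vals) in enumerate(sends):
            deliver((i + 1) % n, c, vals, lambda old, v: old + v)

    # All-gather: circulate the completed chunks around the ring,
    # overwriting instead of summing.
    for s in range(n - 1):
        sends = [((i + 1 - s) % n, payload(i, (i + 1 - s) % n))
                 for i in range(n)]
        for i, (c, vals) in enumerate(sends):
            deliver((i + 1) % n, c, vals, lambda old, v: v)

    return buf

ranks = [[1, 2, 3, 4], [10, 20, 30, 40],
         [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
# every rank ends with the element-wise sum [1111, 2222, 3333, 4444]
result = ring_allreduce(ranks)
```

Walking an interviewer through the two phases, and why bandwidth per rank is independent of cluster size, is exactly the depth the checklist item is probing for.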
Lilly Tech Systems