Intermediate
NVIDIA AI Hardware
NVIDIA dominates the AI accelerator market with GPUs ranging from consumer GeForce cards to data center H100 and Blackwell chips.
Data Center GPU Lineup
| GPU | Architecture | FP16 TFLOPS | Memory | Bandwidth |
|---|---|---|---|---|
| A100 | Ampere (2020) | 312 | 80 GB HBM2e | 2.0 TB/s |
| H100 SXM | Hopper (2022) | 989 | 80 GB HBM3 | 3.35 TB/s |
| H200 | Hopper (2024) | 989 | 141 GB HBM3e | 4.8 TB/s |
| B200 | Blackwell (2024) | 2,250 | 192 GB HBM3e | 8.0 TB/s |
| GB200 | Grace-Blackwell (2024) | 2,250 per GPU | 2× 192 GB HBM3e | 8.0 TB/s per GPU + Grace CPU memory |
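The memory column in the table above often matters more than raw TFLOPS for serving large models. As a back-of-envelope sketch, the helper below estimates a model's inference footprint from parameter count and precision, then checks which single GPUs from the table could hold it. The 1.2× overhead factor for activations and KV cache is an assumption for illustration, not an NVIDIA figure.

```python
# Rough single-GPU memory-fit check using capacities from the table above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

GPU_MEMORY_GB = {
    "A100": 80,
    "H100 SXM": 80,
    "H200": 141,
    "B200": 192,
}

def model_memory_gb(num_params: float, dtype: str, overhead: float = 1.2) -> float:
    """Approximate serving footprint: weights plus an assumed 20% overhead
    for activations and KV cache (workload-dependent)."""
    return num_params * BYTES_PER_PARAM[dtype] * overhead / 1e9

def gpus_that_fit(num_params: float, dtype: str) -> list[str]:
    need = model_memory_gb(num_params, dtype)
    return [gpu for gpu, cap in GPU_MEMORY_GB.items() if cap >= need]

# A 70B-parameter model in FP16 needs roughly 168 GB under this estimate,
# so among these cards only the 192 GB B200 holds it on a single GPU.
print(gpus_that_fit(70e9, "fp16"))  # ['B200']
```

This is also why quantization matters: the same 70B model in FP8 drops to roughly 84 GB and fits on an H200.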
Tensor Cores
Tensor Cores are specialized processing units on NVIDIA GPUs designed for matrix multiply-accumulate operations:
- 4th Gen (Hopper): Support FP8, FP16, BF16, TF32, INT8, FP64 matrix operations
- Operation: Each Tensor Core computes D = A × B + C on small matrix tiles in a single clock cycle; warps issue matrix multiply-accumulate (MMA) instructions that map onto these units
- FP8 support: H100 and newer support FP8 for ~2x throughput over FP16 with minimal accuracy loss
- Sparsity: Structured sparsity (2:4 pattern) doubles Tensor Core throughput for sparse models
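The FP8 and sparsity multipliers above compound. A quick sketch of the arithmetic, starting from the H100 SXM dense FP16 figure in the table (989 TFLOPS); the 2× factors follow the bullets and are peak figures, not sustained throughput:

```python
# Peak Tensor Core throughput scaling on H100, per the bullets above.
H100_FP16_DENSE_TFLOPS = 989

def peak_tflops(base: float, fp8: bool = False, sparse_2_4: bool = False) -> float:
    tflops = base
    if fp8:
        tflops *= 2  # FP8 roughly doubles FP16 throughput
    if sparse_2_4:
        tflops *= 2  # 2:4 structured sparsity doubles it again
    return tflops

print(peak_tflops(H100_FP16_DENSE_TFLOPS, fp8=True))                   # 1978
print(peak_tflops(H100_FP16_DENSE_TFLOPS, fp8=True, sparse_2_4=True))  # 3956
```

Real workloads see a fraction of these peaks; memory bandwidth and kernel efficiency usually dominate.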
NVLink & NVSwitch
GPU interconnect technology for multi-GPU systems:
- NVLink (4th gen): 900 GB/s bidirectional bandwidth between GPUs on H100
- NVSwitch: Full-bandwidth, non-blocking switch connecting all GPUs in a node
- DGX H100: 8 H100 GPUs connected via NVSwitch, giving each GPU 900 GB/s of aggregate NVLink bandwidth to the fabric
- NVLink Network: Extends NVLink across nodes for rack-scale GPU clusters
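To see why that bandwidth matters, the sketch below estimates a lower bound on ring all-reduce time for synchronizing gradients across GPUs in a node. The formula (each GPU transfers 2(N−1)/N of the buffer) is the standard ring all-reduce cost model; latency and protocol overhead are ignored, and the PCIe comparison figure is an assumption for illustration:

```python
# Lower-bound ring all-reduce time: each of N GPUs sends and receives
# 2*(N-1)/N of the buffer, limited by per-GPU link bandwidth.
def allreduce_seconds(buffer_gb: float, num_gpus: int, link_gb_per_s: float) -> float:
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * buffer_gb
    return traffic_gb / link_gb_per_s

# All-reducing 10 GB of gradients across 8 GPUs:
# NVLink 4 at 900 GB/s per GPU vs. an assumed ~64 GB/s PCIe Gen5 x16 path.
print(round(allreduce_seconds(10, 8, 900), 4))  # ~0.0194 s
print(round(allreduce_seconds(10, 8, 64), 4))   # ~0.2734 s
```

The roughly 14× gap is the practical argument for NVLink in multi-GPU training, where this transfer happens every step.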
NVIDIA Software Platform
| Software | Purpose |
|---|---|
| CUDA | General-purpose GPU programming platform |
| cuDNN | Optimized deep learning primitives |
| TensorRT | Inference optimization and deployment |
| Triton Inference Server | Production model serving |
| NCCL | Multi-GPU collective communication |
| NeMo | LLM training and customization framework |
Choosing an NVIDIA GPU
- Learning/prototyping: RTX 4070 or 4080 (consumer, 12-16 GB)
- Small-scale training: RTX 4090 (24 GB) or A6000 (48 GB)
- Large-scale training: H100 or B200 (cloud instances)
- Inference at scale: L4 (cost-optimized) or H100 (high-throughput)
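The guidance above reduces to a simple lookup. A minimal sketch, mirroring the bullets exactly; the use-case keys are hypothetical names chosen here, not an official NVIDIA taxonomy:

```python
# Use-case -> GPU lookup, following the bullets above.
RECOMMENDATIONS = {
    "learning": ["RTX 4070", "RTX 4080"],        # prototyping, 12-16 GB
    "small_training": ["RTX 4090", "A6000"],     # 24-48 GB
    "large_training": ["H100", "B200"],          # cloud instances
    "inference": ["L4", "H100"],                 # cost vs. throughput
}

def recommend_gpu(use_case: str) -> list[str]:
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}")

print(recommend_gpu("small_training"))  # ['RTX 4090', 'A6000']
```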
Key takeaway: NVIDIA's hardware advantage comes from Tensor Cores, high-bandwidth memory, NVLink interconnects, and the comprehensive CUDA software ecosystem. The H100 and Blackwell architectures represent the current state-of-the-art for AI training and inference.
Lilly Tech Systems