Intermediate

NVIDIA AI Hardware

NVIDIA dominates the AI accelerator market with GPUs ranging from consumer GeForce cards to data center H100 and Blackwell chips.

Data Center GPU Lineup

GPU | Architecture | FP16 TFLOPS | Memory | Bandwidth
--- | --- | --- | --- | ---
A100 | Ampere (2020) | 312 | 80 GB HBM2e | 2.0 TB/s
H100 SXM | Hopper (2022) | 989 | 80 GB HBM3 | 3.35 TB/s
H200 | Hopper (2024) | 989 | 141 GB HBM3e | 4.8 TB/s
B200 | Blackwell (2024) | 2,250 | 192 GB HBM3e | 8.0 TB/s
GB200 | Grace-Blackwell | 2,250 | 192 GB HBM3e + Grace CPU | 8.0 TB/s
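One practical use of the specs above: LLM token generation is usually memory-bandwidth bound, because producing each token streams all the weights through the GPU once. A rough latency floor is model bytes divided by memory bandwidth. The sketch below uses the table's numbers; the 70B example model and 2 bytes/param (FP16) are illustrative assumptions.

```python
# Bandwidth-bound lower bound on per-token decode latency:
# latency_floor ≈ model_bytes / memory_bandwidth.
# Bandwidth figures (TB/s) taken from the table above.

GPU_BANDWIDTH_TB_S = {
    "A100": 2.0,
    "H100": 3.35,
    "H200": 4.8,
    "B200": 8.0,
}

def decode_latency_ms(params_billions: float, bytes_per_param: float, gpu: str) -> float:
    """Lower-bound per-token latency (ms) when decoding is bandwidth bound."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth = GPU_BANDWIDTH_TB_S[gpu] * 1e12   # TB/s -> bytes/s
    return model_bytes / bandwidth * 1e3         # seconds -> milliseconds

# Hypothetical 70B-parameter model in FP16 (140 GB of weights; note this
# exceeds a single H100's 80 GB, so the comparison is about bandwidth only):
for name in ("H200", "B200"):
    print(f"{name}: ~{decode_latency_ms(70, 2, name):.1f} ms/token floor")
```

This is why inference-heavy deployments often care more about the bandwidth column than the TFLOPS column.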

Tensor Cores

Tensor Cores are specialized processing units on NVIDIA GPUs designed for matrix multiply-accumulate operations:

  • 4th Gen (Hopper): Support FP8, FP16, BF16, TF32, INT8, FP64 matrix operations
  • Operation: Perform D = A × B + C, where A, B, C, D are small matrix tiles, as a single fused multiply-accumulate operation executed per warp
  • FP8 support: H100 and newer support FP8 for ~2x throughput over FP16 with minimal accuracy loss
  • Sparsity: Structured sparsity (2:4 pattern) doubles Tensor Core throughput for sparse models
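The 2:4 structured-sparsity pattern mentioned above means that within every contiguous group of four weights, at most two may be nonzero. The sketch below prunes a weight vector to that pattern by keeping the two largest-magnitude values per group; this is an illustrative heuristic, not NVIDIA's pruning tooling.

```python
# Illustrative 2:4 structured-sparsity pruning: in each group of 4 weights,
# zero out the 2 smallest-magnitude entries so Tensor Cores can skip them.

def prune_2_of_4(weights):
    """Return a copy of `weights` satisfying the 2:4 sparsity pattern."""
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the 2 largest-magnitude entries in this group of 4
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]))
# → [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Each group of four retains exactly two nonzeros, which is what lets the hardware halve the work per matrix multiply.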

NVLink & NVSwitch

GPU interconnect technology for multi-GPU systems:

  • NVLink (4th gen): 900 GB/s bidirectional bandwidth between GPUs on H100
  • NVSwitch: Full-bandwidth, non-blocking switch connecting all GPUs in a node
  • DGX H100: 8 H100 GPUs connected via NVSwitch, giving each GPU 900 GB/s of NVLink bandwidth to the fabric
  • NVLink Network: Extends NVLink across nodes for rack-scale GPU clusters
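The interconnect bandwidth above directly bounds how fast gradients can be synchronized. A back-of-envelope model: in a ring all-reduce (the algorithm NCCL commonly uses), each GPU sends and receives about 2·(N−1)/N of the buffer, so time ≈ 2·(N−1)/N · bytes / link bandwidth. The sketch below assumes 450 GB/s per direction (half of H100's 900 GB/s bidirectional figure); the 10 GB gradient buffer is an illustrative assumption.

```python
# Back-of-envelope ring all-reduce cost over NVLink:
# each GPU moves roughly 2*(N-1)/N of the buffer through its link.

def allreduce_time_ms(buffer_gb: float, n_gpus: int, link_gb_per_s: float = 450.0) -> float:
    """Estimated time (ms) for a ring all-reduce of `buffer_gb` across `n_gpus`."""
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * buffer_gb
    return traffic_gb / link_gb_per_s * 1e3

# Summing 10 GB of gradients across the 8 GPUs of a DGX H100:
print(f"~{allreduce_time_ms(10, 8):.1f} ms")
```

Note the traffic term approaches 2× the buffer size as N grows, which is why per-GPU link bandwidth, not GPU count, dominates all-reduce time at scale.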

NVIDIA Software Platform

Software | Purpose
--- | ---
CUDA | General-purpose GPU programming platform
cuDNN | Optimized deep learning primitives
TensorRT | Inference optimization and deployment
Triton Inference Server | Production model serving
NCCL | Multi-GPU collective communication
NeMo | LLM training and customization framework

Choosing an NVIDIA GPU

  • Learning/prototyping: RTX 4070 or 4080 (consumer, 12-16 GB)
  • Small-scale training: RTX 4090 (24 GB) or A6000 (48 GB)
  • Large-scale training: H100 or B200 (cloud instances)
  • Inference at scale: L4 (cost-optimized) or H100 (high-throughput)
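A quick way to apply the guidance above is a VRAM fit check: model weights (parameters × bytes per parameter) plus some runtime overhead must fit in GPU memory. The sketch below uses the memory sizes listed in this section; the 2 GB overhead and the example model sizes are illustrative assumptions.

```python
# Rough VRAM fit check for inference: weights + overhead <= GPU memory.
# Memory sizes (GB) come from the recommendations above; overhead is a guess.

GPU_MEMORY_GB = {"RTX 4070": 12, "RTX 4090": 24, "A6000": 48, "H100": 80}

def fits(params_billions: float, bytes_per_param: float, gpu: str,
         overhead_gb: float = 2.0) -> bool:
    """True if the model's weights plus overhead fit in the GPU's memory."""
    needed_gb = params_billions * bytes_per_param + overhead_gb
    return needed_gb <= GPU_MEMORY_GB[gpu]

# A 7B model in FP16 (2 bytes/param) needs ~16 GB with overhead:
print(fits(7, 2.0, "RTX 4070"))   # → False (12 GB card)
print(fits(7, 2.0, "RTX 4090"))   # → True  (24 GB card)
# 4-bit quantization (~0.5 bytes/param) brings it within consumer range:
print(fits(7, 0.5, "RTX 4070"))   # → True
```

The same arithmetic explains the tiers above: quantization and model size, not raw compute, usually decide which GPU class you need.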
Key takeaway: NVIDIA's hardware advantage comes from Tensor Cores, high-bandwidth memory, NVLink interconnects, and the comprehensive CUDA software ecosystem. The H100 and Blackwell architectures represent the current state-of-the-art for AI training and inference.