Intermediate

NVIDIA AI Hardware

NVIDIA dominates the AI accelerator market with GPUs ranging from consumer GeForce cards to data center H100 and Blackwell chips.

Data Center GPU Lineup

GPU | Architecture | FP16 TFLOPS | Memory | Bandwidth
--- | --- | --- | --- | ---
A100 | Ampere (2020) | 312 | 80 GB HBM2e | 2.0 TB/s
H100 SXM | Hopper (2022) | 989 | 80 GB HBM3 | 3.35 TB/s
H200 | Hopper (2024) | 989 | 141 GB HBM3e | 4.8 TB/s
B200 | Blackwell (2024) | 2,250 | 192 GB HBM3e | 8.0 TB/s
GB200 | Grace-Blackwell | 2,250 | 192 GB HBM3e + Grace CPU | 8.0 TB/s
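One practical use of the specs above: LLM token generation is usually memory-bandwidth bound, because producing each token streams all the weights through the GPU once. A rough latency floor is model bytes divided by memory bandwidth. The sketch below uses the table's numbers; the 70B example model and 2 bytes/param (FP16) are illustrative assumptions.

```python
# Bandwidth-bound lower bound on per-token decode latency:
# latency_floor ≈ model_bytes / memory_bandwidth.
# Bandwidth figures (TB/s) taken from the table above.

GPU_BANDWIDTH_TB_S = {
    "A100": 2.0,
    "H100": 3.35,
    "H200": 4.8,
    "B200": 8.0,
}

def decode_latency_ms(params_billions: float, bytes_per_param: float, gpu: str) -> float:
    """Lower-bound per-token latency (ms) when decoding is bandwidth bound."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth = GPU_BANDWIDTH_TB_S[gpu] * 1e12   # TB/s -> bytes/s
    return model_bytes / bandwidth * 1e3         # seconds -> milliseconds

# Hypothetical 70B-parameter model in FP16 (140 GB of weights; note this
# exceeds a single H100's 80 GB, so the comparison is about bandwidth only):
for name in ("H200", "B200"):
    print(f"{name}: ~{decode_latency_ms(70, 2, name):.1f} ms/token floor")
```

This is why inference-heavy deployments often care more about the bandwidth column than the TFLOPS column.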

Tensor Cores

Tensor Cores are specialized processing units on NVIDIA GPUs designed for matrix multiply-accumulate operations:

  • 4th Gen (Hopper): Support FP8, FP16, BF16, TF32, INT8, FP64 matrix operations
  • Operation: Perform D = A × B + C, where A, B, C, D are small matrix tiles, as a single fused multiply-accumulate operation executed per warp
  • FP8 support: H100 and newer support FP8 for ~2x throughput over FP16 with minimal accuracy loss
  • Sparsity: Structured sparsity (2:4 pattern) doubles Tensor Core throughput for sparse models
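The 2:4 structured-sparsity pattern mentioned above means that within every contiguous group of four weights, at most two may be nonzero. The sketch below prunes a weight vector to that pattern by keeping the two largest-magnitude values per group; this is an illustrative heuristic, not NVIDIA's pruning tooling.

```python
# Illustrative 2:4 structured-sparsity pruning: in each group of 4 weights,
# zero out the 2 smallest-magnitude entries so Tensor Cores can skip them.

def prune_2_of_4(weights):
    """Return a copy of `weights` satisfying the 2:4 sparsity pattern."""
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the 2 largest-magnitude entries in this group of 4
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]))
# → [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Each group of four retains exactly two nonzeros, which is what lets the hardware halve the work per matrix multiply.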

NVLink & NVSwitch

GPU interconnect technology for multi-GPU systems:

  • NVLink (4th gen): 900 GB/s bidirectional bandwidth between GPUs on H100
  • NVSwitch: Full-bandwidth, non-blocking switch connecting all GPUs in a node
  • DGX H100: 8 H100 GPUs connected via NVSwitch, giving each GPU 900 GB/s of NVLink bandwidth to the fabric
  • NVLink Network: Extends NVLink across nodes for rack-scale GPU clusters
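The interconnect bandwidth above directly bounds how fast gradients can be synchronized. A back-of-envelope model: in a ring all-reduce (the algorithm NCCL commonly uses), each GPU sends and receives about 2·(N−1)/N of the buffer, so time ≈ 2·(N−1)/N · bytes / link bandwidth. The sketch below assumes 450 GB/s per direction (half of H100's 900 GB/s bidirectional figure); the 10 GB gradient buffer is an illustrative assumption.

```python
# Back-of-envelope ring all-reduce cost over NVLink:
# each GPU moves roughly 2*(N-1)/N of the buffer through its link.

def allreduce_time_ms(buffer_gb: float, n_gpus: int, link_gb_per_s: float = 450.0) -> float:
    """Estimated time (ms) for a ring all-reduce of `buffer_gb` across `n_gpus`."""
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * buffer_gb
    return traffic_gb / link_gb_per_s * 1e3

# Summing 10 GB of gradients across the 8 GPUs of a DGX H100:
print(f"~{allreduce_time_ms(10, 8):.1f} ms")
```

Note the traffic term approaches 2× the buffer size as N grows, which is why per-GPU link bandwidth, not GPU count, dominates all-reduce time at scale.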

NVIDIA Software Platform

Software | Purpose
--- | ---
CUDA | General-purpose GPU programming platform
cuDNN | Optimized deep learning primitives
TensorRT | Inference optimization and deployment
Triton Inference Server | Production model serving
NCCL | Multi-GPU collective communication
NeMo | LLM training and customization framework

Choosing an NVIDIA GPU

  • Learning/prototyping: RTX 4070 or 4080 (consumer, 12-16 GB)
  • Small-scale training: RTX 4090 (24 GB) or A6000 (48 GB)
  • Large-scale training: H100 or B200 (cloud instances)
  • Inference at scale: L4 (cost-optimized) or H100 (high-throughput)
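A quick way to apply the guidance above is a VRAM fit check: model weights (parameters × bytes per parameter) plus some runtime overhead must fit in GPU memory. The sketch below uses the memory sizes listed in this section; the 2 GB overhead and the example model sizes are illustrative assumptions.

```python
# Rough VRAM fit check for inference: weights + overhead <= GPU memory.
# Memory sizes (GB) come from the recommendations above; overhead is a guess.

GPU_MEMORY_GB = {"RTX 4070": 12, "RTX 4090": 24, "A6000": 48, "H100": 80}

def fits(params_billions: float, bytes_per_param: float, gpu: str,
         overhead_gb: float = 2.0) -> bool:
    """True if the model's weights plus overhead fit in the GPU's memory."""
    needed_gb = params_billions * bytes_per_param + overhead_gb
    return needed_gb <= GPU_MEMORY_GB[gpu]

# A 7B model in FP16 (2 bytes/param) needs ~16 GB with overhead:
print(fits(7, 2.0, "RTX 4070"))   # → False (12 GB card)
print(fits(7, 2.0, "RTX 4090"))   # → True  (24 GB card)
# 4-bit quantization (~0.5 bytes/param) brings it within consumer range:
print(fits(7, 0.5, "RTX 4070"))   # → True
```

The same arithmetic explains the tiers above: quantization and model size, not raw compute, usually decide which GPU class you need.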
Key takeaway: NVIDIA's hardware advantage comes from Tensor Cores, high-bandwidth memory, NVLink interconnects, and the comprehensive CUDA software ecosystem. The H100 and Blackwell architectures represent the current state-of-the-art for AI training and inference.