AI Hardware

Master AI hardware end-to-end. 50 topics covering NVIDIA H100/H200/Blackwell, AMD MI300, Apple Silicon, Intel Gaudi, TPUs, Trainium, Cerebras, Groq, the GPU software stack (CUDA, Triton, ROCm, NCCL), inference engines (TensorRT-LLM, vLLM, SGLang, llama.cpp, MLX), memory and interconnect (HBM, NVLink, InfiniBand, RDMA, CXL), edge AI (Jetson, Coral, mobile NPUs), and AI data centers.

50 Topics
300 Lessons
7 Categories
100% Free

All Topics

50 topics organized into 7 categories spanning the full AI hardware stack.

GPUs

🎯

NVIDIA H100 Mastery

Master the NVIDIA H100 — the workhorse GPU of the LLM era. Learn the Hopper architecture, FP8 training, transformer engine, NVLink, and the patterns that get you peak throughput.

6 Lessons
🚀

NVIDIA H200

Master the H200 — the H100 with 141GB of HBM3e and ~1.4x faster LLM inference. Learn what changed, when it matters, and the migration patterns from H100.

6 Lessons
🌟

NVIDIA B200/Blackwell

Master the Blackwell B200 and GB200 — NVIDIA's frontier GPU with FP4, second-gen transformer engine, and 5th-gen NVLink. Learn what 1 exaflop in a rack means for AI.

6 Lessons

NVIDIA A100

Master the A100 — still the workhorse for many AI workloads. Learn Ampere architecture, BF16, structured sparsity, MIG, and when A100 still beats H100 on $/throughput.

6 Lessons
🎥

NVIDIA L40S and L4

Master the L40S (Ada workstation/inference) and L4 (efficient inference). Learn when these beat H100/H200 for inference and graphics-AI hybrid workloads.

6 Lessons
🧠

NVIDIA Grace Hopper / Grace Blackwell

Master Grace Hopper (GH200) and Grace Blackwell (GB200) superchips — CPU+GPU sharing memory via NVLink-C2C. Learn unified memory, when it shines, and migration patterns.

6 Lessons
🔥

AMD MI300X / MI325X

Master AMD's flagship AI GPUs. Learn CDNA architecture, 192GB HBM3, ROCm software stack, and when MI300X/MI325X are a credible alternative to H100/H200.

6 Lessons
📱

Apple Silicon for AI (M3/M4)

Master Apple's M3 and M4 chips for AI work. Learn unified memory, Neural Engine, MLX framework, and the patterns for running 70B models on a MacBook.

6 Lessons
🔬

Intel Gaudi 2 / Gaudi 3

Master Intel Gaudi 2 and Gaudi 3 AI accelerators. Learn the architecture, SynapseAI software, integrated networking, and when Gaudi beats GPUs on $/training.

6 Lessons
🎮

Consumer GPUs for AI (RTX 4090/5090)

Master consumer GPUs for AI: RTX 4090, RTX 5090, dual-GPU rigs. Learn what fits in 24GB, the cost-per-token math, and when consumer GPUs beat datacenter parts.

6 Lessons
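The "what fits in 24GB" question above comes down to simple arithmetic. A rough sketch, assuming weights dominate memory (KV cache, activations, and runtime overhead add several GB on top in practice):

```python
def weight_memory_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Rough weight-only memory footprint in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead, which add
    several GB on top in practice.
    """
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model in FP16 vs. 4-bit quantization:
print(weight_memory_gb(7, 16))   # 14.0 GB -- tight but workable on a 24GB RTX 4090
print(weight_memory_gb(7, 4))    # 3.5 GB -- leaves room for long contexts

# A 70B model at 4 bits needs ~35 GB of weights alone, which is why
# it takes two 24GB consumer cards (or CPU offload) to run.
print(weight_memory_gb(70, 4))   # 35.0
```

The same arithmetic drives the cost-per-token math: once you know how many cards a model needs, throughput per dollar follows from the hardware price and the tokens/s you measure.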

Specialized Accelerators

🔭

Google TPU v5p / Trillium

Master Google TPUs — purpose-built ML accelerators. Learn TPU v5p, Trillium (v6), MXU, JAX integration, and the patterns for training and serving on TPU pods.

6 Lessons

AWS Trainium 2

Master AWS Trainium 2 — purpose-built training chip. Learn Neuron SDK, NKI kernels, Trn2 instances, and when Trainium beats GPUs on $/training-hour.

6 Lessons
🔥

AWS Inferentia

Master AWS Inferentia 2 for low-cost inference. Learn Inf2 instances, model compilation, batching patterns, and when Inferentia beats GPUs for inference.

6 Lessons
🏹

Cerebras Wafer-Scale Engine

Master Cerebras WSE — the world's largest chip. Learn the wafer-scale architecture, weight streaming, and the patterns for fastest-on-earth LLM inference.

6 Lessons

Groq LPU

Master Groq LPU — deterministic inference at extreme speed. Learn the Tensor Streaming Processor architecture, why it's so fast on LLMs, and when Groq fits your workload.

6 Lessons
📊

SambaNova RDU

Master SambaNova RDU (Reconfigurable Dataflow Unit). Learn the dataflow architecture, SambaFlow software, and when RDUs beat GPUs for specific AI workloads.

6 Lessons
🔗

Graphcore IPU

Master Graphcore IPU — Intelligence Processing Unit. Learn the MIMD architecture, Poplar SDK, BOW IPU, and the niches where IPUs excel.

6 Lessons
🔮

Tenstorrent Wormhole / Blackhole

Master Tenstorrent: open-source RISC-V AI hardware. Learn Wormhole and Blackhole chips, TT-Metal SDK, and the open-hardware approach to AI acceleration.

6 Lessons
🛡

Etched Sohu (Transformer ASIC)

Master Etched Sohu — the first transformer-only ASIC. Learn what 'transformer baked in silicon' means, performance claims vs reality, and the implications.

6 Lessons
📝

Microsoft Maia 100

Master Microsoft Maia 100 — Azure's custom AI accelerator. Learn the architecture, Azure integration, and Microsoft's silicon strategy for OpenAI workloads.

6 Lessons

GPU Software Stack

💻

CUDA Programming for AI

Master CUDA C++ for AI work. Learn the programming model, kernels, shared memory, warp-level primitives, and the patterns to write GPU code that beats vendor libraries.

6 Lessons

OpenAI Triton Compiler

Master OpenAI Triton — Python-like GPU programming that compiles to high-performance kernels. Learn block programming, autotuning, and the patterns that beat hand-written CUDA.

6 Lessons
🛡

AMD ROCm Stack

Master ROCm — AMD's open-source GPU compute platform. Learn HIP, ROCm libraries, PyTorch on ROCm, and porting CUDA code to ROCm.

6 Lessons
🧠

cuDNN Deep Learning Library

Master cuDNN — NVIDIA's deep neural network primitives. Learn convolution algorithms, attention, batchnorm, and how to call cuDNN directly when frameworks fall short.

6 Lessons
🔬

cuBLAS and cuSPARSE

Master cuBLAS (dense) and cuSPARSE (sparse) linear algebra on GPU. Learn GEMM tuning, batched ops, mixed precision, and the patterns for max FLOPS.

6 Lessons
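The "max FLOPS" framing is concrete: a GEMM of shapes (M,K)x(K,N) does 2*M*N*K floating-point operations, and utilization is achieved FLOPs per second divided by the chip's peak. A back-of-envelope sketch (the peak number here is an illustrative figure in the H100 SXM FP16 class — check your part's datasheet):

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """An (m,k) x (k,n) matmul does m*n*k multiply-adds = 2*m*n*k FLOPs."""
    return 2 * m * n * k

def utilization(m, n, k, seconds, peak_tflops):
    """Fraction of peak FLOPS achieved by one GEMM."""
    return gemm_flops(m, n, k) / seconds / (peak_tflops * 1e12)

# Example: an 8192^3 GEMM measured at 1.4 ms against an assumed
# ~990 TFLOPS dense FP16 peak -- roughly 79% of peak, which is
# about what well-tuned cuBLAS achieves on large square GEMMs.
flops = gemm_flops(8192, 8192, 8192)   # ~1.1e12 FLOPs
print(flops, round(utilization(8192, 8192, 8192, 1.4e-3, 990), 2))
```

Small or skinny GEMMs fall far below this fraction, which is why batched ops and shape-aware tuning matter as much as raw peak.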
🔗

NCCL Collectives

Master NCCL — NVIDIA's multi-GPU communication library. Learn all-reduce, all-gather, broadcast, NVLink/InfiniBand topology, and tuning collectives for distributed training.

6 Lessons
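The ring all-reduce at the heart of NCCL can be sketched in plain Python — a toy simulation of the algorithm, not the NCCL API. Each of n ranks sends one chunk to its ring neighbor per step: a reduce-scatter phase followed by an all-gather phase, 2*(n-1) steps total, so each rank moves only 2*(n-1)/n of the data regardless of rank count:

```python
def ring_allreduce(rank_data):
    """Toy simulation of NCCL-style ring all-reduce (sum).

    rank_data: one equal-length vector per rank, length divisible
    by the rank count. Runs reduce-scatter then all-gather, each
    in n-1 steps around the ring.
    """
    n = len(rank_data)
    chunk = len(rank_data[0]) // n
    data = [list(v) for v in rank_data]

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r-s) mod n
    # to rank (r+1) mod n, which adds it into its copy. Outgoing chunks
    # are snapshotted so all sends within a step happen "at once".
    for s in range(n - 1):
        sends = [((r - s) % n, data[r][sl((r - s) % n)]) for r in range(n)]
        for r in range(n):
            c, payload = sends[(r - 1) % n]
            data[r][sl(c)] = [a + b for a, b in zip(data[r][sl(c)], payload)]

    # Phase 2: all-gather. Rank r now owns the fully reduced chunk
    # (r+1) mod n; circulate the reduced chunks, overwriting.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, data[r][sl((r + 1 - s) % n)]) for r in range(n)]
        for r in range(n):
            c, payload = sends[(r - 1) % n]
            data[r][sl(c)] = payload

    return data

result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result)  # every rank ends up with the element-wise sum [12, 15, 18]
```

Real NCCL layers this logic onto NVLink and InfiniBand topology, picks between ring and tree algorithms, and pipelines chunks — but the data movement pattern is the one above.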
📥

NVIDIA NIM Microservices

Master NVIDIA NIM — pre-optimized inference microservices. Learn how to deploy NIMs on Kubernetes, customize, and integrate with your existing AI stack.

6 Lessons
🚀

NVIDIA Dynamo

Master NVIDIA Dynamo — distributed inference framework for LLMs. Learn disaggregated prefill/decode, KV cache routing, and the patterns for max throughput at scale.

6 Lessons

Inference Engines

🎯

TensorRT-LLM

Master TensorRT-LLM — NVIDIA's optimized LLM inference engine. Learn engine compilation, FP8/INT4 quantization, in-flight batching, and the patterns for peak throughput.

6 Lessons

vLLM Internals

Go beyond using vLLM — master its internals. Learn PagedAttention, continuous batching, scheduler, KV cache, and the patterns to tune vLLM for your workload.

6 Lessons
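The core idea behind PagedAttention — KV cache stored in fixed-size blocks addressed through a per-sequence block table, like virtual-memory pages — fits in a few lines of plain Python. This is a conceptual model, not vLLM's actual implementation:

```python
class PagedKVCache:
    """Conceptual sketch of vLLM-style paged KV cache allocation.

    Tokens live in fixed-size blocks; each sequence keeps a block
    table mapping logical block index -> physical block id. Freeing
    a finished sequence returns whole blocks to the pool, so
    variable-length sequences cause no external fragmentation.
    """
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:         # current block full (or none yet)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(33):                  # 33 tokens -> ceil(33/16) = 3 blocks
    cache.append_token("seq-A")
print(len(cache.block_tables["seq-A"]), len(cache.free_blocks))  # 3 5
cache.free("seq-A")
print(len(cache.free_blocks))        # 8
```

Because blocks are the unit of allocation, the scheduler can admit new sequences whenever enough free blocks exist — the bookkeeping that makes continuous batching practical.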
🚀

SGLang Fast Inference

Master SGLang — fast LLM serving with structured generation. Learn RadixAttention, the SGLang frontend language, and when SGLang beats vLLM and TGI.

6 Lessons
🤗

HuggingFace TGI

Master HuggingFace Text Generation Inference. Learn deployment, quantization, multi-LoRA serving, and the patterns for production HuggingFace inference.

6 Lessons
🤗

llama.cpp Mastery

Master llama.cpp — the C++ inference engine that runs LLMs on anything. Learn GGUF, quantization formats, Metal/CUDA backends, and tuning for CPU, GPU, and edge.

6 Lessons
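GGUF's quantization formats are variations on one theme: store weights at low bit width with a per-block floating-point scale. A minimal sketch of symmetric per-block quantization — the idea behind formats like Q8_0, not llama.cpp's actual code:

```python
def quantize_block(weights, bits=8):
    """Symmetric per-block quantization: one fp scale per block.

    qmax is 127 for int8, 7 for int4; each weight becomes an
    integer in [-qmax-1, qmax] plus a shared scale.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # 1.0 guards all-zero blocks
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, -0.07]
q, scale = quantize_block(weights, bits=8)
restored = dequantize_block(q, scale)
# Rounding error is bounded by half the scale step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12
```

Lower bit widths shrink the file and speed up memory-bound decode at the cost of a coarser scale step — the trade-off behind the Q8/Q5/Q4 family.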
🍏

Apple MLX

Master Apple MLX — the array framework optimized for Apple Silicon. Learn unified memory, lazy evaluation, and MLX-LM for fast LLM inference on Apple Silicon.

6 Lessons
🔗

ONNX Runtime

Master ONNX Runtime — cross-platform inference engine. Learn ONNX format, graph optimization, execution providers (CUDA, TensorRT, CPU), and the deployment patterns.

6 Lessons
🔬

Intel OpenVINO

Master Intel OpenVINO — inference toolkit for CPU, GPU, NPU, FPGA. Learn model conversion, optimization, deployment to Intel hardware, and edge inference patterns.

6 Lessons

Memory & Interconnect

Edge AI Hardware

Data Center & Cloud

Why an AI Hardware Track?

AI is now the most expensive workload in the data center. Understanding the silicon is what lets you ship systems that are fast, cheap, and reliable.

🎯

Every Major Vendor

NVIDIA, AMD, Apple, Intel, Google TPU, AWS Trainium, Cerebras, Groq, SambaNova, Graphcore, Tenstorrent, Etched, Microsoft Maia.

💻

The Software Stack

CUDA, Triton, ROCm, cuDNN, cuBLAS, NCCL, NIM, Dynamo — the layers between your model and the metal.

Inference Engines

TensorRT-LLM, vLLM, SGLang, TGI, llama.cpp, MLX, ONNX Runtime, OpenVINO — pick the right engine for the job.

🏠

Edge to Data Center

Jetson, Coral, mobile NPUs all the way up to 100kW racks, liquid cooling, and hyperscale AI clusters.