AI Hardware
Master AI hardware end-to-end. 50 topics covering NVIDIA H100/H200/Blackwell, AMD MI300, Apple Silicon, Intel Gaudi, TPUs, Trainium, Cerebras, Groq, the GPU software stack (CUDA, Triton, ROCm, NCCL), inference engines (TensorRT-LLM, vLLM, SGLang, llama.cpp, MLX), memory and interconnect (HBM, NVLink, InfiniBand, RDMA, CXL), edge AI (Jetson, Coral, mobile NPUs), and AI data centers.
All Topics
50 topics organized into 7 categories spanning the full AI hardware stack.
GPUs
NVIDIA H100 Mastery
Master the NVIDIA H100 — the workhorse GPU of the LLM era. Learn the Hopper architecture, FP8 training, transformer engine, NVLink, and the patterns that get you peak throughput.
6 Lessons
NVIDIA H200
Master the H200 — the H100 with 141GB of HBM3e and ~1.4x faster LLM inference. Learn what changed, when it matters, and the migration patterns from H100.
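The ~1.4x claim falls out of the memory-bandwidth math. A rough sketch, assuming LLM decode is bandwidth-bound (per-token latency scales with how fast weights stream from HBM) and using the public spec figures for both parts:

```python
# Assumption: decode throughput scales with HBM bandwidth.
H100_BW_TBS = 3.35   # H100 SXM, HBM3 (public spec)
H200_BW_TBS = 4.8    # H200, HBM3e (public spec)

speedup = H200_BW_TBS / H100_BW_TBS
print(f"Bandwidth-bound decode speedup: ~{speedup:.2f}x")  # ~1.43x
```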
6 Lessons
NVIDIA B200/Blackwell
Master the Blackwell B200 and GB200 — NVIDIA's frontier GPU with FP4, second-gen transformer engine, and 5th-gen NVLink. Learn what 1 exaflop in a rack means for AI.
6 Lessons
NVIDIA A100
Master the A100 — still the workhorse for many AI workloads. Learn Ampere architecture, BF16, structured sparsity, MIG, and when A100 still beats H100 on $/throughput.
6 Lessons
NVIDIA L40S and L4
Master the L40S (Ada workstation/inference) and L4 (efficient inference). Learn when these beat H100/H200 for inference and graphics-AI hybrid workloads.
6 Lessons
NVIDIA Grace Hopper / Grace Blackwell
Master Grace Hopper (GH200) and Grace Blackwell (GB200) superchips — CPU+GPU sharing memory via NVLink-C2C. Learn unified memory, when it shines, and migration patterns.
6 Lessons
AMD MI300X / MI325X
Master AMD's flagship AI GPUs. Learn CDNA architecture, 192GB HBM3, ROCm software stack, and when MI300X/MI325X are a credible alternative to H100/H200.
6 Lessons
Apple Silicon for AI (M3/M4)
Master Apple's M3 and M4 chips for AI work. Learn unified memory, Neural Engine, MLX framework, and the patterns for running 70B models on a MacBook.
6 Lessons
Intel Gaudi 2 / Gaudi 3
Master Intel Gaudi 2 and Gaudi 3 AI accelerators. Learn the architecture, SynapseAI software, integrated networking, and when Gaudi beats GPUs on $/training.
6 Lessons
Consumer GPUs for AI (RTX 4090/5090)
Master consumer GPUs for AI: RTX 4090, RTX 5090, dual-GPU rigs. Learn what fits in 24GB, the cost-per-token math, and when consumer GPUs beat datacenter parts.
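The cost-per-token math can be sketched in a few lines. All numbers below are illustrative assumptions (card price, power draw, electricity rate, and throughput are placeholders, not measurements): amortize the card over three years, add electricity, divide by sustained token throughput.

```python
# Hypothetical cost-per-token sketch; every constant is an assumption.
card_price_usd = 1600.0          # assumed RTX 4090 street price
lifetime_hours = 3 * 365 * 24    # 3-year amortization
power_kw = 0.45                  # assumed sustained draw
electricity_usd_per_kwh = 0.15   # assumed rate
tokens_per_second = 100.0        # assumed 7B-class decode throughput

cost_per_hour = card_price_usd / lifetime_hours + power_kw * electricity_usd_per_kwh
cost_per_million_tokens = cost_per_hour / (tokens_per_second * 3600) * 1e6
print(f"~${cost_per_million_tokens:.3f} per million tokens")
```

Swap in your own prices and measured throughput; the structure of the calculation is the point.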
6 Lessons
Specialized Accelerators
Google TPU v5p / Trillium
Master Google TPUs — purpose-built ML accelerators. Learn TPU v5p, Trillium (v6), MXU, JAX integration, and the patterns for training and serving on TPU pods.
6 Lessons
AWS Trainium 2
Master AWS Trainium 2 — purpose-built training chip. Learn Neuron SDK, NKI kernels, Trn2 instances, and when Trainium beats GPUs on $/training-hour.
6 Lessons
AWS Inferentia
Master AWS Inferentia 2 for low-cost inference. Learn Inf2 instances, model compilation, batching patterns, and when Inferentia beats GPUs for inference.
6 Lessons
Cerebras Wafer-Scale Engine
Master Cerebras WSE — the world's largest chip. Learn the wafer-scale architecture, weight streaming, and the patterns for fastest-on-earth LLM inference.
6 Lessons
Groq LPU
Master Groq LPU — deterministic inference at insane speed. Learn the Tensor Streaming Processor, why it's so fast on LLMs, and when Groq fits your workload.
6 Lessons
SambaNova RDU
Master SambaNova RDU (Reconfigurable Dataflow Unit). Learn the dataflow architecture, SambaFlow software, and when RDUs beat GPUs for specific AI workloads.
6 Lessons
Graphcore IPU
Master Graphcore IPU — Intelligence Processing Unit. Learn the MIMD architecture, Poplar SDK, BOW IPU, and the niches where IPUs excel.
6 Lessons
Tenstorrent Wormhole / Blackhole
Master Tenstorrent: open-source RISC-V AI hardware. Learn Wormhole and Blackhole chips, TT-Metal SDK, and the open-hardware approach to AI acceleration.
6 Lessons
Etched Sohu (Transformer ASIC)
Master Etched Sohu — the first transformer-only ASIC. Learn what 'transformer baked in silicon' means, performance claims vs reality, and the implications.
6 Lessons
Microsoft Maia 100
Master Microsoft Maia 100 — Azure's custom AI accelerator. Learn the architecture, Azure integration, and Microsoft's silicon strategy for OpenAI workloads.
6 Lessons
GPU Software Stack
CUDA Programming for AI
Master CUDA C++ for AI work. Learn the programming model, kernels, shared memory, warp-level primitives, and the patterns to write GPU code that beats vendor libraries.
6 Lessons
OpenAI Triton Compiler
Master OpenAI Triton — Python-like GPU programming that compiles to high-performance kernels. Learn block programming, autotuning, and the patterns that beat hand-written CUDA.
6 Lessons
AMD ROCm Stack
Master ROCm — AMD's open-source GPU compute platform. Learn HIP, ROCm libraries, PyTorch on ROCm, and porting CUDA code to ROCm.
6 Lessons
cuDNN Deep Learning Library
Master cuDNN — NVIDIA's deep neural network primitives. Learn convolution algorithms, attention, batchnorm, and how to call cuDNN directly when frameworks fall short.
6 Lessons
cuBLAS and cuSPARSE
Master cuBLAS (dense) and cuSPARSE (sparse) linear algebra on GPU. Learn GEMM tuning, batched ops, mixed precision, and the patterns for max FLOPS.
6 Lessons
NCCL Collectives
Master NCCL — NVIDIA's multi-GPU communication library. Learn all-reduce, all-gather, broadcast, NVLink/InfiniBand topology, and tuning collectives for distributed training.
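The standard cost model behind collective tuning can be sketched as code. For ring all-reduce (the algorithm NCCL's ring path approximates), each GPU sends and receives 2(N-1)/N of the buffer, so time ≈ 2(N-1)/N · bytes / link bandwidth, ignoring latency terms. The 450 GB/s effective-bandwidth figure in the example is an assumption for illustration:

```python
# Ring all-reduce cost model (bandwidth term only, latency ignored).
def ring_allreduce_seconds(num_gpus, buffer_bytes, link_gbytes_per_s):
    traffic = 2 * (num_gpus - 1) / num_gpus * buffer_bytes
    return traffic / (link_gbytes_per_s * 1e9)

# Example: 1 GB gradient buffer across 8 GPUs at an assumed 450 GB/s
# effective NVLink bandwidth per GPU.
t = ring_allreduce_seconds(8, 1e9, 450)
print(f"{t * 1e3:.2f} ms")
```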
6 Lessons
NVIDIA NIM Microservices
Master NVIDIA NIM — pre-optimized inference microservices. Learn how to deploy NIMs on Kubernetes, customize, and integrate with your existing AI stack.
6 Lessons
NVIDIA Dynamo
Master NVIDIA Dynamo — distributed inference framework for LLMs. Learn disaggregated prefill/decode, KV cache routing, and the patterns for max throughput at scale.
6 LessonsInference Engines
TensorRT-LLM
Master TensorRT-LLM — NVIDIA's optimized LLM inference engine. Learn engine compilation, FP8/INT4 quantization, in-flight batching, and the patterns for peak throughput.
6 Lessons
vLLM Internals
Go beyond using vLLM — master its internals. Learn PagedAttention, continuous batching, scheduler, KV cache, and the patterns to tune vLLM for your workload.
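The quantity PagedAttention manages is easy to compute by hand: KV cache bytes per token = 2 (K and V) × layers × KV heads × head dim × dtype bytes. A sketch using Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128, no GQA) with an FP16 cache:

```python
# KV cache size math; dimensions assume a Llama-2-7B-like model, FP16 cache.
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
gb_for_4k_ctx = bytes_per_token * 4096 / 1e9
print(f"{bytes_per_token} bytes/token, {gb_for_4k_ctx:.2f} GB per 4k-token sequence")
```

Half a megabyte per token is why naive contiguous KV allocation wastes memory and why vLLM pages it.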
6 Lessons
SGLang Fast Inference
Master SGLang — fast LLM serving with structured generation. Learn RadixAttention, the SGLang frontend language, and when SGLang beats vLLM and TGI.
6 Lessons
HuggingFace TGI
Master HuggingFace Text Generation Inference. Learn deployment, quantization, multi-LoRA serving, and the patterns for production HuggingFace inference.
6 Lessons
llama.cpp Mastery
Master llama.cpp — the C++ inference engine that runs LLMs on anything. Learn GGUF, quantization formats, Metal/CUDA backends, and tuning for CPU, GPU, and edge.
6 Lessons
Apple MLX
Master Apple MLX — array framework optimized for Apple Silicon. Learn unified memory, lazy evaluation, MLX-LM for fast Apple Silicon LLM inference.
6 Lessons
ONNX Runtime
Master ONNX Runtime — cross-platform inference engine. Learn ONNX format, graph optimization, execution providers (CUDA, TensorRT, CPU), and the deployment patterns.
6 Lessons
Intel OpenVINO
Master Intel OpenVINO — inference toolkit for CPU, GPU, NPU, FPGA. Learn model conversion, optimization, deployment to Intel hardware, and edge inference patterns.
6 LessonsMemory & Interconnect
HBM (High Bandwidth Memory)
Master HBM — the memory tech that powers AI. Learn HBM2/HBM3/HBM3e, stacking, bandwidth math, and how memory bandwidth becomes the LLM bottleneck.
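The bandwidth-bottleneck claim has a one-line roofline behind it: each generated token must read every weight once, so batch-1 decode tokens/s ≤ HBM bandwidth / model bytes. A sketch assuming a 13B-parameter model in FP16 on a single H100 (3.35 TB/s HBM3):

```python
# Roofline ceiling for batch-1 decode: bandwidth / model size.
params = 13e9
bytes_per_param = 2          # FP16
hbm_bw_bytes = 3.35e12       # H100 SXM HBM3, public spec

max_tokens_per_s = hbm_bw_bytes / (params * bytes_per_param)
print(f"Roofline decode ceiling: ~{max_tokens_per_s:.0f} tokens/s (batch 1)")
```

Real systems land below this ceiling; batching and quantization are how you move it.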
6 Lessons
NVLink and NVSwitch
Master NVLink and NVSwitch — NVIDIA's GPU interconnect. Learn the bandwidth gen-by-gen, topology, NVL72 racks, and the patterns for max all-reduce performance.
6 Lessons
InfiniBand for AI Clusters
Master InfiniBand for AI clusters. Learn 200/400/800Gb HDR/NDR/XDR, Mellanox/NVIDIA Quantum switches, fat-tree topology, and the patterns for cluster networking.
6 Lessons
RDMA and RoCE
Master RDMA — Remote Direct Memory Access. Learn RoCE v2 over Ethernet, GPUDirect RDMA, and the patterns for low-latency multi-host GPU communication.
6 Lessons
CXL Memory Pooling
Master CXL — Compute Express Link. Learn cache-coherent memory pooling, CXL 2.0/3.0, memory disaggregation, and the patterns for tomorrow's AI infrastructure.
6 Lessons
PCIe for AI
Master PCIe for AI: Gen4, Gen5, Gen6. Learn lanes, bifurcation, bandwidth math, and when PCIe becomes the bottleneck for GPU-to-GPU and GPU-to-host data flow.
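The bandwidth math is simple: usable per-lane bandwidth roughly doubles each generation. The figures below are approximate per-direction numbers for an x16 link after encoding overhead:

```python
# Approximate usable PCIe bandwidth, per direction, x16 link.
gen4_x16 = 16 * 2.0    # ~2 GB/s per lane -> ~32 GB/s
gen5_x16 = 16 * 4.0    # ~4 GB/s per lane -> ~64 GB/s
gen6_x16 = 16 * 8.0    # ~8 GB/s per lane -> ~128 GB/s

# Example: moving a 16 GB checkpoint host -> GPU over Gen5 x16
seconds = 16 / gen5_x16
print(f"Gen5 x16 transfer of 16 GB: ~{seconds:.2f} s")  # ~0.25 s
```

Compare those numbers to 3+ TB/s of HBM and 900 GB/s of NVLink and it is clear why PCIe is often the bottleneck.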
6 Lessons
Edge AI Hardware
NVIDIA Jetson
Master NVIDIA Jetson: Orin Nano, Orin NX, AGX Orin. Learn the architecture, JetPack SDK, deployment patterns, and the use cases where Jetson dominates edge AI.
6 Lessons
Google Coral Edge TPU
Master Google Coral Edge TPU — fast, low-power AI for the edge. Learn the Edge TPU compiler, quantization requirements, and the deployment patterns.
6 Lessons
Mobile AI NPUs (Apple, Qualcomm)
Master mobile AI accelerators: Apple Neural Engine, Qualcomm Hexagon NPU, Samsung NPU. Learn the on-device AI patterns for iOS and Android.
6 Lessons
Hailo Edge Accelerators
Master Hailo-8 and Hailo-15 edge accelerators. Learn the dataflow architecture, Hailo SDK, and when Hailo beats Jetson and Coral for edge inference.
6 Lessons
Raspberry Pi AI HAT
Master Raspberry Pi for AI: Pi 5, AI HAT (Hailo), AI Camera. Learn what's possible on a $100 board, deployment patterns, and the hobbyist-to-production path.
6 Lessons
Edge Inference Optimization
Optimize models for edge: quantization, pruning, distillation, hardware-aware NAS. Learn the patterns to ship sub-100ms, sub-100MB AI on phones and embedded devices.
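The first-order sizing rule for quantization can be sketched directly: bytes ≈ params × bits / 8, ignoring scales, zero-points, and embedding overhead. It shows why 4-bit is often the difference between fitting on a phone NPU and not:

```python
# Quantized model size math (first-order: ignores scales/zero-points).
def model_gb(params_billions, bits):
    return params_billions * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"3B model @ {bits}-bit: {model_gb(3, bits):.2f} GB")
```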
6 Lessons
Data Center &amp; Cloud
AI Data Center Design and Cooling
Design AI data centers from the rack up. Learn power density (100kW+ racks), liquid cooling, immersion, sustainability, and the patterns for hyperscale AI infrastructure.
6 Lessons
Cloud GPU Providers
Pick the right cloud GPU provider. Compare AWS, GCP, Azure, Oracle, Lambda Labs, CoreWeave, RunPod, Modal, Together, Replicate. Learn pricing, availability, and quirks.
6 Lessons
Why an AI Hardware Track?
AI is now the most expensive workload in the data center. Understanding the silicon is what lets you ship fast, cheap, and reliably.
Every Major Vendor
NVIDIA, AMD, Apple, Intel, Google TPU, AWS Trainium, Cerebras, Groq, SambaNova, Graphcore, Tenstorrent, Etched, Microsoft Maia.
The Software Stack
CUDA, Triton, ROCm, cuDNN, cuBLAS, NCCL, NIM, Dynamo — the layers between your model and the metal.
Inference Engines
TensorRT-LLM, vLLM, SGLang, TGI, llama.cpp, MLX, ONNX Runtime, OpenVINO — pick the right engine for the job.
Edge to Data Center
Jetson, Coral, mobile NPUs all the way up to 100kW racks, liquid cooling, and hyperscale AI clusters.
Lilly Tech Systems