Advanced

AI Accelerator Best Practices

Practical guidance for selecting, optimizing, and managing AI hardware for training and inference workloads.

Hardware Selection Guide

| Workload | Recommended Hardware | Reasoning |
| --- | --- | --- |
| LLM training (>10B params) | H100/B200 cluster | Need large memory, fast interconnect, mature ecosystem |
| LLM training (JAX) | TPU v5p pod | Cost-effective for JAX workloads, excellent scaling |
| LLM inference | H100, L4, or Inferentia2 | Depends on latency vs throughput vs cost priorities |
| Computer vision training | A100 or H100 | Mature PyTorch ecosystem, good batch processing |
| On-device inference | Apple Neural Engine, Qualcomm NPU | Power-efficient, privacy-preserving |
| Research/prototyping | RTX 4090 or cloud spot instances | Best price-performance for experimentation |

Cost Optimization

  1. Use Spot/Preemptible Instances

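    Spot and preemptible instances are typically 60-90% cheaper than on-demand, but the provider can reclaim them at short notice. Implement checkpointing so jobs resume gracefully after an interruption.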

  2. Right-Size Your Hardware

    Don't use H100s for tasks that run fine on L4s. Profile your workload to understand actual GPU utilization.

  3. Optimize Before Scaling

    Mixed precision, torch.compile, and efficient data loading can often double throughput before you add more GPUs.

  4. Use Quantization for Inference

    INT8 or INT4 quantization can reduce model size 2-4x relative to FP16, often with minimal accuracy loss, so the model fits on cheaper GPUs.

  5. Monitor and Auto-Scale

    Track GPU utilization and scale down during off-peak hours. Don't pay for idle GPUs.
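
The checkpointing advice for spot instances can be sketched as follows. This is a minimal, framework-free illustration, not a production recipe: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state layout, and the `stop_after` parameter that simulates a preemption are all hypothetical. In a real PyTorch job the saved state would typically be the model and optimizer state dicts rather than a small dictionary.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=100, stop_after=None):
    """Toy training loop; stop_after simulates a spot interruption (hypothetical)."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated preemption: work since the last checkpoint is lost
        state["loss_sum"] += 1.0 / (state["step"] + 1)  # stand-in for a real training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `train` again with the same path after an interruption resumes from the last saved step, so at most `checkpoint_every` steps are repeated. Choosing that interval is a trade-off between checkpoint overhead and the amount of work lost per interruption.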

Framework Compatibility

  • PyTorch: Best on NVIDIA GPUs (CUDA). TPU support via torch_xla (PyTorch/XLA). Growing Apple Silicon support via the MPS backend.
  • JAX: First-class support on TPUs and GPUs. XLA compilation for both platforms. Best choice for TPU workloads.
  • TensorFlow: Good TPU support, solid NVIDIA GPU support. Less popular for new projects.
  • MLX: Apple Silicon only. Great for local LLM inference on Mac.
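
For PyTorch code that should run across the backends listed above, a small device-selection helper is one common pattern. The sketch below is an assumption-laden illustration: `pick_torch_device` is a hypothetical helper name, it covers only CUDA, Apple's MPS backend, and CPU, and it does not handle TPUs (which need torch_xla and a separate setup). It falls back to "cpu" when PyTorch is not installed.

```python
def pick_torch_device():
    """Return the name of the best available PyTorch device, falling back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to select

    if torch.cuda.is_available():
        return "cuda"

    # Older PyTorch builds may not ship the MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device_name = pick_torch_device()
print(f"Selected device: {device_name}")
```

The returned string can be passed to `torch.device(...)` or to `.to(...)` on a model or tensor.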

Frequently Asked Questions

Should I use GPUs or TPUs?

Use GPUs if you need PyTorch, maximum flexibility, or want to run on any cloud. Use TPUs if you use JAX/TensorFlow, need very large-scale training, and are on Google Cloud. GPUs have a larger ecosystem; TPUs can be more cost-effective for specific workloads.

Will NVIDIA's dominance continue?

NVIDIA's CUDA ecosystem creates strong lock-in. However, competition is increasing: Google TPUs, AMD MI300X, AWS Trainium, and custom chips from Microsoft and Meta. The trend is toward more hardware diversity, but NVIDIA will likely remain the default choice for most workloads in the near term.

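How do I estimate training cost?

Use the scaling-law approximation: training compute (FLOPs) ≈ 6 × model_params × training_tokens. Divide by your GPU's sustained throughput in FLOP/s (typically 30-50% of peak) to get GPU-seconds, convert to GPU-hours, then multiply by the hourly price. For a 7B model on 1T tokens: ~4.2e22 FLOPs. At 40% of an H100's ~990 TFLOPS dense BF16 peak, that is roughly 30,000 H100-hours (~$90,000 at an assumed spot price of $3 per GPU-hour).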
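
The estimate can be worked through in a few lines of Python. This is a rough sketch under stated assumptions: `estimate_training_cost` is a hypothetical helper, 989 TFLOPS is the H100 SXM dense BF16 peak, 40% utilization sits in the middle of the 30-50% range quoted above, and $3 per GPU-hour is an assumed price, not a quote.

```python
def estimate_training_cost(params, tokens, peak_tflops, utilization, price_per_gpu_hour):
    """Estimate total FLOPs, GPU-hours, and cost from the 6 * N * D approximation."""
    total_flops = 6 * params * tokens                          # total training compute
    sustained_flops_per_s = peak_tflops * 1e12 * utilization   # achieved FLOP/s per GPU
    gpu_hours = total_flops / sustained_flops_per_s / 3600     # seconds -> hours
    cost = gpu_hours * price_per_gpu_hour
    return total_flops, gpu_hours, cost


# 7B parameters, 1T tokens, H100 at an assumed 40% utilization and $3 per GPU-hour.
flops, hours, cost = estimate_training_cost(
    params=7e9,
    tokens=1e12,
    peak_tflops=989,
    utilization=0.40,
    price_per_gpu_hour=3.00,
)
print(f"{flops:.1e} FLOPs, {hours:,.0f} GPU-hours, ${cost:,.0f}")
```

Under these assumptions the 7B, 1T-token example comes to roughly 29,500 GPU-hours and a little under $90,000. The result scales inversely with utilization, so at 30% the job takes about a third longer, which is why profiling actual throughput matters before budgeting.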

Congratulations! You have completed the TPU & AI Accelerators course. You now understand the landscape of AI hardware, from TPUs and GPUs to NPUs, and can make informed decisions about hardware selection, benchmarking, and cost optimization.