AI Accelerator Best Practices
Practical guidance for selecting, optimizing, and managing AI hardware for training and inference workloads.
Hardware Selection Guide
| Workload | Recommended Hardware | Reasoning |
|---|---|---|
| LLM training (>10B params) | H100/B200 cluster | Need large memory, fast interconnect, mature ecosystem |
| LLM training (JAX) | TPU v5p pod | Cost-effective for JAX workloads, excellent scaling |
| LLM inference | H100, L4, or Inferentia2 | Depends on latency vs throughput vs cost priorities |
| Computer vision training | A100 or H100 | Mature PyTorch ecosystem, good batch processing |
| On-device inference | Apple Neural Engine, Qualcomm NPU | Power-efficient, privacy-preserving |
| Research/prototyping | RTX 4090 or cloud spot instances | Best price-performance for experimentation |
Cost Optimization
Use Spot/Preemptible Instances
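Spot and preemptible instances are typically 60-90% cheaper than on-demand, but the provider can reclaim them at short notice. Checkpoint regularly so an interrupted job resumes from its last saved step instead of restarting.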
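A minimal sketch of interruption-tolerant checkpointing, using only the Python standard library. The `train` loop, the JSON state format, and the `stop_at` parameter (which simulates a preemption) are illustrative assumptions. A real job would save model and optimizer state with its framework's own serialization.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=100, stop_at=None):
    """Hypothetical training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated preemption: unsaved progress is lost
        # ... one training step would run here ...
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

The checkpoint interval is a trade-off: shorter intervals lose less work per interruption but spend more time writing state.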
Right-Size Your Hardware
Don't use H100s for tasks that run fine on L4s. Profile your workload to understand actual GPU utilization.
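One way to get a first utilization reading is to sample `nvidia-smi` and parse its CSV output. The sketch below assumes an NVIDIA driver is installed and that every queried field returns a number. The 30% thresholds in `looks_oversized` are hypothetical starting points, not recommendations.

```python
import subprocess

# Query fields and format flags supported by nvidia-smi.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_pct": 100.0 * float(mem_used) / float(mem_total),
        })
    return stats


def sample_gpus():
    """Take one sample from the local machine (requires nvidia-smi)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


def looks_oversized(stats, util_threshold=30.0, mem_threshold=30.0):
    """True if every GPU is below both (assumed) thresholds."""
    return all(
        s["util_pct"] < util_threshold and s["mem_pct"] < mem_threshold
        for s in stats
    )
```

A single sample is noisy. Averaging many samples over a representative run gives a more reliable picture before changing instance types.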
Optimize Before Scaling
Mixed precision, torch.compile, and efficient data loading can double throughput before you add more GPUs.
Use Quantization for Inference
Relative to FP16 weights, INT8 quantization halves model size and INT4 cuts it to a quarter, often with minimal accuracy loss. The smaller footprint lets the model run on cheaper GPUs with less memory.
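The size reduction follows directly from bits per weight. As a rough sketch, the hypothetical helper `weight_memory_gb` below counts weights only. It ignores activations, the KV cache, and the small overhead of quantization scales.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Example: a 7B-parameter model at three precisions.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
# prints:
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```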
Monitor and Auto-Scale
Track GPU utilization and scale down during off-peak hours. Don't pay for idle GPUs.
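A simple scaling rule is proportional: resize the fleet so average utilization moves toward a target. The sketch below mirrors the rule used by the Kubernetes Horizontal Pod Autoscaler. The 70% target and the replica bounds are assumed values to tune for your workload.

```python
import math


def desired_replicas(current_replicas, avg_util_pct, target_util_pct=70,
                     min_replicas=1, max_replicas=16):
    """Proportional scaling: ceil(current * observed / target), then clamp."""
    if current_replicas <= 0:
        raise ValueError("current_replicas must be positive")
    raw = math.ceil(current_replicas * avg_util_pct / target_util_pct)
    return max(min_replicas, min(max_replicas, raw))


# Off-peak: 8 GPUs averaging 35% utilization can shrink to 4.
print(desired_replicas(8, 35))  # prints 4
```

In practice, a cooldown period between scaling decisions helps avoid thrashing when utilization fluctuates.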
Framework Compatibility
- PyTorch: Best on NVIDIA GPUs (CUDA). TPU support via torch_xla, which is less mature than the CUDA path. Growing Apple Silicon support through the MPS backend.
- JAX: First-class support on TPUs and GPUs. XLA compilation for both platforms. Best choice for TPU workloads.
- TensorFlow: Good TPU support, solid NVIDIA GPU support. Less popular for new projects.
- MLX: Apple Silicon only. Great for local LLM inference on Mac.
Frequently Asked Questions
Use GPUs if you need PyTorch, maximum flexibility, or want to run on any cloud. Use TPUs if you use JAX/TensorFlow, need very large-scale training, and are on Google Cloud. GPUs have a larger ecosystem; TPUs can be more cost-effective for specific workloads.
NVIDIA's CUDA ecosystem creates strong lock-in. However, competition is increasing: Google TPUs, AMD MI300X, AWS Trainium, and custom chips from Microsoft and Meta. The trend is toward more hardware diversity, but NVIDIA will likely remain the default choice for most workloads in the near term.
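Use the standard approximation: training compute (FLOPs) ≈ 6 × model_params × training_tokens. Divide by your GPU's sustained throughput (typically 30-50% of peak FLOPS) to get GPU time, then multiply by the hourly price. For a 7B model on 1T tokens: ~4.2e22 FLOPs. An H100 running at 40% of its roughly 990 TFLOPS dense BF16 peak sustains about 4e14 FLOP/s, so the run takes roughly 30,000 H100-hours, or about $90,000 at $3 per GPU-hour.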
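The estimate can be sketched as a small calculator. The inputs in the example are assumptions: 989 TFLOPS is the H100 SXM dense BF16 peak, 40% is a mid-range utilization, and $3 per GPU-hour is an illustrative price, so substitute your own figures.

```python
def training_cost(params, tokens, peak_tflops, utilization, price_per_gpu_hour):
    """Estimate total FLOPs, GPU-hours, and cost from the 6 * N * D rule."""
    total_flops = 6 * params * tokens
    sustained_flops_per_s = peak_tflops * 1e12 * utilization
    gpu_hours = total_flops / sustained_flops_per_s / 3600
    return total_flops, gpu_hours, gpu_hours * price_per_gpu_hour


# Example: 7B parameters, 1T tokens, H100 at 40% of peak, $3 per GPU-hour.
flops, hours, cost = training_cost(7e9, 1e12, 989, 0.40, 3.0)
print(f"{flops:.1e} FLOPs, {hours:,.0f} GPU-hours, ${cost:,.0f}")
```

With these inputs the result is about 4.2e22 FLOPs, roughly 29,500 GPU-hours, and a cost in the high $80,000s. GPU-hours is total work: spreading the run across more GPUs shortens wall-clock time but does not reduce the bill unless utilization changes.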