AI Accelerator Benchmarks
Comparing AI hardware requires understanding benchmark methodologies, real-world performance metrics, and cost-efficiency analysis.
MLPerf
MLPerf is the industry-standard benchmark suite for AI hardware, maintained by MLCommons. It provides apples-to-apples comparisons across hardware platforms:
- MLPerf Training: Measures the time to train a model to a target quality level (workloads have included ResNet-50, BERT, GPT-3, and Stable Diffusion)
- MLPerf Inference: Measures throughput and latency for serving models
- Divisions: Closed (fixed model and rules, so submissions are directly comparable) and Open (submitters may change the model or training procedure to showcase new optimizations)
Key Performance Metrics
| Metric | Description | When It Matters |
|---|---|---|
| TFLOPS | Peak theoretical throughput in trillions of floating-point operations per second, quoted for a specific precision (e.g. FP32, FP16, FP8) | Raw compute capability comparison |
| Tokens/second | LLM inference throughput | Chat applications, batch processing |
| Time-to-train | Wall-clock time to reach target accuracy | Training efficiency comparison |
| Latency (P50/P99) | Time to complete a single inference request, reported as the median (P50) and 99th-percentile (P99) values | Real-time applications, user experience |
| $/TFLOPS | Cost per unit of compute | Budget-constrained decisions |
| TFLOPS/watt | Energy efficiency | Data center power constraints, edge deployments |
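To make the latency and throughput rows concrete, here is a minimal sketch of how P50/P99 and tokens/second could be computed from raw measurements. The helper names (`percentile`, `summarize`) and the sample numbers are illustrative assumptions, not output from any real system; the percentile uses the simple nearest-rank definition, and other tools may interpolate instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms, tokens_generated, wall_seconds):
    """Summarize a benchmark run into the metrics from the table."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "tokens_per_second": tokens_generated / wall_seconds,
    }


# Hypothetical per-request latencies (ms); one slow outlier dominates the tail.
latencies_ms = [12, 15, 11, 14, 13, 12, 16, 250, 13, 12]
stats = summarize(latencies_ms, tokens_generated=5000, wall_seconds=4.0)
print(stats)
```

In this made-up sample the median stays close to the typical request, while P99 is set entirely by the single slow request, which is why tail latency is usually reported alongside the median.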
Throughput vs Latency
These two metrics often conflict:
- High throughput: Process as many requests as possible (batch processing, offline analysis). Favor large batches, high GPU utilization.
- Low latency: Respond to each request as fast as possible (chatbots, real-time systems). Favor small batches, fast individual inference.
- Trade-off: Larger batch sizes improve throughput but increase latency for individual requests. Continuous batching (used in vLLM, TGI) helps balance both.
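The trade-off can be illustrated with a toy cost model. This is a sketch under a stated assumption: each batch pays a fixed overhead plus a constant per-request cost. The function `batch_stats` and its default values (`fixed_ms`, `per_request_ms`) are hypothetical and do not describe any particular accelerator or serving framework.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_request_ms=2.0):
    """Toy model: batch time = fixed overhead + per-request cost * batch size.

    Returns (latency in ms seen by every request in the batch, throughput in requests/s).
    """
    batch_time_ms = fixed_ms + per_request_ms * batch_size
    throughput_rps = batch_size / (batch_time_ms / 1000.0)
    return batch_time_ms, throughput_rps


for bs in (1, 8, 32, 128):
    latency_ms, rps = batch_stats(bs)
    print(f"batch={bs:>3}  latency={latency_ms:6.1f} ms  throughput={rps:7.1f} req/s")
```

Under this model the fixed overhead is amortized over more requests as the batch grows, so throughput rises, but every request in the batch waits for the whole batch to finish, so latency rises too.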
Cost Analysis
Raw performance is only part of the picture. Total cost of ownership matters:
| Factor | GPU (H100) | TPU (v5e) | Inference GPU (L4) |
|---|---|---|---|
| Cloud hourly rate (approximate, per accelerator; varies by provider and region) | ~$3.00/hr | ~$1.20/hr | ~$0.70/hr |
| Best for | General training & inference | Large-scale JAX training | Cost-efficient inference |
| Ecosystem | Universal (PyTorch, TF, JAX) | JAX/TF optimized (PyTorch via PyTorch/XLA) | Universal |
| Availability | All major clouds | Google Cloud only | Several major clouds |
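An hourly rate only becomes comparable once it is divided by the work done in that hour. The sketch below converts an hourly rate and a measured throughput into cost per million tokens. The hourly rates are the approximate figures from the table; the tokens/second values are placeholders invented for illustration and must be replaced with measurements from your own workload. The function name `cost_per_million_tokens` is likewise an assumption.

```python
def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    """Cost in USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# (hourly rate from the table, PLACEHOLDER tokens/second -- not a measurement)
candidates = {
    "H100": (3.00, 3000.0),
    "TPU v5e": (1.20, 1000.0),
    "L4": (0.70, 400.0),
}

for name, (rate, tps) in candidates.items():
    cost = cost_per_million_tokens(rate, tps)
    print(f"{name:8s} ${rate:.2f}/hr at {tps:.0f} tok/s -> ${cost:.3f} per 1M tokens")
```

With these placeholder throughputs the most expensive accelerator per hour comes out cheapest per token, which shows why the hourly rate alone can mislead; with different measured throughputs the ranking could just as easily flip.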
Running Your Own Benchmarks
Python - Simple GPU Benchmark
    import torch
    import time

    def benchmark_matmul(size=4096, dtype=torch.float16, warmup=10, runs=100):
        a = torch.randn(size, size, device="cuda", dtype=dtype)
        b = torch.randn(size, size, device="cuda", dtype=dtype)

        # Warmup
        for _ in range(warmup):
            torch.mm(a, b)
        torch.cuda.synchronize()

        # Benchmark
        start = time.perf_counter()
        for _ in range(runs):
            torch.mm(a, b)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        flops = 2 * size**3 * runs
        tflops = flops / elapsed / 1e12
        print(f"Matrix size: {size}, TFLOPS: {tflops:.1f}")

    benchmark_matmul()
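A measured TFLOPS figure is most useful when compared against the vendor's quoted peak for the same precision. The sketch below repeats the FLOP arithmetic from the benchmark (about 2 × size³ operations per square matrix multiply) and derives a utilization ratio. It needs no GPU; the elapsed time and the peak value are hypothetical inputs, and the function names are assumptions made for this example.

```python
def matmul_tflops(size, runs, elapsed_seconds):
    """Achieved TFLOPS for `runs` square matmuls of dimension `size`.

    Each size x size matmul costs roughly 2 * size**3 floating-point operations.
    """
    return 2 * size**3 * runs / elapsed_seconds / 1e12


def utilization(achieved_tflops, peak_tflops):
    """Fraction of the quoted peak that the workload actually reached."""
    return achieved_tflops / peak_tflops


# Hypothetical inputs: substitute your own timing and your hardware's datasheet peak.
achieved = matmul_tflops(size=4096, runs=100, elapsed_seconds=0.05)
hypothetical_peak = 500.0  # placeholder, not a real datasheet value
print(f"Achieved: {achieved:.1f} TFLOPS, "
      f"utilization: {utilization(achieved, hypothetical_peak):.0%}")
```

A large dense matmul is close to a best case for utilization; real models with memory-bound layers and communication overhead typically land lower, so treat this ratio as an upper-end sanity check rather than a prediction.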
Key takeaway: Don't rely on peak TFLOPS alone. Consider real-world throughput, latency, cost per token, energy efficiency, and software ecosystem compatibility. MLPerf provides standardized benchmarks, but always validate with your specific workload.