Advanced

AI Accelerator Benchmarks

Comparing AI hardware requires understanding benchmark methodologies, real-world performance metrics, and cost-efficiency analysis.

MLPerf

MLPerf is the industry-standard benchmark suite for AI hardware, maintained by MLCommons. It provides apples-to-apples comparisons across hardware platforms:

  • MLPerf Training: Measures time to train models to target accuracy (ResNet-50, BERT, GPT-3, Stable Diffusion)
  • MLPerf Inference: Measures throughput and latency for serving models
  • Divisions: Closed (fixed model and training rules, for like-for-like hardware comparison) and Open (more freedom to change the model and apply optimizations)

Key Performance Metrics

Metric            | Description                                            | When It Matters
------------------+--------------------------------------------------------+-----------------------------------------
TFLOPS            | Peak theoretical floating-point operations per second  | Raw compute capability comparison
Tokens/second     | LLM inference throughput                               | Chat applications, batch processing
Time-to-train     | Wall-clock time to reach target accuracy               | Training efficiency comparison
Latency (P50/P99) | Time for a single inference request                    | Real-time applications, user experience
$/TFLOPS          | Cost per unit of compute                               | Budget-constrained decisions
TFLOPS/watt       | Energy efficiency                                      | Data center power constraints, edge
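
To make the latency and throughput rows concrete, here is a minimal sketch of how P50/P99 latency and tokens/second might be derived from raw timing samples. The helper names (measure_latencies, percentile, summarize) and the tokens_per_call parameter are illustrative assumptions, not part of MLPerf or any library; the percentile uses the simple nearest-rank definition.

```python
import math
import time


def measure_latencies(fn, warmup=3, runs=50):
    """Call fn() repeatedly and return per-call latencies in seconds."""
    for _ in range(warmup):
        fn()  # warmup calls are not timed
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies, tokens_per_call):
    """Summarize sequential calls; tokens_per_call is an assumed constant output length."""
    total_time = sum(latencies)
    return {
        "p50_ms": percentile(latencies, 50) * 1000,
        "p99_ms": percentile(latencies, 99) * 1000,
        "tokens_per_second": tokens_per_call * len(latencies) / total_time,
    }


# Stand-in CPU workload; replace the lambda with a real inference call.
samples = measure_latencies(lambda: sum(range(10_000)))
print(summarize(samples, tokens_per_call=128))
```

P99 is usually the number to watch for user-facing systems, because a good median can hide a slow tail. For GPU workloads the timed call also needs a device synchronization before the clock is read, otherwise only the kernel launch is measured.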

Throughput vs Latency

These two metrics often conflict:

  • High throughput: Process as many requests as possible (batch processing, offline analysis). Favor large batches, high GPU utilization.
  • Low latency: Respond to each request as fast as possible (chatbots, real-time systems). Favor small batches, fast individual inference.
  • Trade-off: Larger batch sizes improve throughput but increase latency for individual requests. Continuous batching (used in vLLM, TGI) helps balance both.
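
The trade-off above can be illustrated with a toy cost model. This is only a sketch: it assumes static batching (every request waits for the whole batch to finish), and the fixed overhead and per-request cost are made-up illustration values, not measurements of any real accelerator.

```python
def batch_tradeoff(batch_size, fixed_overhead_ms=20.0, per_request_ms=2.0):
    """Toy model: one batch costs a fixed overhead plus a per-request cost.

    Both constants are hypothetical values chosen for illustration.
    Returns (latency of the batch in ms, throughput in requests/second).
    """
    batch_latency_ms = fixed_overhead_ms + per_request_ms * batch_size
    throughput_rps = batch_size / (batch_latency_ms / 1000.0)
    return batch_latency_ms, throughput_rps


for batch_size in (1, 4, 16, 64):
    latency_ms, rps = batch_tradeoff(batch_size)
    print(f"batch={batch_size:>3}  latency={latency_ms:6.1f} ms  throughput={rps:7.1f} req/s")
```

In this model the fixed overhead is amortized over more requests as the batch grows, so throughput keeps rising while every individual request waits longer. Real systems follow the same shape until compute or memory saturates.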

Cost Analysis

Raw performance is only part of the picture. Total cost of ownership matters:

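Factor            | GPU (H100)                   | TPU (v5e)                | Cloud Inference (L4)
------------------+------------------------------+--------------------------+-------------------------
Cloud hourly rate | ~$3.00/hr                    | ~$1.20/hr                | ~$0.70/hr
Best for          | General training & inference | Large-scale JAX training | Cost-efficient inference
Ecosystem         | Universal (PyTorch, TF, JAX) | JAX/TF optimized         | Universal
Availability      | All clouds                   | Google Cloud only        | All clouds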
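
Hourly rates only become comparable once they are combined with measured throughput. The sketch below converts an hourly rate into cost per million tokens. The rates are the approximate figures from the table; the tokens/second values are hypothetical placeholders that must be replaced with numbers measured on your own model and workload.

```python
# Approximate hourly rates taken from the table above (USD per hour).
HOURLY_RATE_USD = {"H100": 3.00, "TPU v5e": 1.20, "L4": 0.70}

# HYPOTHETICAL throughput figures for illustration only; measure your own.
ASSUMED_TOKENS_PER_SECOND = {"H100": 2500.0, "TPU v5e": 800.0, "L4": 400.0}


def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    """Cost in USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


for name, rate in HOURLY_RATE_USD.items():
    cost = cost_per_million_tokens(rate, ASSUMED_TOKENS_PER_SECOND[name])
    print(f"{name:<8} ${cost:.3f} per 1M tokens")
```

The cheapest hourly rate does not necessarily give the lowest cost per token: a more expensive accelerator wins whenever its throughput advantage is larger than its price premium.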

Running Your Own Benchmarks

Python - Simple GPU Benchmark
import torch
import time

def benchmark_matmul(size=4096, dtype=torch.float16, warmup=10, runs=100):
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)

    # Warmup
    for _ in range(warmup):
        torch.mm(a, b)
    torch.cuda.synchronize()

    # Benchmark
    start = time.perf_counter()
    for _ in range(runs):
        torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

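    # A dense (n x n) @ (n x n) matmul performs about 2 * n^3 floating-point operations
    flops = 2 * size**3 * runs
    tflops = flops / elapsed / 1e12
    print(f"Matrix size: {size}, TFLOPS: {tflops:.1f}")
    return tflops

# The benchmark allocates tensors on "cuda", so only run it when a GPU is present
if torch.cuda.is_available():
    benchmark_matmul()
else:
    print("No CUDA device found; skipping benchmark")

Key takeaway: Don't rely on peak TFLOPS alone. Consider real-world throughput, latency, cost per token, energy efficiency, and software ecosystem compatibility. MLPerf provides standardized benchmarks, but always validate with your specific workload.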