Advanced

AI Accelerator Benchmarks

Comparing AI hardware requires understanding benchmark methodologies, real-world performance metrics, and cost-efficiency analysis.

MLPerf

MLPerf is the industry-standard benchmark suite for AI hardware, maintained by MLCommons. It provides apples-to-apples comparisons across hardware platforms:

MLPerf Training: Measures the time to train a model to a target quality level (benchmarks have included ResNet-50, BERT, GPT-3, and Stable Diffusion)
MLPerf Inference: Measures throughput and latency when serving trained models
Divisions: Closed (fixed model and strict rules, for direct hardware comparison) and Open (submitters may change the model or apply other optimizations)

Key Performance Metrics

Metric	Description	When It Matters
TFLOPS	Trillions of floating-point operations per second, usually quoted as a theoretical peak	Raw compute capability comparison
Tokens/second	LLM inference throughput	Chat applications, batch processing
Time-to-train	Wall-clock time to reach target accuracy	Training efficiency comparison
Latency (P50/P99)	Time to complete a single inference request, reported at the median (P50) and the 99th percentile (P99)	Real-time applications, user experience
$/TFLOPS	Cost per unit of compute	Budget-constrained decisions
TFLOPS/watt	Energy efficiency	Data center power constraints, edge deployments
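As a rough illustration of how several metrics in the table are derived from raw measurements, the sketch below computes P50/P99 latency with the nearest-rank method, along with the cost and energy ratios. The function names and every sample number are invented for illustration; they are not measurements of any real device.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def cost_per_tflops(price_usd, tflops):
    """Dollars per TFLOPS, for whatever price basis you pass in (hourly or purchase)."""
    return price_usd / tflops

def tflops_per_watt(tflops, watts):
    """Energy efficiency as TFLOPS delivered per watt of power draw."""
    return tflops / watts

# Hypothetical per-request latencies in milliseconds (one slow outlier)
latencies_ms = [12, 15, 11, 14, 13, 90, 12, 16, 13, 14]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50 = {p50} ms, P99 = {p99} ms")
```

With a single slow request in the sample, P50 stays close to the typical latency while P99 is dominated by the outlier, which is why tail latency is usually reported separately.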

Throughput vs Latency

These two metrics often conflict:

High throughput: Process as many requests as possible (batch processing, offline analysis). Favor large batches, high GPU utilization.
Low latency: Respond to each request as fast as possible (chatbots, real-time systems). Favor small batches, fast individual inference.
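Trade-off: Larger batch sizes improve throughput but increase latency for individual requests. Continuous batching (used in vLLM, TGI), which adds and removes requests from the running batch at each generation step instead of waiting for the whole batch to finish, helps balance both.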
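The trade-off can be sketched with a toy cost model, assuming each batch pays a fixed overhead plus a constant per-request cost. Both constants (`fixed_ms`, `per_item_ms`) are made-up placeholders, and real accelerators do not scale this linearly, so treat the numbers as a shape rather than a prediction.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Toy model: batch time = fixed overhead + per-request cost * batch size.

    Returns (latency in ms for every request in the batch, throughput in requests/s).
    """
    batch_ms = fixed_ms + per_item_ms * batch_size
    throughput = batch_size / (batch_ms / 1000.0)
    return batch_ms, throughput

for bs in (1, 8, 32, 128):
    latency_ms, rps = batch_stats(bs)
    print(f"batch={bs:>3}  latency={latency_ms:6.1f} ms  throughput={rps:7.1f} req/s")
```

Under these assumptions, throughput keeps rising with batch size because the fixed overhead is amortized over more requests, while every request in the batch waits longer for its result.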

Cost Analysis

Raw performance is only part of the picture. Total cost of ownership matters:

Factor	GPU (H100)	TPU (v5e)	Inference GPU (L4)
Approximate cloud hourly rate (varies by provider and region)	~$3.00/hr	~$1.20/hr	~$0.70/hr
Best for	General training & inference	Large-scale JAX training	Cost-efficient inference
Ecosystem	Universal (PyTorch, TF, JAX)	JAX/TF optimized	Universal
Availability	Most major clouds	Google Cloud only	Multiple clouds
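An hourly rate only becomes comparable across devices once it is combined with measured throughput. A minimal sketch of that conversion follows: the hourly rates are the approximate figures from the table, while the tokens-per-second values are hypothetical placeholders that should be replaced with numbers measured on your own workload.

```python
def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    """Convert an hourly price and a sustained throughput into dollars per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# (hourly rate in USD, tokens/second). The throughput values are placeholders, NOT benchmarks.
scenarios = {
    "H100": (3.00, 2500.0),
    "TPU v5e": (1.20, 900.0),
    "L4": (0.70, 400.0),
}

for name, (rate, tps) in scenarios.items():
    print(f"{name}: ${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")
```

The cheapest device per hour is not necessarily the cheapest per token; the ranking depends entirely on the throughput each device sustains for the model being served.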

Running Your Own Benchmarks

Python - Simple GPU Benchmark

import torch
import time

def benchmark_matmul(size=4096, dtype=torch.float16, warmup=10, runs=100):
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)

    # Warmup
    for _ in range(warmup):
        torch.mm(a, b)
    torch.cuda.synchronize()

    # Benchmark
    start = time.perf_counter()
    for _ in range(runs):
        torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

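    # An N x N matrix multiply performs about 2 * N^3 floating-point
    # operations (N^3 multiplies plus N^3 adds).
    flops = 2 * size**3 * runs
    tflops = flops / elapsed / 1e12
    print(f"Matrix size: {size}, TFLOPS: {tflops:.1f}")
    return tflops

if torch.cuda.is_available():
    benchmark_matmul()
else:
    print("No CUDA device found; this benchmark needs a GPU.")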
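One way to interpret the TFLOPS figure this benchmark prints is to compare it with the vendor's quoted peak for the same data type. The sketch below repeats the FLOP arithmetic in plain Python so it runs without a GPU; the elapsed time and the peak value are hypothetical inputs, and the helper names are not part of PyTorch.

```python
def achieved_tflops(size, runs, elapsed_s):
    """TFLOPS for `runs` square matmuls of dimension `size`, using 2 * N^3 FLOPs each."""
    return 2 * size**3 * runs / elapsed_s / 1e12

def utilization(measured_tflops, peak_tflops):
    """Fraction of the quoted peak that the benchmark actually reached."""
    return measured_tflops / peak_tflops

# Hypothetical inputs: 100 matmuls of size 4096 taking 0.1 s, against a placeholder peak
measured = achieved_tflops(size=4096, runs=100, elapsed_s=0.1)
peak = 400.0  # placeholder; use the datasheet value for your device and dtype
print(f"Measured: {measured:.1f} TFLOPS, utilization: {utilization(measured, peak):.0%}")
```

A large gap between measured and peak throughput on a simple matmul usually suggests the workload is limited by something other than raw compute, such as memory bandwidth or small problem sizes.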


Key takeaway: Don't rely on peak TFLOPS alone. Consider real-world throughput, latency, cost per token, energy efficiency, and software ecosystem compatibility. MLPerf provides standardized benchmarks, but always validate with your specific workload.
