Intermediate

cuDNN — Deep Learning Primitives

cuDNN is NVIDIA's GPU-accelerated library of primitives for deep neural networks. It provides the backend implementations used by PyTorch, TensorFlow, and most other deep learning frameworks.

What cuDNN Provides

Rather than requiring developers to write custom CUDA kernels for every neural network operation, cuDNN provides highly optimized implementations:

  • Convolutions: Forward and backward passes for 2D/3D convolutions with multiple algorithm choices (Winograd, FFT, implicit GEMM)
  • Pooling: Max pooling, average pooling, and adaptive pooling
  • Normalization: Batch normalization, layer normalization, group normalization, and instance normalization
  • Activation functions: ReLU, sigmoid, tanh, GELU, and SiLU with fused operations
  • RNNs: LSTM, GRU, and vanilla RNN with optimized multi-layer implementations
  • Attention: Flash Attention and fused multi-head attention for transformers
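
To make the primitives above concrete, here is a reference implementation of one of them, 2x2 max pooling with stride 2, in pure Python. This is only a sketch of the operation's semantics; cuDNN implements the same computation as a single optimized GPU kernel across batches and channels.

```python
# 2x2 max pooling with stride 2 over a single-channel H x W input,
# given as a list of lists. Each output element is the maximum of a
# non-overlapping 2x2 window of the input.
def max_pool_2x2(x):
    h, w = len(x), len(x[0])
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]

x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
print(max_pool_2x2(x))  # [[6, 8], [14, 16]]
```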

Algorithm Selection

cuDNN offers multiple algorithms for the same operation. The best choice depends on input sizes, filter sizes, and hardware:

Algorithm      | Best For             | Trade-off
Implicit GEMM  | General purpose      | Reliable performance, moderate memory
Winograd       | 3x3 convolutions     | Fastest for small filters, higher numerical error
FFT            | Large filters        | Fast for big filters, high memory usage
Tensor Core    | FP16/BF16 operations | Fastest on modern GPUs, requires aligned shapes
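
The Winograd advantage for 3x3 filters comes down to multiplication counts: the F(2x2, 3x3) variant produces each 2x2 output tile with 16 multiplies instead of the 4 × 9 = 36 a direct convolution needs. A quick back-of-the-envelope comparison (per input-channel/output-channel pair, ignoring the transform overhead):

```python
import math

# Multiplication counts for a 3x3 convolution over an out_h x out_w
# output: direct method vs. Winograd F(2x2, 3x3).
def direct_multiplies(out_h, out_w):
    return out_h * out_w * 9           # 9 multiplies per output element

def winograd_multiplies(out_h, out_w):
    tiles = math.ceil(out_h / 2) * math.ceil(out_w / 2)
    return tiles * 16                  # 16 multiplies per 2x2 output tile

print(direct_multiplies(56, 56))       # 28224
print(winograd_multiplies(56, 56))     # 12544
print(direct_multiplies(56, 56) / winograd_multiplies(56, 56))  # 2.25
```

The 2.25x reduction in multiplies is why Winograd tops the table for small filters; the extra input/output transforms are cheap additions, but they accumulate the numerical error noted above.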

cuDNN in PyTorch

Python - cuDNN Configuration in PyTorch
import torch

# Enable cuDNN auto-tuner: benchmarks multiple
# algorithms and picks the fastest for your input size
torch.backends.cudnn.benchmark = True

# For reproducibility, force deterministic algorithms
# (typically paired with benchmark = False, since the auto-tuner
# may pick different algorithms across runs)
torch.backends.cudnn.deterministic = True

# Check cuDNN version
print(torch.backends.cudnn.version())  # e.g., 8902

# cuDNN is used automatically for supported operations
conv = torch.nn.Conv2d(64, 128, 3, padding=1).cuda()
x = torch.randn(32, 64, 224, 224).cuda()
y = conv(x)  # Uses cuDNN convolution under the hood

Flash Attention

Flash Attention is an attention algorithm, integrated into cuDNN's fused attention engines, that computes exact attention without ever materializing the full N×N attention matrix:

  • Memory: Reduces memory from O(N²) to O(N), enabling much longer sequences
  • Speed: 2-4x faster than standard attention by minimizing HBM reads/writes
  • Integration: Available in PyTorch 2.0+ via torch.nn.functional.scaled_dot_product_attention()
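
The trick that makes the O(N²) → O(N) memory reduction possible is an online (streaming) softmax: by maintaining a running maximum and a rescaled running sum, the normalizer can be computed in one pass over tiles of scores, so no full row of the attention matrix needs to be stored. A pure-Python sketch of this idea on a single row of scores (the real algorithm also rescales an output accumulator tile by tile):

```python
import math

# Online softmax: one streaming pass maintaining a running max m (for
# numerical stability) and a running sum s of exp(x - m), rescaling s
# whenever the max changes. This is the core of Flash Attention's
# memory saving.
def online_softmax(scores):
    m = float("-inf")
    s = 0.0
    for x in scores:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in scores]

def naive_softmax(scores):
    m = max(scores)
    e = [math.exp(x - m) for x in scores]
    total = sum(e)
    return [v / total for v in e]

scores = [0.5, 2.0, -1.0, 3.0]
assert all(abs(a - b) < 1e-12
           for a, b in zip(online_softmax(scores), naive_softmax(scores)))
```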

Kernel Fusion

cuDNN can fuse multiple operations into a single kernel launch, reducing memory traffic and kernel launch overhead:

  • Conv + Bias + ReLU fused into one kernel
  • BatchNorm + ReLU fused together
  • Attention QKV projection + attention + output projection
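
Why fusion helps can be seen even in a toy model (this is an illustration of the concept, not cuDNN's API): the unfused version launches two kernels and writes an intermediate tensor to memory that the next kernel must read back, while the fused version touches each element once.

```python
# Toy bias + ReLU epilogue. "Kernels" are modeled as list passes.
def unfused(xs, bias):
    t1 = [x + bias for x in xs]        # kernel 1: add bias (writes t1)
    return [max(v, 0.0) for v in t1]   # kernel 2: ReLU (re-reads t1)

def fused(xs, bias):
    # one kernel, no intermediate tensor in memory
    return [max(x + bias, 0.0) for x in xs]

xs = [-2.0, -0.5, 1.0, 3.0]
assert unfused(xs, 1.0) == fused(xs, 1.0)  # [0.0, 0.5, 2.0, 4.0]
```

On a GPU, eliminating that intermediate write/read is the dominant saving, since these epilogue operations are memory-bandwidth bound rather than compute bound.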

Key takeaway: cuDNN provides heavily optimized GPU implementations of neural network operations. Enable torch.backends.cudnn.benchmark = True for the best auto-tuned performance, and use Flash Attention for transformer models.