Intermediate
cuDNN — Deep Learning Primitives
cuDNN is NVIDIA's GPU-accelerated library of primitives for deep neural networks. It powers the backend of PyTorch, TensorFlow, and most deep learning frameworks.
What cuDNN Provides
Rather than writing custom CUDA kernels for every neural network operation, cuDNN provides highly optimized implementations:
- Convolutions: Forward and backward passes for 2D/3D convolutions with multiple algorithm choices (Winograd, FFT, implicit GEMM)
- Pooling: Max pooling, average pooling, and adaptive pooling
- Normalization: Batch normalization, layer normalization, group normalization, and instance normalization
- Activation functions: ReLU, sigmoid, tanh, GELU, and SiLU with fused operations
- RNNs: LSTM, GRU, and vanilla RNN with optimized multi-layer implementations
- Attention: Flash Attention and fused multi-head attention for transformers
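On the framework side, these primitives surface as ordinary PyTorch modules; on a CUDA device the calls below dispatch to cuDNN kernels automatically. A minimal CPU-runnable sketch (shapes are illustrative only):

```python
import torch

# Conv, pooling, normalization, and activation primitives, reached
# through PyTorch modules. On a CUDA tensor these run cuDNN kernels.
x = torch.randn(1, 16, 32, 32)            # NCHW input
conv = torch.nn.Conv2d(16, 32, kernel_size=3, padding=1)
pool = torch.nn.MaxPool2d(2)
bn = torch.nn.BatchNorm2d(32)
act = torch.nn.ReLU()

y = act(bn(pool(conv(x))))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```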
Algorithm Selection
cuDNN offers multiple algorithms for the same operation. The best choice depends on input sizes, filter sizes, and hardware:
| Algorithm | Best For | Trade-off |
|---|---|---|
| Implicit GEMM | General purpose | Reliable performance, moderate memory |
| Winograd | 3x3 convolutions | Fastest for small filters, higher numerical error |
| FFT | Large filters | Fast for big filters, high memory usage |
| Tensor Core | FP16/BF16 operations | Fastest on modern GPUs, requires aligned shapes |
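As a sketch of how auto-tuning interacts with this table, the hypothetical helper below (a name introduced here for illustration) times a convolution with benchmark mode enabled: the first call pays the autotuning cost, and steady-state iterations reuse the chosen algorithm. It falls back to CPU when no GPU is present:

```python
import time
import torch

def time_conv(benchmark: bool, iters: int = 10) -> float:
    """Average per-iteration time for a conv layer. With
    benchmark=True, cuDNN tries its candidate algorithms on the
    first (warm-up) call and caches the fastest one."""
    torch.backends.cudnn.benchmark = benchmark
    device = "cuda" if torch.cuda.is_available() else "cpu"
    conv = torch.nn.Conv2d(64, 128, 3, padding=1).to(device)
    x = torch.randn(8, 64, 56, 56, device=device)

    conv(x)  # warm-up pass: triggers autotuning when benchmark=True
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        conv(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"{time_conv(benchmark=True):.6f} s/iter")
```

Note that autotuning pays off only when input shapes are stable across calls; with highly variable shapes, re-benchmarking on every new shape can make benchmark mode a net loss.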
cuDNN in PyTorch
Python - cuDNN Configuration in PyTorch

```python
import torch

# Enable cuDNN auto-tuner: benchmarks multiple
# algorithms and picks the fastest for your input size
torch.backends.cudnn.benchmark = True

# For reproducibility (slower, deterministic algorithms)
torch.backends.cudnn.deterministic = True

# Check cuDNN version
print(torch.backends.cudnn.version())  # e.g., 8902

# cuDNN is used automatically for supported operations
conv = torch.nn.Conv2d(64, 128, 3, padding=1).cuda()
x = torch.randn(32, 64, 224, 224).cuda()
y = conv(x)  # Uses cuDNN convolution under the hood
```
Flash Attention
Flash Attention is a memory-efficient attention algorithm, now integrated into cuDNN, that computes attention without materializing the full N×N attention matrix:
- Memory: Reduces memory from O(N²) to O(N), enabling much longer sequences
- Speed: 2-4x faster than standard attention by minimizing HBM reads/writes
- Integration: Available in PyTorch 2.0+ via torch.nn.functional.scaled_dot_product_attention()
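A minimal sketch of that PyTorch 2.0+ entry point (shapes are illustrative; on CUDA with supported dtypes and shapes, PyTorch selects a Flash Attention kernel behind this call):

```python
import torch
import torch.nn.functional as F

# Inputs in (batch, heads, sequence, head_dim) layout
batch, heads, seq, dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, dim)

# Fused attention: never materializes the seq x seq score matrix
# when a fused backend (e.g. Flash Attention) is available
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```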
Kernel Fusion
cuDNN can fuse multiple operations into a single kernel launch, reducing memory traffic and kernel launch overhead:
- Conv + Bias + ReLU fused into one kernel
- BatchNorm + ReLU fused together
- Attention QKV projection + attention + output projection
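To see why Conv + BatchNorm can collapse into a single operation, here is a hand-rolled inference-time folding sketch. cuDNN's runtime fusion works at the kernel level rather than by rewriting weights, but the underlying algebra is the same:

```python
import torch

# Inference-mode Conv and BatchNorm with non-trivial statistics
conv = torch.nn.Conv2d(8, 16, 3, padding=1, bias=True).eval()
bn = torch.nn.BatchNorm2d(16).eval()
bn.running_mean.uniform_(-1, 1)
bn.running_var.uniform_(0.5, 2.0)

# BN(y) = (y - mean) * scale + beta, with scale = gamma / sqrt(var + eps).
# Since y = W*x + b, fold: W' = W * scale, b' = (b - mean) * scale + beta.
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
fused = torch.nn.Conv2d(8, 16, 3, padding=1, bias=True).eval()
fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
fused.bias.data = (conv.bias - bn.running_mean) * scale + bn.bias

x = torch.randn(1, 8, 10, 10)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```

This weight-level folding is a common deployment optimization; the kernel-level fusions listed above additionally save the memory round-trip between operations.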
Key takeaway: cuDNN provides heavily optimized GPU implementations of neural network operations. Enable torch.backends.cudnn.benchmark = True for the best auto-tuned performance, and use Flash Attention for transformer models.
Lilly Tech Systems