Intermediate
cuDNN — Deep Learning Primitives
cuDNN is NVIDIA's GPU-accelerated library of primitives for deep neural networks. It powers the backend of PyTorch, TensorFlow, and most deep learning frameworks.
What cuDNN Provides
Rather than writing custom CUDA kernels for every neural network operation, cuDNN provides highly optimized implementations:
- Convolutions: Forward and backward passes for 2D/3D convolutions with multiple algorithm choices (Winograd, FFT, implicit GEMM)
- Pooling: Max pooling, average pooling, and adaptive pooling
- Normalization: Batch normalization, layer normalization, group normalization, and instance normalization
- Activation functions: ReLU, sigmoid, tanh, GELU, and SiLU with fused operations
- RNNs: LSTM, GRU, and vanilla RNN with optimized multi-layer implementations
- Attention: Flash Attention and fused multi-head attention for transformers
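On the framework side, these primitives surface as ordinary PyTorch modules; on a CUDA device the calls below dispatch to cuDNN kernels automatically. A minimal CPU-runnable sketch (shapes are illustrative only):

```python
import torch

# Conv, pooling, normalization, and activation primitives, reached
# through PyTorch modules. On a CUDA tensor these run cuDNN kernels.
x = torch.randn(1, 16, 32, 32)            # NCHW input
conv = torch.nn.Conv2d(16, 32, kernel_size=3, padding=1)
pool = torch.nn.MaxPool2d(2)
bn = torch.nn.BatchNorm2d(32)
act = torch.nn.ReLU()

y = act(bn(pool(conv(x))))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```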
Algorithm Selection
cuDNN offers multiple algorithms for the same operation. The best choice depends on input sizes, filter sizes, and hardware:
| Algorithm | Best For | Trade-off |
|---|---|---|
| Implicit GEMM | General purpose | Reliable performance, moderate memory |
| Winograd | 3x3 convolutions | Fastest for small filters, higher numerical error |
| FFT | Large filters | Fast for big filters, high memory usage |
| Tensor Core | FP16/BF16 operations | Fastest on modern GPUs, requires aligned shapes |
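As a sketch of how auto-tuning interacts with this table, the hypothetical helper below (a name introduced here for illustration) times a convolution with benchmark mode enabled: the first call pays the autotuning cost, and steady-state iterations reuse the chosen algorithm. It falls back to CPU when no GPU is present:

```python
import time
import torch

def time_conv(benchmark: bool, iters: int = 10) -> float:
    """Average per-iteration time for a conv layer. With
    benchmark=True, cuDNN tries its candidate algorithms on the
    first (warm-up) call and caches the fastest one."""
    torch.backends.cudnn.benchmark = benchmark
    device = "cuda" if torch.cuda.is_available() else "cpu"
    conv = torch.nn.Conv2d(64, 128, 3, padding=1).to(device)
    x = torch.randn(8, 64, 56, 56, device=device)

    conv(x)  # warm-up pass: triggers autotuning when benchmark=True
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        conv(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"{time_conv(benchmark=True):.6f} s/iter")
```

Note that autotuning pays off only when input shapes are stable across calls; with highly variable shapes, re-benchmarking on every new shape can make benchmark mode a net loss.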
cuDNN in PyTorch
Python - cuDNN Configuration in PyTorch

```python
import torch

# Enable cuDNN auto-tuner: benchmarks multiple
# algorithms and picks the fastest for your input size
torch.backends.cudnn.benchmark = True

# For reproducibility (slower, deterministic algorithms)
torch.backends.cudnn.deterministic = True

# Check cuDNN version
print(torch.backends.cudnn.version())  # e.g., 8902

# cuDNN is used automatically for supported operations
conv = torch.nn.Conv2d(64, 128, 3, padding=1).cuda()
x = torch.randn(32, 64, 224, 224).cuda()
y = conv(x)  # Uses cuDNN convolution under the hood
```
Flash Attention
Flash Attention is a memory-efficient attention algorithm, now integrated into cuDNN, that computes attention without materializing the full N×N attention matrix:
- Memory: Reduces memory from O(N²) to O(N), enabling much longer sequences
- Speed: 2-4x faster than standard attention by minimizing HBM reads/writes
- Integration: Available in PyTorch 2.0+ via torch.nn.functional.scaled_dot_product_attention()
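A minimal sketch of that PyTorch 2.0+ entry point (shapes are illustrative; on CUDA with supported dtypes and shapes, PyTorch selects a Flash Attention kernel behind this call):

```python
import torch
import torch.nn.functional as F

# Inputs in (batch, heads, sequence, head_dim) layout
batch, heads, seq, dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, dim)

# Fused attention: never materializes the seq x seq score matrix
# when a fused backend (e.g. Flash Attention) is available
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```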
Kernel Fusion
cuDNN can fuse multiple operations into a single kernel launch, reducing memory traffic and kernel launch overhead:
- Conv + Bias + ReLU fused into one kernel
- BatchNorm + ReLU fused together
- Attention QKV projection + attention + output projection
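To see why Conv + BatchNorm can collapse into a single operation, here is a hand-rolled inference-time folding sketch. cuDNN's runtime fusion works at the kernel level rather than by rewriting weights, but the underlying algebra is the same:

```python
import torch

# Inference-mode Conv and BatchNorm with non-trivial statistics
conv = torch.nn.Conv2d(8, 16, 3, padding=1, bias=True).eval()
bn = torch.nn.BatchNorm2d(16).eval()
bn.running_mean.uniform_(-1, 1)
bn.running_var.uniform_(0.5, 2.0)

# BN(y) = (y - mean) * scale + beta, with scale = gamma / sqrt(var + eps).
# Since y = W*x + b, fold: W' = W * scale, b' = (b - mean) * scale + beta.
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
fused = torch.nn.Conv2d(8, 16, 3, padding=1, bias=True).eval()
fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
fused.bias.data = (conv.bias - bn.running_mean) * scale + bn.bias

x = torch.randn(1, 8, 10, 10)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```

This weight-level folding is a common deployment optimization; the kernel-level fusions listed above additionally save the memory round-trip between operations.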
Key takeaway: cuDNN provides heavily optimized GPU implementations of neural network operations. Enable torch.backends.cudnn.benchmark = True for the best auto-tuned performance, and use Flash Attention for transformer models.
Lilly Tech Systems