CUDA Basics
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. Understanding its programming model is key to writing efficient GPU code.
The CUDA Programming Model
CUDA extends C/C++ with special keywords for writing functions (kernels) that run on the GPU. The execution model is hierarchical:
- Thread: The smallest unit of execution. Each thread runs the same kernel code but on different data.
- Block: A group of threads (up to 1024) that can share memory and synchronize. Threads in a block execute on the same Streaming Multiprocessor.
- Grid: A collection of blocks. The grid represents the entire kernel launch.
- Warp: A hardware grouping of 32 threads within a block that executes instructions in lockstep (SIMT). Warps are how the hardware schedules threads; they are not declared in code.
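The hierarchy above boils down to index arithmetic: each (block, thread) pair maps to one global element index. A minimal CPU-side sketch (plain Python, not GPU code; the helper name `global_indices` is hypothetical) mirrors the `blockIdx.x * blockDim.x + threadIdx.x` computation a kernel performs:

```python
def global_indices(num_blocks, block_dim):
    """Enumerate the global index each thread in the grid would compute,
    mirroring idx = blockIdx.x * blockDim.x + threadIdx.x."""
    return [b * block_dim + t
            for b in range(num_blocks)    # blockIdx.x
            for t in range(block_dim)]    # threadIdx.x

# 3 blocks of 4 threads cover global indices 0..11, each exactly once
print(global_indices(3, 4))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Because every thread derives a unique index this way, the same kernel body can process a different element per thread with no explicit loop.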
Your First CUDA Kernel
CUDA C++ - Vector Addition
```cuda
// Kernel: runs on the GPU, called from the CPU
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {              // guard: the last block may have extra threads
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1000000;
    float *d_a, *d_b, *d_c;

    // Allocate GPU memory
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    // (Copying input data in and results out with cudaMemcpy is omitted for brevity)

    // Launch kernel: 256 threads per block, enough blocks to cover all n elements
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();    // kernel launches are asynchronous; wait for completion

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```
Memory Hierarchy
| Memory Type | Scope | Speed | Size |
|---|---|---|---|
| Registers | Per thread | Fastest | ~256 KB per SM |
| Shared Memory | Per block | Very fast (~100x global) | 48-228 KB per SM |
| L2 Cache | All SMs | Fast | 6-50 MB |
| Global Memory (HBM) | All threads | Slowest on GPU | 24-80 GB |
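These sizes constrain how many threads an SM can keep resident. A back-of-the-envelope calculation (plain Python; the ~256 KB register file figure comes from the table above, and actual limits vary by architecture) shows why register usage per thread matters:

```python
# Illustrative register budget: how many registers each thread can use
# before the register file limits how many threads fit on an SM.
REG_FILE_BYTES = 256 * 1024   # ~256 KB register file per SM (from the table)
BYTES_PER_REG = 4             # registers are 32-bit

def max_registers_per_thread(threads_per_sm):
    total_regs = REG_FILE_BYTES // BYTES_PER_REG   # 65,536 registers
    return total_regs // threads_per_sm

# With 2048 resident threads, each thread gets at most 32 registers
print(max_registers_per_thread(2048))  # → 32
```

If a kernel needs more registers per thread than this budget, the hardware runs fewer threads concurrently (lower occupancy) or spills registers to slower memory.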
Memory Coalescing
When threads in a warp access consecutive memory addresses, the GPU can combine these into a single memory transaction. This coalesced access pattern is critical for performance:
- Good: Thread 0 reads element 0, thread 1 reads element 1, etc. (sequential access)
- Bad: Thread 0 reads element 0, thread 1 reads element 1000 (strided access — many separate transactions)
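A toy model makes the transaction count concrete. Assuming 128-byte memory transactions and 4-byte floats (typical figures, though real hardware details vary), we can count how many distinct 128-byte segments one warp's 32 loads touch:

```python
# Toy coalescing model: count distinct 128-byte segments touched by a warp.
def transactions(addresses, segment=128):
    """Number of memory transactions needed for the given byte addresses."""
    return len({addr // segment for addr in addresses})

FLOAT = 4                                       # sizeof(float)
warp = range(32)                                # 32 threads in a warp
coalesced = [i * FLOAT for i in warp]           # thread i reads element i
strided   = [i * 1000 * FLOAT for i in warp]    # thread i reads element 1000*i

print(transactions(coalesced))  # → 1  (all 32 loads fit in one 128-byte segment)
print(transactions(strided))    # → 32 (a separate transaction per thread)
```

The 32x difference in transactions translates directly into a 32x difference in memory bandwidth consumed for the same useful data.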
CUDA with Python (Numba)
Python - CUDA Kernel with Numba
```python
from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)  # global thread index, equivalent to the C++ idx computation
    if idx < a.size:
        c[idx] = a[idx] + b[idx]

# Host arrays
n = 1_000_000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Launch kernel (Numba transfers host arrays to and from the device automatically)
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, c)
```
Key takeaway: CUDA's thread/block/grid hierarchy maps parallel work to GPU hardware. Understanding memory coalescing, shared memory, and the warp execution model is essential for writing performant GPU kernels.
Lilly Tech Systems