CUDA Basics
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. Understanding its programming model is key to writing efficient GPU code.
The CUDA Programming Model
CUDA extends C/C++ with special keywords for writing functions (kernels) that run on the GPU. The execution model is hierarchical:
- Thread: The smallest unit of execution. Each thread runs the same kernel code but on different data.
- Block: A group of threads (up to 1024) that can share memory and synchronize. Threads in a block execute on the same Streaming Multiprocessor.
- Grid: A collection of blocks. The grid represents the entire kernel launch.
- Warp: A hardware grouping of 32 threads within a block that executes instructions in lockstep (SIMT). Warps are how the hardware schedules threads; they are not declared in code.
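The hierarchy above boils down to index arithmetic: each (block, thread) pair maps to one global element index. A minimal CPU-side sketch (plain Python, not GPU code; the helper name `global_indices` is hypothetical) mirrors the `blockIdx.x * blockDim.x + threadIdx.x` computation a kernel performs:

```python
def global_indices(num_blocks, block_dim):
    """Enumerate the global index each thread in the grid would compute,
    mirroring idx = blockIdx.x * blockDim.x + threadIdx.x."""
    return [b * block_dim + t
            for b in range(num_blocks)    # blockIdx.x
            for t in range(block_dim)]    # threadIdx.x

# 3 blocks of 4 threads cover global indices 0..11, each exactly once
print(global_indices(3, 4))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Because every thread derives a unique index this way, the same kernel body can process a different element per thread with no explicit loop.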
Your First CUDA Kernel
CUDA C++ - Vector Addition
```cuda
// Kernel: runs on the GPU, called from the CPU
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {              // guard: the last block may have extra threads
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1000000;
    float *d_a, *d_b, *d_c;

    // Allocate GPU memory
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));

    // (Copying input data in and results out with cudaMemcpy is omitted for brevity)

    // Launch kernel: 256 threads per block, enough blocks to cover all n elements
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();    // kernel launches are asynchronous; wait for completion

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```
Memory Hierarchy
| Memory Type | Scope | Speed | Size |
|---|---|---|---|
| Registers | Per thread | Fastest | ~256 KB per SM |
| Shared Memory | Per block | Very fast (~100x global) | 48-228 KB per SM |
| L2 Cache | All SMs | Fast | 6-50 MB |
| Global Memory (HBM) | All threads | Slowest on GPU | 24-80 GB |
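These sizes constrain how many threads an SM can keep resident. A back-of-the-envelope calculation (plain Python; the ~256 KB register file figure comes from the table above, and actual limits vary by architecture) shows why register usage per thread matters:

```python
# Illustrative register budget: how many registers each thread can use
# before the register file limits how many threads fit on an SM.
REG_FILE_BYTES = 256 * 1024   # ~256 KB register file per SM (from the table)
BYTES_PER_REG = 4             # registers are 32-bit

def max_registers_per_thread(threads_per_sm):
    total_regs = REG_FILE_BYTES // BYTES_PER_REG   # 65,536 registers
    return total_regs // threads_per_sm

# With 2048 resident threads, each thread gets at most 32 registers
print(max_registers_per_thread(2048))  # → 32
```

If a kernel needs more registers per thread than this budget, the hardware runs fewer threads concurrently (lower occupancy) or spills registers to slower memory.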
Memory Coalescing
When threads in a warp access consecutive memory addresses, the GPU can combine these into a single memory transaction. This coalesced access pattern is critical for performance:
- Good: Thread 0 reads element 0, thread 1 reads element 1, etc. (sequential access)
- Bad: Thread 0 reads element 0, thread 1 reads element 1000 (strided access — many separate transactions)
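A toy model makes the transaction count concrete. Assuming 128-byte memory transactions and 4-byte floats (typical figures, though real hardware details vary), we can count how many distinct 128-byte segments one warp's 32 loads touch:

```python
# Toy coalescing model: count distinct 128-byte segments touched by a warp.
def transactions(addresses, segment=128):
    """Number of memory transactions needed for the given byte addresses."""
    return len({addr // segment for addr in addresses})

FLOAT = 4                                       # sizeof(float)
warp = range(32)                                # 32 threads in a warp
coalesced = [i * FLOAT for i in warp]           # thread i reads element i
strided   = [i * 1000 * FLOAT for i in warp]    # thread i reads element 1000*i

print(transactions(coalesced))  # → 1  (all 32 loads fit in one 128-byte segment)
print(transactions(strided))    # → 32 (a separate transaction per thread)
```

The 32x difference in transactions translates directly into a 32x difference in memory bandwidth consumed for the same useful data.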
CUDA with Python (Numba)
Python - CUDA Kernel with Numba
```python
from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)  # global thread index, equivalent to the C++ idx computation
    if idx < a.size:
        c[idx] = a[idx] + b[idx]

# Host arrays
n = 1_000_000
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

# Launch kernel (Numba transfers host arrays to and from the device automatically)
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, c)
```
Key takeaway: CUDA's thread/block/grid hierarchy maps parallel work to GPU hardware. Understanding memory coalescing, shared memory, and the warp execution model is essential for writing performant GPU kernels.
Lilly Tech Systems