NPU Architecture
Neural Processing Units are specialized processors optimized for the mathematical operations in neural networks. They achieve orders-of-magnitude better performance per watt than CPUs by focusing silicon on what AI actually needs: matrix math and data movement.
Core Building Block: The MAC Unit
At the heart of every NPU is the Multiply-Accumulate (MAC) unit. Neural networks fundamentally compute:
output = sum(weight[i] * input[i]) + bias
A MAC unit performs one multiply-and-add in a single clock cycle. NPUs pack thousands to millions of MAC units operating in parallel. The more MACs, the more operations per second (measured in TOPS — Tera Operations Per Second).
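Both ideas above can be sketched in a few lines of Python (function names are mine, not from any NPU SDK): a dot product is just a chain of MACs, and peak TOPS follows directly from the MAC count and clock rate, counting each MAC as two operations (one multiply, one add).

```python
# A MAC unit computes acc += w * x each cycle; a dot product is a
# chain of such operations. Hardware runs thousands of them in parallel.
def dot_mac(weights, inputs, bias=0.0):
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x          # one multiply-accumulate per "cycle"
    return acc

# Peak throughput: each MAC counts as 2 ops (multiply + add) per cycle.
def peak_tops(num_macs, clock_ghz):
    return num_macs * 2 * clock_ghz * 1e9 / 1e12

# Illustrative figures: 16,384 MACs at 1.5 GHz ~ 49 TOPS peak.
```

Real chips rarely sustain this peak; the memory hierarchy discussed below determines how close they get.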
Systolic Arrays
The most common NPU architecture is the systolic array, popularized by Google's TPU. In a systolic array:
- Processing elements (PEs) are arranged in a 2D grid (e.g., 256x256)
- Data flows rhythmically through the grid like a heartbeat (hence "systolic")
- Weights are pre-loaded into the PEs, and input data flows across rows
- Partial sums accumulate as data moves down columns
- Result: An entire matrix multiplication completes as data flows through the array
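The flow above can be modeled in Python as a simple sequential sketch (the function name is mine, and this is an illustration, not a cycle-accurate simulator): weights are held in a grid of PEs, each input row streams across, and partial sums accumulate down the columns.

```python
import numpy as np

# Illustrative model of a weight-stationary systolic array. Real
# hardware skews the input rows in time so every PE fires on every
# cycle; here the PEs are simply visited in order.
def systolic_matmul(X, W):
    m, k = X.shape
    k2, n = W.shape
    assert k == k2, "inner dimensions must match"
    Y = np.zeros((m, n))
    for i in range(m):                  # each row of X streams across the grid
        psum = np.zeros(n)              # partial sums flowing down the columns
        for r in range(k):              # PE row r holds the weights W[r, :]
            psum += X[i, r] * W[r, :]   # one MAC per PE as data passes through
        Y[i] = psum                     # results emerge from the bottom edge
    return Y
```

The payoff of the real hardware version is that, once the pipeline is full, a k-by-n array completes k*n MACs every cycle with no instruction fetch or register-file traffic between them.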
Memory Hierarchy
| Level | Size | Speed | Purpose |
|---|---|---|---|
| Registers | Bytes | 1 cycle | Immediate operands in MAC units |
| SRAM (on-chip) | MB | 1-5 cycles | Active weights, activations, intermediate results |
| HBM (off-chip) | GB | 100+ cycles | Model weights, large tensors |
| System memory | TB | 1000+ cycles | Full model storage, datasets |
NPU design is largely about managing this hierarchy: keeping the MAC units fed with data fast enough that they never idle waiting for memory.
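A quick way to reason about "keeping the MACs fed" is a roofline-style estimate (the figures and function names below are illustrative assumptions, not specs for any particular NPU): achievable throughput is the lesser of peak compute and bandwidth times arithmetic intensity.

```python
# Roofline sketch: attainable TOPS is capped either by the MAC array
# or by how fast memory can deliver bytes.
def attainable_tops(peak_tops, bandwidth_gbs, intensity_ops_per_byte):
    memory_bound_tops = bandwidth_gbs * intensity_ops_per_byte / 1000.0
    return min(peak_tops, memory_bound_tops)

# A matmul of (m,k) @ (k,n) performs 2*m*k*n ops over roughly
# (m*k + k*n + m*n) bytes of traffic, so intensity grows with size:
def matmul_intensity(m, k, n, bytes_per_elem=1):
    ops = 2 * m * k * n
    traffic = (m * k + k * n + m * n) * bytes_per_elem
    return ops / traffic
```

This is why large matrix multiplies run near peak while small or skinny ones are memory-bound, and why on-chip SRAM reuse matters so much.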
Dataflow Architectures
Different NPUs use different dataflow strategies to minimize data movement:
- Weight-stationary: Weights stay fixed in PEs; inputs and outputs flow through. Used when the same weights are applied to many inputs (inference)
- Output-stationary: Partial sums stay in PEs; weights and inputs flow in. Reduces write-back traffic
- Input-stationary: Inputs stay fixed; weights flow through. Useful for specific layer types
- Row-stationary: Jointly optimizes reuse of inputs, weights, and partial sums within each PE, rather than privileging one operand. Used in MIT's Eyeriss architecture
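One way to see these dataflows is as different orderings of the same matrix-multiply loop nest: the operand read or written by the innermost loop is the one that stays "stationary" in the PE. A sketch with plain loops (illustrative only; real dataflows also tile and pipeline):

```python
# Both functions compute Y[m][n] = sum_k X[m][k] * W[k][n]; only the
# loop order differs, which changes what a PE keeps in its registers.

def output_stationary(X, W, M, K, N):
    Y = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0                      # partial sum stays in the PE
            for k in range(K):
                acc += X[m][k] * W[k][n]   # weights and inputs flow in
            Y[m][n] = acc                  # written back exactly once
    return Y

def weight_stationary(X, W, M, K, N):
    Y = [[0.0] * N for _ in range(M)]
    for k in range(K):
        for n in range(N):
            w = W[k][n]                    # weight loaded once, stays put
            for m in range(M):
                Y[m][n] += w * X[m][k]     # reused across every input row
    return Y
```

Output-stationary writes each result once but re-reads weights; weight-stationary reads each weight once but streams partial sums. The "best" choice depends on layer shape and on which traffic the memory hierarchy handles most cheaply.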
Real-World NPU Examples
| NPU | Platform | TOPS | Key Feature |
|---|---|---|---|
| Apple Neural Engine | iPhone, Mac (M-series) | 38 (M4) | Tight integration with CPU/GPU |
| Qualcomm Hexagon | Snapdragon smartphones | 45 (Gen 4) | Heterogeneous compute |
| Intel NPU | Meteor Lake laptops | 11 | Always-on AI tasks |
| Samsung NPU | Exynos SoCs | 34 | On-device generative AI |
Reduced Precision
NPUs gain significant performance by using lower-precision arithmetic. Neural networks are remarkably tolerant of reduced precision:
- FP32 (32-bit float): Full precision, used for training reference
- BF16 (Brain Float 16): Same dynamic range as FP32 (8-bit exponent) but a much shorter mantissa. The de facto standard for training
- FP16 (16-bit float): Half the memory and compute of FP32
- INT8 (8-bit integer): 4x throughput vs FP32. Standard for inference with quantization
- INT4 (4-bit integer): 8x throughput. Used for compressed LLM inference
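To make INT8 inference concrete, here is a sketch of symmetric per-tensor quantization, one common scheme (the function names are mine, and real NPU toolchains offer many variants such as per-channel scales and asymmetric zero points):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8: choose a scale so the largest
    magnitude maps to 127, then round and clip."""
    scale = float(np.abs(x).max()) / 127.0
    scale = max(scale, 1e-12)   # guard against an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The round trip introduces at most half a quantization step of error per element, which well-conditioned networks typically absorb with little accuracy loss; that tolerance is exactly what the 4x (INT8) and 8x (INT4) throughput gains rely on.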