NPU Architecture
Neural Processing Units are specialized processors optimized for the mathematical operations in neural networks. They achieve orders-of-magnitude better performance per watt than CPUs by focusing silicon on what AI actually needs: matrix math and data movement.
Core Building Block: The MAC Unit
At the heart of every NPU is the Multiply-Accumulate (MAC) unit. Neural networks fundamentally compute:
output = sum(weight[i] * input[i]) + bias
A MAC unit performs one multiply-and-add in a single clock cycle. NPUs pack thousands to millions of MAC units operating in parallel. The more MACs, the more operations per second (measured in TOPS — Tera Operations Per Second).
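Both ideas above can be sketched in a few lines of Python (function names are mine, not from any NPU SDK): a dot product is just a chain of MACs, and peak TOPS follows directly from the MAC count and clock rate, counting each MAC as two operations (one multiply, one add).

```python
# A MAC unit computes acc += w * x each cycle; a dot product is a
# chain of such operations. Hardware runs thousands of them in parallel.
def dot_mac(weights, inputs, bias=0.0):
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x          # one multiply-accumulate per "cycle"
    return acc

# Peak throughput: each MAC counts as 2 ops (multiply + add) per cycle.
def peak_tops(num_macs, clock_ghz):
    return num_macs * 2 * clock_ghz * 1e9 / 1e12

# Illustrative figures: 16,384 MACs at 1.5 GHz ~ 49 TOPS peak.
```

Real chips rarely sustain this peak; the memory hierarchy discussed below determines how close they get.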
Systolic Arrays
The most common NPU architecture is the systolic array, popularized by Google's TPU. In a systolic array:
- Processing elements (PEs) are arranged in a 2D grid (e.g., 256x256)
- Data flows rhythmically through the grid like a heartbeat (hence "systolic")
- Weights are pre-loaded into the PEs, and input data flows across rows
- Partial sums accumulate as data moves down columns
- Result: An entire matrix multiplication completes as data flows through the array
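The flow above can be modeled in Python as a simple sequential sketch (the function name is mine, and this is an illustration, not a cycle-accurate simulator): weights are held in a grid of PEs, each input row streams across, and partial sums accumulate down the columns.

```python
import numpy as np

# Illustrative model of a weight-stationary systolic array. Real
# hardware skews the input rows in time so every PE fires on every
# cycle; here the PEs are simply visited in order.
def systolic_matmul(X, W):
    m, k = X.shape
    k2, n = W.shape
    assert k == k2, "inner dimensions must match"
    Y = np.zeros((m, n))
    for i in range(m):                  # each row of X streams across the grid
        psum = np.zeros(n)              # partial sums flowing down the columns
        for r in range(k):              # PE row r holds the weights W[r, :]
            psum += X[i, r] * W[r, :]   # one MAC per PE as data passes through
        Y[i] = psum                     # results emerge from the bottom edge
    return Y
```

The payoff of the real hardware version is that, once the pipeline is full, a k-by-n array completes k*n MACs every cycle with no instruction fetch or register-file traffic between them.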
Memory Hierarchy
| Level | Size | Speed | Purpose |
|---|---|---|---|
| Registers | Bytes | 1 cycle | Immediate operands in MAC units |
| SRAM (on-chip) | MB | 1-5 cycles | Active weights, activations, intermediate results |
| HBM (off-chip) | GB | 100+ cycles | Model weights, large tensors |
| System memory | TB | 1000+ cycles | Full model storage, datasets |
NPU design is largely about managing this hierarchy: keeping the MAC units fed with data fast enough that they never idle waiting for memory.
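A quick way to reason about "keeping the MACs fed" is a roofline-style estimate (the figures and function names below are illustrative assumptions, not specs for any particular NPU): achievable throughput is the lesser of peak compute and bandwidth times arithmetic intensity.

```python
# Roofline sketch: attainable TOPS is capped either by the MAC array
# or by how fast memory can deliver bytes.
def attainable_tops(peak_tops, bandwidth_gbs, intensity_ops_per_byte):
    memory_bound_tops = bandwidth_gbs * intensity_ops_per_byte / 1000.0
    return min(peak_tops, memory_bound_tops)

# A matmul of (m,k) @ (k,n) performs 2*m*k*n ops over roughly
# (m*k + k*n + m*n) bytes of traffic, so intensity grows with size:
def matmul_intensity(m, k, n, bytes_per_elem=1):
    ops = 2 * m * k * n
    traffic = (m * k + k * n + m * n) * bytes_per_elem
    return ops / traffic
```

This is why large matrix multiplies run near peak while small or skinny ones are memory-bound, and why on-chip SRAM reuse matters so much.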
Dataflow Architectures
Different NPUs use different dataflow strategies to minimize data movement:
- Weight-stationary: Weights stay fixed in PEs; inputs and outputs flow through. Used when the same weights are applied to many inputs (inference)
- Output-stationary: Partial sums stay in PEs; weights and inputs flow in. Reduces write-back traffic
- Input-stationary: Inputs stay fixed; weights flow through. Useful for specific layer types
- Row-stationary: Jointly optimizes reuse of inputs, weights, and partial sums within each PE, rather than privileging one operand. Used in MIT's Eyeriss architecture
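One way to see these dataflows is as different orderings of the same matrix-multiply loop nest: the operand read or written by the innermost loop is the one that stays "stationary" in the PE. A sketch with plain loops (illustrative only; real dataflows also tile and pipeline):

```python
# Both functions compute Y[m][n] = sum_k X[m][k] * W[k][n]; only the
# loop order differs, which changes what a PE keeps in its registers.

def output_stationary(X, W, M, K, N):
    Y = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0                      # partial sum stays in the PE
            for k in range(K):
                acc += X[m][k] * W[k][n]   # weights and inputs flow in
            Y[m][n] = acc                  # written back exactly once
    return Y

def weight_stationary(X, W, M, K, N):
    Y = [[0.0] * N for _ in range(M)]
    for k in range(K):
        for n in range(N):
            w = W[k][n]                    # weight loaded once, stays put
            for m in range(M):
                Y[m][n] += w * X[m][k]     # reused across every input row
    return Y
```

Output-stationary writes each result once but re-reads weights; weight-stationary reads each weight once but streams partial sums. The "best" choice depends on layer shape and on which traffic the memory hierarchy handles most cheaply.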
Real-World NPU Examples
| NPU | Platform | TOPS | Key Feature |
|---|---|---|---|
| Apple Neural Engine | iPhone, Mac (M-series) | 38 (M4) | Tight integration with CPU/GPU |
| Qualcomm Hexagon | Snapdragon smartphones | 45 (Gen 4) | Heterogeneous compute |
| Intel NPU | Meteor Lake laptops | 11 | Always-on AI tasks |
| Samsung NPU | Exynos SoCs | 34 | On-device generative AI |
Reduced Precision
NPUs gain significant performance by using lower-precision arithmetic. Neural networks are remarkably tolerant of reduced precision:
- FP32 (32-bit float): Full precision, used for training reference
- BF16 (Brain Float 16): Same dynamic range as FP32 (8-bit exponent) but a much shorter mantissa. The de facto standard for training
- FP16 (16-bit float): Half the memory and compute of FP32
- INT8 (8-bit integer): 4x throughput vs FP32. Standard for inference with quantization
- INT4 (4-bit integer): 8x throughput. Used for compressed LLM inference
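To make INT8 inference concrete, here is a sketch of symmetric per-tensor quantization, one common scheme (the function names are mine, and real NPU toolchains offer many variants such as per-channel scales and asymmetric zero points):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8: choose a scale so the largest
    magnitude maps to 127, then round and clip."""
    scale = float(np.abs(x).max()) / 127.0
    scale = max(scale, 1e-12)   # guard against an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The round trip introduces at most half a quantization step of error per element, which well-conditioned networks typically absorb with little accuracy loss; that tolerance is exactly what the 4x (INT8) and 8x (INT4) throughput gains rely on.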