
NPU Architecture

Neural Processing Units are specialized processors optimized for the mathematical operations in neural networks. They achieve orders-of-magnitude better performance per watt than CPUs by focusing silicon on what AI actually needs: matrix math and data movement.

Core Building Block: The MAC Unit

At the heart of every NPU is the Multiply-Accumulate (MAC) unit. Neural networks fundamentally compute:

output = sum(weight[i] * input[i]) + bias

A MAC unit performs one multiply-and-add in a single clock cycle. NPUs pack thousands to hundreds of thousands of MAC units operating in parallel. The more MACs, the more operations per second (measured in TOPS — Tera Operations Per Second).
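The formula above can be sketched in plain Python. This is only a software model of what one MAC unit does; in hardware, each loop iteration is a single-cycle operation, and the illustrative MAC count and clock rate below are assumptions, not any specific chip's figures:

```python
# Software sketch of the multiply-accumulate (MAC) operation.
# Each loop iteration corresponds to one MAC unit firing once.
def mac_dot(weights, inputs, bias=0.0):
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x  # one multiply + one add = one MAC
    return acc

# Peak throughput scales with MAC count and clock: a hypothetical array of
# 1024 MACs at 1 GHz does 2 ops (multiply + add) per MAC per cycle.
peak_tops = 1024 * 2 * 1e9 / 1e12  # ≈ 2 TOPS peak
```

This also shows why vendors quote TOPS as 2 × MACs × clock frequency: each MAC counts as two operations.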

Systolic Arrays

The most common NPU architecture is the systolic array, popularized by Google's TPU. In a systolic array:

  • Processing elements (PEs) are arranged in a 2D grid (e.g., 256x256)
  • Data flows rhythmically through the grid like a heartbeat (hence "systolic")
  • Weights are pre-loaded into the PEs, and input data flows across rows
  • Partial sums accumulate as data moves down columns
  • Result: An entire matrix multiplication completes as data flows through the array
💡 Why systolic? Each data element is read once from memory and reused across many computations as it flows through the array. This dramatically reduces memory bandwidth requirements — the primary bottleneck in AI computation.
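The five steps above can be modeled as a cycle-stepped simulation. This is a toy model of a weight-stationary array (not any vendor's real design): weights sit in the PE grid, inputs shift right one PE per cycle, partial sums shift down one PE per cycle, and finished results drain from the bottom row:

```python
def systolic_matmul(A, W):
    """Toy cycle-level model of a weight-stationary systolic array.

    W[k][n] is pre-loaded into PE (k, n). Rows of A stream left-to-right
    (skewed by one cycle per row of PEs) while partial sums flow down
    each column, so C = A @ W emerges from the bottom edge.
    """
    M, K, N = len(A), len(W), len(W[0])
    C = [[0] * N for _ in range(M)]
    x = [[0] * N for _ in range(K)]  # inputs moving right through each PE row
    p = [[0] * N for _ in range(K)]  # partial sums held in each PE
    for t in range(M + K + N):       # enough cycles to fill and drain the array
        for n in range(N):           # finished sums exit the bottom row
            m = t - K - n            # which output row leaves column n this cycle
            if 0 <= m < M:
                C[m][n] = p[K - 1][n]
        for k in range(K - 1, -1, -1):      # update back-to-front so each PE
            for n in range(N - 1, -1, -1):  # reads its neighbours' old values
                above = p[k - 1][n] if k > 0 else 0
                if n > 0:
                    left = x[k][n - 1]
                else:  # row k's input enters the left edge, skewed by k cycles
                    left = A[t - k][k] if 0 <= t - k < M else 0
                x[k][n] = left
                p[k][n] = above + W[k][n] * left  # one MAC per PE per cycle
    return C
```

Note that each input element is fetched from memory once and then reused by every PE in its row as it shifts along — the reuse the callout above describes.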

Memory Hierarchy

| Level | Size | Speed | Purpose |
|---|---|---|---|
| Registers | Bytes | 1 cycle | Immediate operands in MAC units |
| SRAM (on-chip) | MB | 1-5 cycles | Active weights, activations, intermediate results |
| HBM (off-chip) | GB | 100+ cycles | Model weights, large tensors |
| System memory | TB | 1000+ cycles | Full model storage, datasets |

NPU design is largely about managing this hierarchy: keeping the MAC units fed with data fast enough that they never idle waiting for memory.
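Whether the MAC units stay fed comes down to arithmetic intensity: operations performed per byte fetched. A back-of-envelope roofline check, using illustrative numbers rather than any specific chip's specs, makes the point:

```python
# Roofline sketch with made-up figures: a 16-TOPS NPU fed by 100 GB/s of
# off-chip bandwidth. Achievable throughput is capped by whichever runs
# out first: compute, or (bandwidth x ops-per-byte).
def attainable_tops(peak_tops, mem_bw_gbps, ops_per_byte):
    return min(peak_tops, mem_bw_gbps * ops_per_byte / 1000)

# An N x N INT8 matmul does ~2*N^3 ops on ~3*N^2 bytes of traffic, so
# arithmetic intensity (~2N/3 ops/byte) grows with tile size.
big = (2 * 1024**3) / (3 * 1024**2)   # large tile: ~683 ops/byte
small = (2 * 64**3) / (3 * 64**2)     # small tile: ~43 ops/byte
print(attainable_tops(16, 100, big))    # compute-bound: hits the 16-TOPS peak
print(attainable_tops(16, 100, small))  # memory-bound: MACs sit idle
```

This is why NPUs tile computations to fit in on-chip SRAM: larger tiles raise ops-per-byte and keep the array out of the memory-bound regime.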

Dataflow Architectures

Different NPUs use different dataflow strategies to minimize data movement:

  • Weight-stationary: Weights stay fixed in PEs; inputs and outputs flow through. Used when the same weights are applied to many inputs (inference)
  • Output-stationary: Partial sums stay in PEs; weights and inputs flow in. Reduces write-back traffic
  • Input-stationary: Inputs stay fixed; weights flow through. Useful for specific layer types
  • Row-stationary: Optimizes data reuse across all data types. Used in MIT's Eyeriss architecture
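The "stationary" label names which operand a PE holds in its local register while the others stream past, which corresponds to a loop ordering over C[m][n] += A[m][k] * W[k][n]. A toy sketch of the first two strategies:

```python
# Toy loop-nest view of two dataflows for C[m][n] += A[m][k] * W[k][n].
# Both compute the same result; they differ in which value is fetched
# once and then reused while the others stream by.

def weight_stationary(A, W, M, K, N):
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        for n in range(N):
            w = W[k][n]               # weight fetched once, then held
            for m in range(M):        # every input row reuses the held weight
                C[m][n] += A[m][k] * w
    return C

def output_stationary(A, W, M, K, N):
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0                   # partial sum stays local until complete
            for k in range(K):        # weights and inputs stream past it
                acc += A[m][k] * W[k][n]
            C[m][n] = acc             # written back exactly once
    return C
```

Note how output-stationary writes each C[m][n] once at the end, the reduced write-back traffic mentioned above, while weight-stationary reads each weight once, matching the inference case where one set of weights serves many inputs.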

Real-World NPU Examples

| NPU | Platform | TOPS | Key Feature |
|---|---|---|---|
| Apple Neural Engine | iPhone, Mac (M-series) | 38 (M4) | Tight integration with CPU/GPU |
| Qualcomm Hexagon | Snapdragon smartphones | 45 (Gen 4) | Heterogeneous compute |
| Intel NPU | Meteor Lake laptops | 11 | Always-on AI tasks |
| Samsung NPU | Exynos SoCs | 34 | On-device generative AI |

Reduced Precision

NPUs gain significant performance by using lower-precision arithmetic. Neural networks are remarkably tolerant of reduced precision:

  • FP32 (32-bit float): Full precision, used for training reference
  • BF16 (Brain Float 16): Same dynamic range as FP32 but lower precision. The de facto standard for training
  • FP16 (16-bit float): Half the memory and compute of FP32
  • INT8 (8-bit integer): 4x throughput vs FP32. Standard for inference with quantization
  • INT4 (4-bit integer): 8x throughput. Used for compressed LLM inference
TOPS is not everything: Raw TOPS numbers are useful for comparison but do not tell the full story. Memory bandwidth, software support, compiler quality, and actual model performance matter more than peak theoretical throughput.