Intermediate

Inf2 Instances

Deep dive into Amazon EC2 Inf2 instances powered by Inferentia2 chips — understanding instance sizes, NeuronCore configurations, memory, and how to choose the right instance for your inference workload.

Inf2 Instance Family Overview

Inf2 instances are powered by up to 12 AWS Inferentia2 chips, delivering up to 4x higher throughput and up to 10x lower latency compared to Inf1 instances. They feature NeuronLink chip-to-chip interconnect for running large models across multiple chips.

Instance Types and Specifications

Instance         Inf2 Chips   NeuronCores   Accelerator Memory   vCPUs   System Memory
inf2.xlarge      1            2             32 GB HBM            4       16 GB
inf2.8xlarge     1            2             32 GB HBM            32      128 GB
inf2.24xlarge    6            12            192 GB HBM           96      384 GB
inf2.48xlarge    12           24            384 GB HBM           192     768 GB
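A quick back-of-the-envelope check against the table above shows which instance can hold a given model's weights. The dictionary, function name, and the 2-bytes-per-parameter (BF16) assumption are illustrative, not part of any AWS API:

```python
# Accelerator memory per instance, from the spec table above.
INF2_ACCEL_MEM_GB = {
    "inf2.xlarge": 32,
    "inf2.8xlarge": 32,
    "inf2.24xlarge": 192,
    "inf2.48xlarge": 384,
}

def smallest_fitting_instance(params_billion, bytes_per_param=2, overhead=1.2):
    """Return the smallest Inf2 instance whose accelerator memory holds
    the model weights (BF16 by default) plus ~20% headroom for
    activations and KV cache. Returns None if nothing fits."""
    needed_gb = params_billion * bytes_per_param * overhead
    for name, mem_gb in sorted(INF2_ACCEL_MEM_GB.items(), key=lambda kv: kv[1]):
        if mem_gb >= needed_gb:
            return name
    return None

print(smallest_fitting_instance(7))    # 7B in BF16 (~16.8 GB) -> inf2.xlarge
print(smallest_fitting_instance(70))   # 70B in BF16 (~168 GB) -> inf2.24xlarge
```

This counts weights plus a flat headroom factor only; long sequences, large batches, and throughput targets push real deployments toward the larger instances.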

NeuronCore Architecture

Each Inferentia2 chip contains two NeuronCore-v2 cores. Each NeuronCore combines several engines optimized for different parts of a neural network:

  • Tensor engine: High-performance matrix multiplication for transformer attention and linear layers
  • Vector engine: Element-wise operations, activations, and normalization
  • Scalar engine: Control flow and scalar computations
  • HBM2e memory: 32 GB per chip, shared by both NeuronCores, with high bandwidth for model weights and activations
💡 NeuronLink: On inf2.24xlarge and inf2.48xlarge, all Inferentia2 chips are connected via NeuronLink, enabling tensor parallelism across chips. This allows you to run large models (like Llama 2 70B) that don't fit in a single chip's memory by sharding the model across multiple NeuronCores.
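The sharding arithmetic behind that note fits in a few lines. The 16 GB-per-core figure follows from 32 GB per chip split across two NeuronCores; the function name is ours, not a Neuron SDK API:

```python
# Per-NeuronCore memory: 32 GB HBM per chip / 2 cores = 16 GB per core.
PER_CORE_GB = 16

def shard_gb(params_billion, tp_degree, bytes_per_param=2):
    """Approximate weight memory each NeuronCore holds when a model is
    sharded across tp_degree cores via tensor parallelism (BF16 default)."""
    return params_billion * bytes_per_param / tp_degree

# Llama 2 70B in BF16 is ~140 GB of weights: far too large for one core,
# but under 6 GB per core sharded across all 24 cores of inf2.48xlarge.
print(shard_gb(70, 1))    # 140.0
print(shard_gb(70, 24))   # ~5.83
```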

Choosing the Right Instance


inf2.xlarge

Best for small to medium models (BERT, DistilGPT-2) with moderate throughput requirements. Most cost-effective entry point.

inf2.8xlarge

Same single chip as inf2.xlarge but with more vCPUs and system memory for preprocessing-heavy workloads or serving multiple models.


inf2.24xlarge

6 chips with NeuronLink for large models (Llama 2 13B, GPT-J). Enables tensor parallelism across 12 NeuronCores.


inf2.48xlarge

12 chips for the largest models (Llama 2 70B). Maximum throughput and memory for production LLM serving at scale.

Performance Considerations

  • Batch size: Larger batch sizes improve throughput but increase latency; tune based on your SLA
  • Sequence length: Longer sequences require more memory; consider dynamic batching
  • Tensor parallelism degree: Match the number of NeuronCores to your model size for optimal sharding
  • Data types: BF16 and FP16 are natively supported; FP32 models are automatically cast down
  • Compilation: Models must be compiled with the Neuron compiler before deployment
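The compilation step in the last bullet is typically done with `torch_neuronx.trace`, which runs the Neuron compiler ahead of time against example inputs. The sketch below follows the documented tracing flow but is a hedged example, not a tested deployment recipe: it requires an Inf2 host with the Neuron SDK (torch-neuronx) installed, and the model choice and shapes are placeholders.

```python
# Sketch: ahead-of-time compilation of a PyTorch model for Inferentia2.
# Assumes torch-neuronx and transformers are installed on a Neuron host.
import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Example inputs fix the shapes the compiled graph will accept;
# pad to a constant length so every request matches the traced shape.
example = tokenizer("compile me", padding="max_length", max_length=128,
                    return_tensors="pt")
inputs = (example["input_ids"], example["attention_mask"])

# trace() invokes the Neuron compiler and returns a TorchScript module
# that executes on NeuronCores.
model_neuron = torch_neuronx.trace(model, inputs)
model_neuron.save("bert_neuron.pt")  # reload later with torch.jit.load
```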
Pro tip: Start with the smallest instance that fits your model in memory, then scale up based on throughput requirements. Use neuron-top to monitor NeuronCore utilization and identify if you're compute-bound or memory-bound.