Inf2 Instances
Deep dive into Amazon EC2 Inf2 instances powered by Inferentia2 chips — understanding instance sizes, NeuronCore configurations, memory, and how to choose the right instance for your inference workload.
Inf2 Instance Family Overview
Inf2 instances are powered by up to 12 AWS Inferentia2 chips, delivering up to 4x higher throughput and up to 10x lower latency compared to Inf1 instances. They feature NeuronLink chip-to-chip interconnect for running large models across multiple chips.
Instance Types and Specifications
| Instance | Inf2 Chips | NeuronCores | Accelerator Memory | vCPUs | System Memory |
|---|---|---|---|---|---|
| inf2.xlarge | 1 | 2 | 32 GB HBM | 4 | 16 GB |
| inf2.8xlarge | 1 | 2 | 32 GB HBM | 32 | 128 GB |
| inf2.24xlarge | 6 | 12 | 192 GB HBM | 96 | 384 GB |
| inf2.48xlarge | 12 | 24 | 384 GB HBM | 192 | 768 GB |
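As a first-pass sizing check, BF16 weights take roughly 2 bytes per parameter, and some HBM must stay free for activations and KV cache. The sketch below is a coarse heuristic under that assumption (a flat 30% headroom, which is illustrative rather than an AWS guideline); it picks the smallest instance from the table above whose accelerator memory can hold a given model.

```python
# Coarse check: smallest Inf2 instance whose HBM can hold a model's
# BF16 weights, reserving ~30% headroom for KV cache and activations.
# Heuristic only -- real capacity depends on sequence length, batch
# size, and compiler output.

INF2_HBM_GB = {  # accelerator memory per instance, from the table above
    "inf2.xlarge": 32,
    "inf2.8xlarge": 32,
    "inf2.24xlarge": 192,
    "inf2.48xlarge": 384,
}

def fits(params_billions: float, instance: str,
         bytes_per_param: float = 2.0, headroom: float = 0.30) -> bool:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb <= INF2_HBM_GB[instance] * (1 - headroom)

for size in (7, 13, 70):
    choice = next((i for i in INF2_HBM_GB if fits(size, i)), "none")
    print(f"{size}B model in BF16 -> {choice}")
```

Under this heuristic, a 7B model lands on inf2.xlarge, 13B on inf2.24xlarge, and 70B on inf2.48xlarge, which lines up with the per-instance guidance below.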
NeuronCore Architecture
Each Inferentia2 chip contains two NeuronCore-v2 cores. Each core combines specialized engines optimized for tensor operations:
- Tensor engine: High-performance matrix multiplication for transformer attention and linear layers
- Vector engine: Element-wise operations, activations, and normalization
- Scalar engine: Control flow and scalar computations
- HBM2e memory: 32 GB per chip (shared by both cores) with high bandwidth for model weights and activations
Choosing the Right Instance
inf2.xlarge
Best for small to medium models (BERT, DistilGPT-2) with moderate throughput requirements. Most cost-effective entry point.
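The usual single-chip workflow is to trace and compile the model with torch-neuronx, then save the compiled artifact for serving. A minimal sketch, assuming torch-neuronx and transformers are installed; the model name and sequence length are illustrative:

```python
import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-uncased"  # illustrative small encoder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
model.eval()

# Neuron compiles for static shapes, so pad to a fixed sequence length.
enc = tokenizer("An example sentence.", padding="max_length",
                max_length=128, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# Trace/compile for the NeuronCores, then save the compiled artifact.
neuron_model = torch_neuronx.trace(model, example)
neuron_model.save("bert_neuron.pt")

# Reload and run: inference now executes on the Inferentia2 chip.
restored = torch.jit.load("bert_neuron.pt")
logits = restored(*example)
```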
inf2.8xlarge
Same chip as inf2.xlarge but with more vCPUs and system memory, suited to preprocessing-heavy workloads or serving multiple models on one host.
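When colocating several models on one chip, the Neuron runtime's NEURON_RT_VISIBLE_CORES environment variable restricts a process to specific NeuronCores. A sketch of one-core-per-worker process pinning; the serving scripts named here are hypothetical placeholders:

```python
# Sketch: serve two independent models on the two NeuronCores of a
# single-chip instance by giving each worker process its own core.
# NEURON_RT_VISIBLE_CORES is read by the Neuron runtime at process start.
import os
import subprocess

workers = {
    "0": "serve_model_a.py",  # hypothetical serving scripts
    "1": "serve_model_b.py",
}

procs = []
for core, script in workers.items():
    env = dict(os.environ, NEURON_RT_VISIBLE_CORES=core)
    procs.append(subprocess.Popen(["python", script], env=env))

for p in procs:
    p.wait()
```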
inf2.24xlarge
6 chips with NeuronLink for large models (Llama 2 13B, GPT-J). Enables tensor parallelism across 12 NeuronCores.
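A sketch of tensor-parallel serving with the transformers-neuronx package, sharding a Llama 2 13B checkpoint across all 12 NeuronCores; the checkpoint path is illustrative, and preparing the split checkpoint is omitted here:

```python
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

# Shard the model across all 12 NeuronCores of an inf2.24xlarge.
model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split",  # illustrative path to a pre-split checkpoint
    tp_degree=12,           # tensor-parallel degree = NeuronCore count
    amp="bf16",
    batch_size=1,
)
model.to_neuron()  # compile and load the shards onto the NeuronCores

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
ids = tokenizer("Inferentia2 lets you", return_tensors="pt").input_ids
with torch.inference_mode():
    out = model.sample(ids, sequence_length=256, top_k=50)
print(tokenizer.decode(out[0]))
```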
inf2.48xlarge
12 chips for the largest models (Llama 2 70B). Maximum throughput and memory for production LLM serving at scale.
Performance Considerations
- Batch size: Larger batch sizes improve throughput but increase latency; tune based on your SLA
- Sequence length: Longer sequences require more memory; consider dynamic batching
- Tensor parallelism degree: Match the number of NeuronCores to your model size for optimal sharding
- Data types: BF16 and FP16 are natively supported; FP32 models are automatically cast down
- Compilation: Models must be compiled ahead of time with the Neuron compiler before deployment (see the sketch after this list)
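Several of these knobs come together at compile time: the batch shape is baked into the traced graph, and casting behavior is controlled through neuronx-cc flags passed via compiler_args. A self-contained sketch with a toy model; the flag values shown are assumptions about common neuronx-cc options, so check your SDK version:

```python
import torch
import torch_neuronx

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(256, 512),
            torch.nn.GELU(),
            torch.nn.Linear(512, 8),
        )

    def forward(self, x):
        return self.net(x)

model = TinyMLP().eval()
batch = torch.rand(8, 256)  # batch size 8 is fixed at compile time

# Cast FP32 matmuls down to BF16 during compilation; changing the
# batch size later means recompiling with a new example input.
neuron_model = torch_neuronx.trace(
    model,
    batch,
    compiler_args=["--auto-cast=matmult", "--auto-cast-type=bf16"],
)
```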
Use neuron-top to monitor NeuronCore utilization and identify whether you are compute-bound or memory-bound.