Intermediate

Inf2 Instances

Deep dive into Amazon EC2 Inf2 instances powered by Inferentia2 chips — understanding instance sizes, NeuronCore configurations, memory, and how to choose the right instance for your inference workload.

Inf2 Instance Family Overview

Inf2 instances are powered by up to 12 AWS Inferentia2 chips, delivering up to 4x higher throughput and up to 10x lower latency compared to Inf1 instances. They feature NeuronLink chip-to-chip interconnect for running large models across multiple chips.

Instance Types and Specifications

Instance         Inf2 Chips   NeuronCores   Accelerator Memory   vCPUs   System Memory
inf2.xlarge      1            2             32 GB HBM            4       16 GB
inf2.8xlarge     1            2             32 GB HBM            32      128 GB
inf2.24xlarge    6            12            192 GB HBM           96      384 GB
inf2.48xlarge    12           24            384 GB HBM           192     768 GB
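A quick back-of-the-envelope check against the table above shows which instance can hold a given model's weights. The dictionary, function name, and the 2-bytes-per-parameter (BF16) assumption are illustrative, not part of any AWS API:

```python
# Accelerator memory per instance, from the spec table above.
INF2_ACCEL_MEM_GB = {
    "inf2.xlarge": 32,
    "inf2.8xlarge": 32,
    "inf2.24xlarge": 192,
    "inf2.48xlarge": 384,
}

def smallest_fitting_instance(params_billion, bytes_per_param=2, overhead=1.2):
    """Return the smallest Inf2 instance whose accelerator memory holds
    the model weights (BF16 by default) plus ~20% headroom for
    activations and KV cache. Returns None if nothing fits."""
    needed_gb = params_billion * bytes_per_param * overhead
    for name, mem_gb in sorted(INF2_ACCEL_MEM_GB.items(), key=lambda kv: kv[1]):
        if mem_gb >= needed_gb:
            return name
    return None

print(smallest_fitting_instance(7))    # 7B in BF16 (~16.8 GB) -> inf2.xlarge
print(smallest_fitting_instance(70))   # 70B in BF16 (~168 GB) -> inf2.24xlarge
```

This counts weights plus a flat headroom factor only; long sequences, large batches, and throughput targets push real deployments toward the larger instances.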

NeuronCore Architecture

Each Inferentia2 chip contains two NeuronCore-v2 cores. Each NeuronCore combines several engines optimized for different parts of a neural network:

  • Tensor engine: High-performance matrix multiplication for transformer attention and linear layers
  • Vector engine: Element-wise operations, activations, and normalization
  • Scalar engine: Control flow and scalar computations
  • HBM2e memory: 32 GB per chip, shared by both NeuronCores, with high bandwidth for model weights and activations
💡 NeuronLink: On inf2.24xlarge and inf2.48xlarge, all Inferentia2 chips are connected via NeuronLink, enabling tensor parallelism across chips. This allows you to run large models (like Llama 2 70B) that don't fit in a single chip's memory by sharding the model across multiple NeuronCores.
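The sharding arithmetic behind that note fits in a few lines. The 16 GB-per-core figure follows from 32 GB per chip split across two NeuronCores; the function name is ours, not a Neuron SDK API:

```python
# Per-NeuronCore memory: 32 GB HBM per chip / 2 cores = 16 GB per core.
PER_CORE_GB = 16

def shard_gb(params_billion, tp_degree, bytes_per_param=2):
    """Approximate weight memory each NeuronCore holds when a model is
    sharded across tp_degree cores via tensor parallelism (BF16 default)."""
    return params_billion * bytes_per_param / tp_degree

# Llama 2 70B in BF16 is ~140 GB of weights: far too large for one core,
# but under 6 GB per core sharded across all 24 cores of inf2.48xlarge.
print(shard_gb(70, 1))    # 140.0
print(shard_gb(70, 24))   # ~5.83
```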

Choosing the Right Instance


inf2.xlarge

Best for small to medium models (BERT, DistilGPT-2) with moderate throughput requirements. Most cost-effective entry point.

inf2.8xlarge

Same single chip as inf2.xlarge but with more vCPUs and system memory for preprocessing-heavy workloads or serving multiple models.


inf2.24xlarge

6 chips with NeuronLink for large models (Llama 2 13B, GPT-J). Enables tensor parallelism across 12 NeuronCores.


inf2.48xlarge

12 chips for the largest models (Llama 2 70B). Maximum throughput and memory for production LLM serving at scale.

Performance Considerations

  • Batch size: Larger batch sizes improve throughput but increase latency; tune based on your SLA
  • Sequence length: Longer sequences require more memory; consider dynamic batching
  • Tensor parallelism degree: Match the number of NeuronCores to your model size for optimal sharding
  • Data types: BF16 and FP16 are natively supported; FP32 models are automatically cast down
  • Compilation: Models must be compiled with the Neuron compiler before deployment
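The compilation step in the last bullet is typically done with `torch_neuronx.trace`, which runs the Neuron compiler ahead of time against example inputs. The sketch below follows the documented tracing flow but is a hedged example, not a tested deployment recipe: it requires an Inf2 host with the Neuron SDK (torch-neuronx) installed, and the model choice and shapes are placeholders.

```python
# Sketch: ahead-of-time compilation of a PyTorch model for Inferentia2.
# Assumes torch-neuronx and transformers are installed on a Neuron host.
import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Example inputs fix the shapes the compiled graph will accept;
# pad to a constant length so every request matches the traced shape.
example = tokenizer("compile me", padding="max_length", max_length=128,
                    return_tensors="pt")
inputs = (example["input_ids"], example["attention_mask"])

# trace() invokes the Neuron compiler and returns a TorchScript module
# that executes on NeuronCores.
model_neuron = torch_neuronx.trace(model, inputs)
model_neuron.save("bert_neuron.pt")  # reload later with torch.jit.load
```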
Pro tip: Start with the smallest instance that fits your model in memory, then scale up based on throughput requirements. Use neuron-top to monitor NeuronCore utilization and identify if you're compute-bound or memory-bound.