
AWS Trainium

Explore AWS Trainium chips and Trn1/Trn2 instances purpose-built for high-performance, cost-effective machine learning model training at scale.

What is Trainium?

AWS Trainium is AWS's purpose-built ML training chip (its second custom machine learning chip, following AWS Inferentia), designed to deliver high-performance training at up to 50% lower cost-to-train than comparable GPU-based EC2 instances. Trn1 instances powered by Trainium are optimized for training deep learning models, including large language models and diffusion models.
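
As a quick orientation, here is a minimal sketch of a single training step on one NeuronCore. It assumes the AWS Neuron SDK's torch-neuronx package, which exposes Trainium through the standard torch-xla device API; the model, batch, and hyperparameters are placeholders rather than a recommended recipe.

```python
# Minimal sketch: one training step on a NeuronCore.
# Assumes torch-neuronx / torch-xla from the AWS Neuron SDK is installed
# on a Trn1 instance; the model and data are toy placeholders.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # a NeuronCore, exposed as an XLA device

model = torch.nn.Linear(1024, 10).to(device)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(32, 1024).to(device)     # dummy batch
labels = torch.randint(0, 10, (32,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
xm.mark_step()                                # compile and run the lazily-built XLA graph
print(loss.item())
```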

Trn1 Instance Specifications

Instance         Trainium chips   NeuronCores   Accelerator memory   vCPUs   Networking
trn1.2xlarge     1                2             32 GB HBM            8       Up to 12.5 Gbps
trn1.32xlarge    16               32            512 GB HBM           128     800 Gbps EFA
trn1n.32xlarge   16               32            512 GB HBM           128     1600 Gbps EFA

Key Features

NeuronLink

Ultra-fast chip-to-chip interconnect enabling data and model parallelism across the 16 Trainium chips within a trn1.32xlarge or trn1n.32xlarge instance.


EFA Networking

Elastic Fabric Adapter with up to 1600 Gbps for multi-node distributed training across hundreds of instances.


Mixed Precision

Native support for BF16, FP16, TF32, and FP32 with stochastic rounding for training accuracy.
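
As a rough illustration of the BF16 path, the sketch below casts a toy model and batch to bfloat16 on the XLA device. It assumes the torch-neuronx / torch-xla stack; the exact auto-casting and stochastic-rounding behavior depends on the Neuron SDK version.

```python
# Minimal BF16 sketch on a NeuronCore exposed as an XLA device.
# Assumes torch-neuronx / torch-xla from the AWS Neuron SDK; the model
# and batch are placeholders.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Explicitly cast weights and inputs to BF16. Alternatively, torch-xla's
# XLA_USE_BF16=1 environment variable (set before device creation) can
# auto-cast FP32 compute to BF16 on supported SDK versions.
model = torch.nn.Linear(512, 512).to(torch.bfloat16).to(device)
x = torch.randn(8, 512, dtype=torch.bfloat16, device=device)

out = model(x)      # BF16 matmul runs on the NeuronCore
xm.mark_step()      # flush the lazily-built XLA graph for execution
```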


Large Model Support

512 GB HBM on trn1.32xlarge enables training models with billions of parameters without complex offloading.

Distributed Training on Trainium

Trainium supports multiple parallelism strategies for training large models:

  • Data parallelism: Replicate the model across NeuronCores and split batches using Neuron Distributed Data Parallel (DDP); see the sketch after this list
  • Tensor parallelism: Shard model layers across NeuronCores connected via NeuronLink
  • Pipeline parallelism: Split model stages across groups of NeuronCores for deep models
  • ZeRO-1 optimization: Shard optimizer states across data-parallel workers to reduce memory
  • Multi-node training: Scale beyond a single instance using EFA for inter-node communication
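
The data-parallel case can be made concrete with a short sketch. It assumes the torch-neuronx / torch-xla stack and follows the general shape of the Neuron DDP tutorials; the model, data, loop, and launch command are placeholders, not a definitive recipe.

```python
# Hedged sketch of data-parallel training across NeuronCores using the
# XLA backend for torch.distributed. Typically launched with something
# like:
#   torchrun --nproc_per_node=32 train_ddp.py
# (optionally wrapped with neuron_parallel_compile for ahead-of-time
# graph compilation). Model and data are toy placeholders.
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" process group backend


def main():
    dist.init_process_group("xla")          # one process per NeuronCore
    device = xm.xla_device()

    model = torch.nn.Linear(1024, 10).to(device)
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, gradient_as_bucket_view=True
    )
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(10):                      # toy loop over dummy batches
        inputs = torch.randn(32, 1024).to(device)
        labels = torch.randint(0, 10, (32,)).to(device)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()                      # gradients all-reduced across workers
        optimizer.step()
        xm.mark_step()                       # execute the accumulated XLA graph

    if xm.is_master_ordinal():
        print("final loss:", loss.item())


if __name__ == "__main__":
    main()
```

Tensor parallelism, pipeline parallelism, and ZeRO-1 optimizer sharding are typically layered on top of this pattern through the Neuron SDK's neuronx-distributed library rather than hand-written collectives; which strategy you need depends mainly on whether the model fits in a single NeuronCore's memory.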

Trainium vs GPU Cost Comparison

Metric                 trn1.32xlarge   p4d.24xlarge (A100)   Advantage
On-Demand price/hr     ~$21.50         ~$32.77               ~34% lower
Accelerator memory     512 GB          320 GB                60% more
Network bandwidth      800 Gbps        400 Gbps              2x higher
Training throughput    Comparable      Baseline              Similar performance

Pro tip: Start with trn1.2xlarge for small-scale experiments and model compilation testing. Once your training script works correctly, scale up to trn1.32xlarge or trn1n.32xlarge for production training runs. Use the neuron_parallel_compile utility to pre-compile graphs and reduce first-run overhead.