AWS Trainium
Explore AWS Trainium chips and Trn1/Trn2 instances purpose-built for high-performance, cost-effective machine learning model training at scale.
What is Trainium?
AWS Trainium is AWS's second-generation machine learning chip, purpose-built to deliver high-performance training at up to 50% lower cost-to-train than comparable GPU-based instances. Trn1 instances, powered by Trainium, are optimized for training deep learning models, including large language models and diffusion models.
Trn1 Instance Specifications
| Instance | Trainium Chips | NeuronCores | Accelerator Memory | vCPUs | Networking |
|---|---|---|---|---|---|
| trn1.2xlarge | 1 | 2 | 32 GB HBM | 8 | Up to 12.5 Gbps |
| trn1.32xlarge | 16 | 32 | 512 GB HBM | 128 | 800 Gbps EFA |
| trn1n.32xlarge | 16 | 32 | 512 GB HBM | 128 | 1600 Gbps EFA |
Key Features
NeuronLink
Ultra-fast chip-to-chip interconnect enabling data and model parallelism across all 16 Trainium chips within an instance.
EFA Networking
Elastic Fabric Adapter with up to 1600 Gbps for multi-node distributed training across hundreds of instances.
Mixed Precision
Native support for BF16, FP16, TF32, and FP32 with stochastic rounding for training accuracy.
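Stochastic rounding is what keeps low-precision accumulation unbiased: rather than always rounding to the nearest representable value, the hardware rounds up with probability proportional to the remainder, so the rounded result equals the true value in expectation. A pure-Python sketch of the idea (a conceptual illustration, not the hardware implementation):

```python
import random

def stochastic_round(x: float, step: float = 1.0) -> float:
    """Round x to a multiple of `step`, rounding up with probability
    proportional to the fractional remainder. In expectation the result
    equals x, which is why many small low-precision updates do not get
    systematically lost the way they can with round-to-nearest."""
    lower = (x // step) * step
    frac = (x - lower) / step
    return lower + step if random.random() < frac else lower

# Averaging many stochastic roundings converges to the true value;
# round-to-nearest would map 0.3 to 0.0 every time.
random.seed(0)
samples = [stochastic_round(0.3) for _ in range(10_000)]
mean = sum(samples) / len(samples)
```

With `step=1.0`, each sample is 0.0 or 1.0, but the mean approaches 0.3.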
Large Model Support
512 GB HBM on trn1.32xlarge enables training models with billions of parameters without complex offloading.
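A rough back-of-the-envelope check makes this concrete. The 16-bytes-per-parameter figure below is an assumption for a common mixed-precision Adam setup (BF16 weights and gradients, plus FP32 master weights and two FP32 Adam moments); activation memory, which depends on batch size and checkpointing, is ignored:

```python
def training_memory_gb(params_billions: float,
                       bytes_per_param: int = 16) -> float:
    """Approximate memory for weights + gradients + Adam optimizer state.
    Assumes BF16 weights/grads (2 + 2 bytes) plus FP32 master weights and
    two Adam moments (4 + 4 + 4 bytes) = 16 bytes per parameter.
    Activations are excluded."""
    return params_billions * 1e9 * bytes_per_param / 2**30

# A 20B-parameter model needs roughly 298 GB of persistent training
# state, which fits in trn1.32xlarge's 512 GB without offloading.
mem_20b = training_memory_gb(20)
```

By the same arithmetic, a 320 GB accelerator would already be at its limit for such a model before accounting for activations.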
Distributed Training on Trainium
Trainium supports multiple parallelism strategies for training large models:
- Data parallelism: Replicate the model across NeuronCores and split batches using Neuron Distributed Data Parallel (DDP)
- Tensor parallelism: Shard model layers across NeuronCores connected via NeuronLink
- Pipeline parallelism: Split model stages across groups of NeuronCores for deep models
- ZeRO-1 optimization: Shard optimizer states across data-parallel workers to reduce memory
- Multi-node training: Scale beyond a single instance using EFA for inter-node communication
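The ZeRO-1 idea above can be sketched in a few lines: partition the parameter index space so each data-parallel worker owns the optimizer state for only its slice. This is a conceptual sketch of the partitioning scheme, not the Neuron SDK API:

```python
def shard_optimizer_state(num_params: int, world_size: int) -> list[range]:
    """Partition parameter indices so each data-parallel worker holds the
    optimizer state (e.g. Adam moments) for ~1/world_size of the model,
    the core memory-saving idea behind ZeRO stage 1."""
    per_rank = (num_params + world_size - 1) // world_size  # ceil division
    return [range(r * per_rank, min((r + 1) * per_rank, num_params))
            for r in range(world_size)]

# 10 parameters across 4 workers -> shards of size 3, 3, 3, 1.
shards = shard_optimizer_state(num_params=10, world_size=4)
```

In a real implementation, each worker updates only its shard and an all-gather then redistributes the updated weights, cutting per-worker optimizer memory by roughly the data-parallel degree.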
Trainium vs GPU Cost Comparison
| Metric | trn1.32xlarge | p4d.24xlarge (A100) | Advantage |
|---|---|---|---|
| On-Demand price/hr | ~$21.50 | ~$32.77 | 34% lower |
| Accelerator memory | 512 GB | 320 GB | 60% more |
| Network bandwidth | 800 Gbps | 400 Gbps | 2x higher |
| Training throughput | Comparable | Baseline | Similar perf |
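The price advantage in the table is simple arithmetic; a small helper makes the comparison explicit. The prices are the on-demand rates from the table, which vary by region and should be verified against current pricing:

```python
def pct_savings(price_a: float, price_b: float) -> float:
    """Percentage saved by choosing price_a over price_b."""
    return (1 - price_a / price_b) * 100

# On-demand rates from the table above (region-dependent; verify current rates).
savings = pct_savings(21.50, 32.77)  # trn1.32xlarge vs p4d.24xlarge
```

The same helper applied per GB of accelerator memory ($21.50/512 GB vs $32.77/320 GB) shows an even larger gap for memory-bound workloads.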
Tip: use the neuron_parallel_compile utility to pre-compile training graphs ahead of time and reduce first-run compilation overhead.
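A typical invocation wraps the normal launch command so the Neuron compiler traces and caches the training graphs up front; `train.py` and the process count below are placeholders for your own script and topology:

```shell
# Trace and pre-compile the training graphs, populating the Neuron
# compiler cache without running real training.
neuron_parallel_compile torchrun --nproc_per_node=32 train.py

# Subsequent real runs reuse the cached compiled graphs.
torchrun --nproc_per_node=32 train.py
```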