Spot and Preemptible Instances for AI Intermediate
Spot Instances (AWS), Spot or preemptible VMs (GCP), and Spot VMs (Azure) offer the same hardware as on-demand instances at a 60-90% discount. The trade-off is that the cloud provider can reclaim them on short notice, typically 30 seconds to 2 minutes. For AI training jobs that support checkpointing, this trade-off is almost always worthwhile, because an interruption costs only the work done since the last checkpoint.
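A rough way to sanity-check that claim is to price in the work that has to be redone after each interruption. The sketch below is a back-of-the-envelope model, not a billing calculator: the function name `spot_cost`, the hourly price, the interruption count, and the assumption that each interruption loses half a checkpoint interval on average are all illustrative.

```python
def spot_cost(on_demand_hourly, discount, job_hours,
              interruptions, checkpoint_interval_hours):
    """Estimate total spot cost, counting work redone after interruptions.

    Assumes each interruption loses, on average, half a checkpoint
    interval of progress, and ignores restart and checkpoint-write time.
    """
    spot_hourly = on_demand_hourly * (1 - discount)
    redone_hours = interruptions * checkpoint_interval_hours / 2
    return spot_hourly * (job_hours + redone_hours)


# Hypothetical numbers: $30/hour on-demand, 70% discount, 100-hour job,
# 10 interruptions, checkpoint every 30 minutes.
on_demand_total = 30.0 * 100
spot_total = spot_cost(30.0, 0.70, 100, 10, 0.5)
print(f"on-demand: ${on_demand_total:.2f}, spot: ${spot_total:.2f}")
```

Under these assumed inputs the redone work adds only 2.5 hours to a 100-hour job, so the discount dominates the result. The balance shifts if checkpoints are infrequent or restarts are slow.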
Savings Comparison
| Provider | Name | Discount | Interruption Notice |
|---|---|---|---|
| AWS | Spot Instances | 60-90% | 2 minutes |
| GCP | Spot VMs | 60-91% | 30 seconds |
| Azure | Spot VMs | 60-90% | 30 seconds |
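The notice periods in the table only help if the training process actually sees the notice. On AWS, the two-minute warning is published through the instance metadata service at `spot/instance-action`, which returns 404 until an interruption is scheduled. The following is a minimal sketch of polling it with the standard library, assuming IMDSv2 is enabled; the function names and the one-second timeout are illustrative choices, and GCP and Azure expose their own, different metadata endpoints that are not covered here.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw instance-action body, or None if nothing is scheduled."""
    # IMDSv2 requires a short-lived session token for every request.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise


def interruption_pending(fetch=fetch_instance_action):
    """True once AWS has scheduled a stop, terminate, or hibernate action."""
    body = fetch()
    if body is None:
        return False
    action = json.loads(body).get("action")
    return action in ("stop", "terminate", "hibernate")
```

A training loop could call `interruption_pending()` every few seconds from a background thread and trigger a checkpoint when it returns `True`. The `fetch` parameter exists so the parsing logic can be exercised off-instance with a stub.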
Checkpointing Strategy
The key to using spot instances safely is regular checkpointing:
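```python
import os
import signal

import boto3
import torch

BUCKET = 'my-bucket'
s3 = boto3.client('s3')


def save_checkpoint(model, optimizer, epoch, step, path):
    """Save training state locally, then copy it to S3 for durability."""
    checkpoint = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'epoch': epoch,
        'step': step,
    }
    torch.save(checkpoint, path)
    # Upload to S3 so the checkpoint survives the instance being reclaimed
    s3.upload_file(path, BUCKET, f'checkpoints/{os.path.basename(path)}')


# Handle the spot interruption signal. The handler only sets a flag, so the
# checkpoint is never written in the middle of an optimizer step.
interrupted = False


def handle_interruption(signum, frame):
    global interrupted
    interrupted = True


signal.signal(signal.SIGTERM, handle_interruption)

# In the training loop, after each completed step:
#     if interrupted:
#         save_checkpoint(model, optimizer, epoch, step, 'emergency.pt')
#         raise SystemExit('Spot interruption - checkpoint saved')
```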
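One gap worth guarding against: if the instance is reclaimed while a checkpoint file is still being written, the file can be left truncated and the previous good checkpoint is already overwritten. A common remedy is to write to a temporary file and rename it into place. The helper below is a sketch of that pattern using only the standard library; `atomic_save` and its `write_fn` callback are hypothetical names, not part of PyTorch or boto3.

```python
import os
import tempfile


def atomic_save(write_fn, path):
    """Call write_fn(tmp_path), then atomically rename tmp_path to path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Also covers SystemExit/KeyboardInterrupt raised during the write.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With PyTorch this could be used as `atomic_save(lambda p: torch.save(checkpoint, p), path)`, so that `path` always holds either the previous complete checkpoint or the new one.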
Best Practices for Spot Training
- Checkpoint every 15-30 minutes: this balances checkpoint overhead against the work lost on each interruption
- Use multiple instance types: diversify across instance families to reduce interruption probability
- Use multiple availability zones: spread training nodes across AZs to lower aggregate interruption risk
- Implement elastic training: use frameworks that can add or remove workers dynamically (PyTorch Elastic)
- Set a maximum price where the provider supports it (AWS and Azure; GCP Spot VMs have no bidding): a cap of 50-70% of on-demand guarantees meaningful savings, but the instance is reclaimed whenever the spot price rises above the cap
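The 15-30 minute checkpoint guideline above can be checked against a standard rule of thumb. Young's approximation, which is brought in here for illustration and is not something the lesson itself cites, puts the interval that minimizes expected lost time at the square root of twice the checkpoint write time multiplied by the mean time between interruptions. The sketch below applies it to hypothetical inputs; the function name and both numbers are assumptions.

```python
import math


def checkpoint_interval_seconds(checkpoint_write_seconds,
                                mean_seconds_between_interruptions):
    """Young's approximation: interval = sqrt(2 * write_time * MTBF)."""
    return math.sqrt(
        2 * checkpoint_write_seconds * mean_seconds_between_interruptions
    )


# Hypothetical inputs: a 60-second checkpoint write and, on average,
# one interruption every 4 hours.
interval = checkpoint_interval_seconds(60, 4 * 3600)
print(f"checkpoint about every {interval / 60:.1f} minutes")
```

With these assumed inputs the result falls inside the 15-30 minute range. Slower checkpoint writes or rarer interruptions push the interval longer, and faster writes or more frequent interruptions push it shorter.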
When NOT to Use Spot: Production inference endpoints should not use spot instances. The interruption risk is unacceptable for user-facing services. Use spot only for training, batch inference, and development workloads.
Lilly Tech Systems