Spot and Preemptible Instances for AI (Intermediate)

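Spot Instances (AWS), Spot VMs (GCP, the successor to preemptible VMs), and Spot VMs (Azure) offer the same hardware as on-demand instances at a 60-90% discount. The trade-off is that the cloud provider can reclaim them on short notice, from 30 seconds to 2 minutes depending on the provider. For AI training jobs that support checkpointing, this trade-off is almost always worthwhile, because an interrupted job loses only the work done since its last checkpoint.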

Savings Comparison

Provider   Name             Discount   Interruption Notice
AWS        Spot Instances   60-90%     2 minutes
GCP        Spot VMs         60-91%     30 seconds
Azure      Spot VMs         60-90%     30 seconds
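
To see how the discount interacts with interruptions, the arithmetic can be sketched as below. This is a rough model, not a provider pricing formula: the function names, the `rework_fraction` parameter, and the example rate are all hypothetical, and it assumes the only cost of an interruption is extra compute time spent restarting and redoing lost work.

```python
def spot_cost(on_demand_hourly, discount, job_hours, rework_fraction=0.0):
    """Estimate the cost of a job on spot capacity.

    discount        -- spot discount as a fraction, e.g. 0.7 for 70%
    rework_fraction -- assumed extra compute time caused by interruptions,
                       as a fraction of job_hours (hypothetical parameter)
    """
    spot_hourly = on_demand_hourly * (1.0 - discount)
    return spot_hourly * job_hours * (1.0 + rework_fraction)


def effective_savings(discount, rework_fraction=0.0):
    """Fraction saved versus on-demand once rework is paid for."""
    return 1.0 - (1.0 - discount) * (1.0 + rework_fraction)


def break_even_rework(discount):
    """Rework fraction at which spot costs the same as on-demand."""
    return discount / (1.0 - discount)


if __name__ == "__main__":
    # Hypothetical figures: $30/hour on-demand, 70% discount, 10% rework.
    cost = spot_cost(30.0, 0.7, job_hours=100, rework_fraction=0.1)
    print(f"spot cost: ${cost:.2f}")
    print(f"savings:   {effective_savings(0.7, 0.1):.0%}")
    print(f"break-even rework at 60% discount: {break_even_rework(0.6):.0%}")
```

Under these assumptions, a 100-hour job at a hypothetical $30/hour on-demand rate with a 70% discount and 10% rework costs about $990 on spot instead of $3,000 on demand, a saving of 67%. Even at a 60% discount, rework would have to exceed 150% of the job's compute time before spot stopped being cheaper.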

Checkpointing Strategy

The key to using spot instances safely is regular checkpointing:

Python
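import os
import signal

import boto3
import torch

def save_checkpoint(model, optimizer, epoch, step, path, bucket='my-bucket'):
    """Save training state to local disk, then copy it to S3."""
    checkpoint = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'epoch': epoch,
        'step': step,
    }
    torch.save(checkpoint, path)
    # Local disk is lost when the instance is reclaimed, so upload to S3
    # for durability. Use only the file name so the key stays predictable.
    key = f'checkpoints/{os.path.basename(path)}'
    boto3.client('s3').upload_file(path, bucket, key)

# Handle the SIGTERM delivered when a reclaimed instance shuts down.
# model, optimizer, current_epoch, and current_step are assumed to be
# module-level variables kept up to date by the training loop.
def handle_interruption(signum, frame):
    save_checkpoint(model, optimizer, current_epoch, current_step, 'emergency.pt')
    raise SystemExit('Spot interruption - checkpoint saved')

signal.signal(signal.SIGTERM, handle_interruption)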
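
Writing a checkpoint from inside a signal handler can interrupt a training step halfway through, so one alternative is to have the handler only set a flag and let the training loop save at the next step boundary. The sketch below illustrates that pattern under assumptions: `InterruptionFlag`, `train`, `run_step`, and `save` are hypothetical names, and `save` stands in for a call such as `save_checkpoint` above.

```python
import signal


class InterruptionFlag:
    """Records that a termination signal arrived; does no work in the handler."""

    def __init__(self):
        self.requested = False

    def install(self, signum=signal.SIGTERM):
        # Register the handler; must be called from the main thread.
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Only flip the flag. The training loop decides when to save.
        self.requested = True


def train(total_steps, run_step, save, flag, checkpoint_every=100):
    """Run up to total_steps steps and return how many were completed.

    run_step(step) -- hypothetical callable that performs one training step
    save(step)     -- hypothetical callable that writes a checkpoint
    """
    for step in range(1, total_steps + 1):
        run_step(step)
        if flag.requested:
            # Step boundary: state is consistent, so save and stop.
            save(step)
            return step
        if step % checkpoint_every == 0:
            # Regular periodic checkpoint.
            save(step)
    return total_steps
```

Call `flag.install()` once before training starts. Checking the flag at step boundaries keeps model and optimizer state consistent, at the cost of waiting for the current step to finish, so the pattern suits steps that are much shorter than the provider's notice period.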

Best Practices for Spot Training

  • Checkpoint every 15-30 minutes: shorter intervals add checkpoint overhead, longer ones lose more work per interruption
  • Use multiple instance types: diversify across instance families to reduce interruption probability
  • Use multiple availability zones: spread training nodes across AZs to lower aggregate interruption risk
  • Implement elastic training: use a framework that can add and remove workers dynamically, such as PyTorch Elastic
  • Set a maximum price: where the provider supports a price cap, set it at 50-70% of on-demand to ensure meaningful savings
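
The 15-30 minute guideline can be sanity-checked with Young's approximation, a standard first-order formula that is not part of this lesson's original material: the interval that minimizes wasted time is roughly sqrt(2 * C * M), where C is the time to write one checkpoint and M is the mean time between interruptions. The sketch below is a minimal illustration; the function names and the example figures (a 1-minute checkpoint, one interruption every 4 hours on average) are assumptions, not measurements.

```python
import math


def young_interval(checkpoint_cost, mean_time_between_interruptions):
    """Young's approximation of the checkpoint interval that minimizes waste.

    Both arguments must use the same time unit; the result uses it too.
    """
    return math.sqrt(2.0 * checkpoint_cost * mean_time_between_interruptions)


def wasted_fraction(interval, checkpoint_cost, mean_time_between_interruptions):
    """First-order estimate of the fraction of time lost.

    Adds the time spent writing checkpoints to the expected rework,
    assuming an interruption loses half an interval of work on average.
    """
    overhead = checkpoint_cost / interval
    rework = interval / (2.0 * mean_time_between_interruptions)
    return overhead + rework


if __name__ == "__main__":
    # Assumed figures, in minutes: 1-minute checkpoint, 4 hours between interruptions.
    cost, mtbi = 1.0, 240.0
    best = young_interval(cost, mtbi)
    print(f"suggested interval: {best:.1f} min")
    for interval in (5.0, best, 120.0):
        waste = wasted_fraction(interval, cost, mtbi)
        print(f"interval {interval:6.1f} min -> about {waste:.1%} of time wasted")
```

With those assumed figures the formula suggests checkpointing about every 22 minutes, which falls inside the 15-30 minute range above. Slower checkpoints or rarer interruptions push the interval up; faster checkpoints or more frequent interruptions push it down.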

When NOT to Use Spot: Production inference endpoints should not use spot instances. The interruption risk is unacceptable for user-facing services. Use spot only for training, batch inference, and development workloads.

Ready to Learn Right-sizing?

The next lesson covers how to match GPU types and instance sizes to actual workload requirements.
