Intermediate

SageMaker Training

Master model training on SageMaker with built-in algorithms, custom training scripts, distributed training, and cost optimization with spot instances.

Built-in Algorithms

SageMaker provides 17+ optimized built-in algorithms that you can use without writing any training code:

| Algorithm | Type | Use Case |
| --- | --- | --- |
| XGBoost | Classification/Regression | Tabular data, feature-rich datasets |
| Linear Learner | Classification/Regression | Linear relationships, high-dimensional data |
| K-Nearest Neighbors | Classification/Regression | Similarity-based prediction |
| Image Classification | Computer Vision | Image categorization with ResNet |
| Object Detection | Computer Vision | Locating objects in images |
| BlazingText | NLP | Text classification, word embeddings |
| DeepAR | Time Series | Forecasting with autoregressive models |
| K-Means | Clustering | Unsupervised grouping |
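As a rough sketch of how a built-in algorithm is launched with the SageMaker Python SDK (the bucket, role ARN, instance type, and XGBoost version here are illustrative placeholders, not values from this page):

```python
# Sketch: launching the built-in XGBoost algorithm via the SageMaker Python SDK.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

# Resolve the managed container image for the built-in algorithm in this region.
image_uri = sagemaker.image_uris.retrieve(framework="xgboost", region=region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgboost/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Channel name "train" maps to /opt/ml/input/data/train inside the container.
estimator.fit({"train": TrainingInput("s3://my-bucket/xgboost/train", content_type="text/csv")})
```

This is a configuration sketch: running it requires an AWS account, an execution role, and training data in S3.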

Custom Training Jobs

For custom models, SageMaker supports bringing your own training scripts with popular frameworks:

  • Script mode: Provide a Python training script, and SageMaker handles the infrastructure
  • Framework containers: Pre-built Docker containers for TensorFlow, PyTorch, Scikit-learn, Hugging Face, and XGBoost
  • Custom containers: Build your own Docker container with any framework or dependencies
  • Input channels: Data is automatically downloaded from S3 to the training instance at /opt/ml/input/data/
  • Model output: Save model artifacts to /opt/ml/model/ and SageMaker uploads them to S3
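To make the input/output contract in the last two bullets concrete, here is a minimal toy script-mode entry point. The CSV input, JSON "model", and file names are illustrative stand-ins for real training logic:

```python
import csv
import json
import os

# SageMaker injects these paths via environment variables inside the container;
# the defaults match the conventional locations described above.
TRAIN_DIR = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
MODEL_DIR = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

def train(train_dir: str, model_dir: str) -> dict:
    """Toy 'training': fit the mean of the label column in train.csv."""
    with open(os.path.join(train_dir, "train.csv")) as f:
        labels = [float(row[0]) for row in csv.reader(f)]
    model = {"label_mean": sum(labels) / len(labels)}
    # Everything written under model_dir is packaged and uploaded to S3 by SageMaker.
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model

# Inside the container, the script would end with: train(TRAIN_DIR, MODEL_DIR)
```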
💡 Training workflow: SageMaker provisions instances, downloads your data from S3, runs your training script, saves the model to S3, and then terminates the instances. You only pay for the time the training job runs — billed per second.
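A hedged sketch of launching a custom script with a framework container (the role ARN, S3 prefix, and hyperparameter names are placeholders):

```python
# Sketch: script mode with the pre-built PyTorch framework container.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder execution role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 10, "lr": 1e-3},  # passed to train.py as CLI arguments
)

# Channel "train" appears at /opt/ml/input/data/train inside the container.
estimator.fit({"train": "s3://my-bucket/data/train"})  # placeholder S3 prefix
```

As with any Estimator configuration, this only runs against a real AWS account with data staged in S3.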

Distributed Training

SageMaker simplifies distributed training across multiple instances and GPUs:

  • Data parallelism: Split data across multiple GPUs/instances — each processes a subset and gradients are synchronized
  • Model parallelism: Split large models across multiple GPUs when a model doesn't fit in a single GPU's memory
  • SageMaker Distributed: Optimized libraries for both data and model parallelism with near-linear scaling
  • Horovod support: Use Horovod for distributed TensorFlow and PyTorch training
  • Multi-GPU instances: Use instances like ml.p3.16xlarge (8 V100 GPUs) or ml.p4d.24xlarge (8 A100 GPUs)
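The data-parallel pattern in the first bullet can be simulated in plain Python with a toy one-parameter model (real jobs delegate the all-reduce step to the SageMaker Distributed or Horovod libraries over the network):

```python
def shard(data, num_workers):
    # Split the dataset into near-equal shards, one per worker.
    return [data[i::num_workers] for i in range(num_workers)]

def local_grad(w, shard_data):
    # Gradient of mean squared error (w*x - y)^2 on this worker's shard, w.r.t. w.
    return sum(2 * (w * x - y) * x for x, y in shard_data) / len(shard_data)

def allreduce_mean(grads):
    # Stand-in for the gradient all-reduce performed across GPUs/instances.
    return sum(grads) / len(grads)

def train_step(w, data, num_workers, lr=0.1):
    shards = shard(data, num_workers)
    grads = [local_grad(w, s) for s in shards]  # computed in parallel in reality
    g = allreduce_mean(grads)                   # synchronize gradients
    return w - lr * g                           # identical update on every worker
```

Because every worker applies the same averaged gradient, a step with 2 workers matches a step on a single worker over the full dataset.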

Hyperparameter Tuning

SageMaker Automatic Model Tuning (AMT) finds optimal hyperparameters:

  • Bayesian optimization: Intelligently explores the hyperparameter space based on previous results
  • Random search: Explore hyperparameters randomly for broad coverage
  • Grid search: Exhaustively test all combinations of specified values
  • Warm start: Continue tuning from previous tuning job results
  • Early stopping: Automatically stop poorly performing training jobs to save resources
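A hedged sketch of wiring AMT onto an existing estimator (the metric name matches what built-in XGBoost emits; the ranges and job counts are illustrative):

```python
# Sketch: Automatic Model Tuning over a configured Estimator.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                     # any configured Estimator
    objective_metric_name="validation:auc",  # metric emitted by built-in XGBoost
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",          # or "Random" / "Grid"
    max_jobs=20,                  # total training jobs to launch
    max_parallel_jobs=2,          # concurrency limit
    early_stopping_type="Auto",   # stop poorly performing jobs early
)

tuner.fit({
    "train": "s3://my-bucket/data/train",            # placeholder S3 prefixes
    "validation": "s3://my-bucket/data/validation",
})
```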

Spot Instances

Managed Spot Training can reduce training costs by up to 90%:

  • Automatic checkpointing: SageMaker saves training progress so jobs can resume if interrupted
  • Transparent management: SageMaker handles spot instance acquisition and interruption automatically
  • Max wait time: Set a maximum waiting time for spot capacity to become available
  • Fallback: Optionally fall back to on-demand instances if spot isn't available within your time limit
Pro tip: Always enable managed spot training for non-urgent training jobs. Set use_spot_instances=True and max_wait in your Estimator configuration. The savings are substantial and SageMaker handles all the complexity of checkpointing and resumption.
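Following the pro tip, a configuration sketch of the relevant Estimator arguments (bucket, role, and image are placeholders):

```python
# Sketch: enabling managed spot training on an Estimator.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,  # any training image, e.g. resolved via sagemaker.image_uris.retrieve
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,   # request spot capacity
    max_run=3600,              # max training time in seconds
    max_wait=7200,             # max total wait incl. spot delays; must be >= max_run
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # checkpoints sync here for resume
)
```

With `checkpoint_s3_uri` set, SageMaker syncs `/opt/ml/checkpoints/` to S3 so an interrupted job can resume from its last checkpoint.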