Advanced

Domain 4: ML Implementation & Operations (20%)

Deploying, monitoring, and operationalizing ML models on AWS — SageMaker endpoints, A/B testing, model monitoring, security best practices, and Auto Scaling.

💡

Exam weight: This domain accounts for 20% of your score (~13 questions). Focus on SageMaker deployment patterns, endpoint configurations, and security.

SageMaker Deployment Options

Understanding the different deployment options and when to use each is critical for the exam.

Real-Time Endpoints

Persistent HTTPS endpoints for synchronous predictions
Low-latency, always-on inference
Support Auto Scaling based on invocations or custom CloudWatch metrics
Use when: Application needs immediate predictions (web apps, APIs, chatbots)
Cost: Pay per instance-hour while endpoint is running (even with zero traffic)

Batch Transform

Process an entire dataset at once without a persistent endpoint
Input from S3, output to S3
Use when: Predictions on a full dataset (nightly scoring, bulk processing)
Cost-effective for large batch jobs since instances shut down after completion
Supports data joining (associating predictions with original data IDs)
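As a rough sketch of the data-joining feature, the function below builds keyword arguments for the CreateTransformJob API. The job, model, and bucket names are placeholders, the instance settings are arbitrary, and the filters assume CSV rows whose first column is a record ID.

```python
def build_transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri):
    """Return keyword arguments for SageMaker's CreateTransformJob API."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each line as one record
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_uri,
            "Accept": "text/csv",
            "AssembleWith": "Line",
        },
        # Instance type and count are illustrative assumptions.
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 2},
        "DataProcessing": {
            "InputFilter": "$[1:]",     # drop column 0 (the record ID) before inference
            "JoinSource": "Input",      # join each prediction back onto its input row
            "OutputFilter": "$[0,-1]",  # keep only the record ID and the prediction
        },
    }


# Hypothetical names; nothing is sent to AWS here.
request = build_transform_job_request(
    "nightly-scoring", "churn-model", "s3://example-bucket/in/", "s3://example-bucket/out/"
)
print(request["DataProcessing"])

# To submit (requires boto3 and AWS credentials):
#   boto3.client("sagemaker").create_transform_job(**request)
```

With `JoinSource` set to `Input`, each output line carries the original record alongside its prediction, which is what lets the output filter keep the ID next to the score.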

Serverless Inference

Automatically provisions and scales compute capacity
Scales to zero when not in use (no charge for idle time)
Use when: Unpredictable or intermittent traffic patterns
Cold start latency (seconds) — not suitable for latency-sensitive applications

Asynchronous Inference

Queues requests and processes them asynchronously
Scales to zero instances during idle periods
Use when: Large payloads (up to 1 GB), long processing times, tolerance for latency
Notifications via SNS when predictions complete
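A minimal sketch of how an asynchronous endpoint might be described, assuming the CreateEndpointConfig API and placeholder names and ARNs. The instance type and concurrency value are arbitrary choices for illustration.

```python
def build_async_endpoint_config(config_name, model_name, output_s3_uri,
                                success_topic_arn, error_topic_arn):
    """Return keyword arguments for CreateEndpointConfig with async inference."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": "ml.m5.xlarge",  # illustrative assumption
                "InitialInstanceCount": 1,
            }
        ],
        "AsyncInferenceConfig": {
            "OutputConfig": {
                "S3OutputPath": output_s3_uri,  # results are written here
                "NotificationConfig": {
                    "SuccessTopic": success_topic_arn,  # SNS topic for completed requests
                    "ErrorTopic": error_topic_arn,      # SNS topic for failed requests
                },
            },
            "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
        },
    }


# Placeholder ARNs and bucket; nothing is sent to AWS here.
config = build_async_endpoint_config(
    "doc-parser-async",
    "doc-parser-model",
    "s3://example-bucket/async-results/",
    "arn:aws:sns:us-east-1:111122223333:inference-success",
    "arn:aws:sns:us-east-1:111122223333:inference-error",
)
print(config["AsyncInferenceConfig"]["OutputConfig"]["S3OutputPath"])

# To create it (requires boto3 and AWS credentials):
#   boto3.client("sagemaker").create_endpoint_config(**config)
```

Callers then submit work by pointing the endpoint at an input object in S3 rather than sending the payload inline, which is what makes large payloads practical.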

💡

Decision tree: Need immediate response? Real-time endpoint. Processing a whole dataset? Batch transform. Intermittent traffic, cost-sensitive? Serverless. Large payloads, can wait? Async inference.
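The decision tree above can be written as a small function for self-testing. This is only a study aid: the parameter names are invented here, and the questions are checked in the same order as the callout.

```python
def choose_deployment_option(immediate_response=False, whole_dataset=False,
                             intermittent_traffic=False, large_payload_can_wait=False):
    """Walk the four questions in order and return the first matching option."""
    if immediate_response:
        return "Real-time endpoint"
    if whole_dataset:
        return "Batch Transform"
    if intermittent_traffic:
        return "Serverless Inference"
    if large_payload_can_wait:
        return "Asynchronous Inference"
    return None  # no rule matched; the scenario needs more detail


print(choose_deployment_option(whole_dataset=True))
print(choose_deployment_option(large_payload_can_wait=True))
```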

A/B Testing and Production Variants

SageMaker supports running multiple model versions behind a single endpoint:

Production Variants

Deploy multiple models to a single endpoint with traffic splitting
Example: 90% traffic to Model A (current), 10% to Model B (new)
Each variant can use different instance types and counts
Gradually shift traffic as you gain confidence in the new model
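The 90/10 split might be expressed as follows. This is a sketch that assumes the CreateEndpointConfig and UpdateEndpointWeightsAndCapacities APIs; the endpoint, variant, and model names and the instance types are hypothetical.

```python
def build_endpoint_config(config_name, variants):
    """variants: iterable of (variant_name, model_name, instance_type, count, weight)."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": name,
                "ModelName": model,
                "InstanceType": instance_type,
                "InitialInstanceCount": count,
                "InitialVariantWeight": weight,
            }
            for name, model, instance_type, count, weight in variants
        ],
    }


def traffic_shares(config):
    """Each variant receives its weight divided by the sum of all weights."""
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


def build_weight_update(endpoint_name, new_weights):
    """Return keyword arguments for shifting traffic on a live endpoint."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": weight}
            for name, weight in new_weights.items()
        ],
    }


config = build_endpoint_config(
    "fraud-ab-config",
    [
        ("model-a", "fraud-model-v1", "ml.m5.large", 2, 9.0),  # current model
        ("model-b", "fraud-model-v2", "ml.c5.large", 1, 1.0),  # new model
    ],
)
print(traffic_shares(config))

# Later, move to a 50/50 split without redeploying:
update = build_weight_update("fraud-endpoint", {"model-a": 1.0, "model-b": 1.0})

# To apply (requires boto3 and AWS credentials):
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**config)
#   sm.update_endpoint_weights_and_capacities(**update)
```

Weights are relative rather than percentages, so 9 and 1 produce the same split as 0.9 and 0.1.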

Shadow Testing

Route a copy of production traffic to a shadow variant
Shadow variant makes predictions but results are NOT returned to users
Compare shadow predictions against production without risk
Use when: You want to validate a new model on real traffic before any users see its predictions

SageMaker Model Monitor

Monitors deployed models for quality degradation over time:

Monitoring Types

Data quality monitoring — Detects drift in input data distributions. Compares current data against a baseline (training data statistics).
Model quality monitoring — Tracks model accuracy over time. Requires ground truth labels (may have a delay).
Bias drift monitoring — Uses SageMaker Clarify to detect changes in bias metrics over time.
Feature attribution drift — Detects changes in feature importance using SHAP values.

How It Works

Create a baseline from training data (statistics, constraints)
Schedule monitoring jobs (hourly, daily) via Processing jobs
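Violations are written to a report and emitted as CloudWatch metrics; alarms on those metrics can trigger retraining
Data capture records request and response payloads to S3 at a configurable sampling percentage (set it to 100% to capture every prediction)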

⚠

Exam concept: "Data drift" = the input data distribution changes over time (e.g., customer demographics shift). "Concept drift" = the relationship between input and output changes (e.g., customer behavior changes). Both require model retraining, but you detect data drift first (no labels needed) and concept drift later (requires ground truth).
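To make the distinction concrete, here is a toy sketch in plain Python. It is not how Model Monitor computes its statistics: the data drift check is a simple test on the mean of one feature, and the concept drift check is an accuracy comparison. Function names, thresholds, and sample values are all invented for illustration.

```python
from statistics import mean, stdev


def data_drift_detected(baseline, current, threshold=3.0):
    """Compare a feature's current mean with the baseline mean. No labels needed."""
    standard_error = stdev(baseline) / len(current) ** 0.5
    return abs(mean(current) - mean(baseline)) > threshold * standard_error


def concept_drift_detected(predictions, ground_truth, baseline_accuracy, tolerance=0.05):
    """Compare live accuracy with the accuracy measured at training time.

    This needs ground truth labels, so it can only run after they arrive.
    """
    correct = sum(p == y for p, y in zip(predictions, ground_truth))
    accuracy = correct / len(ground_truth)
    return accuracy < baseline_accuracy - tolerance


# Customer age at training time versus this week (made-up numbers).
training_ages = [30, 32, 31, 29, 33, 30, 31, 32]
recent_ages = [45, 47, 44, 46]
print(data_drift_detected(training_ages, recent_ages))

# Labels arrive later and show the model is now wrong half the time.
print(concept_drift_detected([1, 0, 1, 1], [0, 0, 0, 1], baseline_accuracy=0.9))
```

The first check can run as soon as inputs are captured, while the second has to wait for labels, which mirrors the detection order described above.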

SageMaker Pipelines

CI/CD for machine learning workflows:

Define end-to-end ML workflows as directed acyclic graphs (DAGs)
Steps: Processing, Training, Tuning, Model Registration, Transform, Condition
Integrates with SageMaker Model Registry for model versioning and approval
Tracks lineage: which data, code, and parameters produced each model
Supports parameterized execution (different datasets, hyperparameters per run)
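The DAG idea can be illustrated without the SageMaker SDK. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to order a hypothetical workflow; the step names mirror the step types listed above, and the accuracy gate is an invented stand-in for a Condition step.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical workflow).
PIPELINE_DEPENDENCIES = {
    "Processing": set(),
    "Training": {"Processing"},
    "Condition": {"Training"},
    "ModelRegistration": {"Condition"},
    "Transform": {"Condition"},
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError if there is a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


def condition_passes(metrics, metric_name="accuracy", threshold=0.9):
    """Toy gate: continue to registration only if the metric meets the threshold."""
    return metrics.get(metric_name, 0.0) >= threshold


order = execution_order(PIPELINE_DEPENDENCIES)
print(order)
print(condition_passes({"accuracy": 0.93}))
```

Because the graph is acyclic, every step runs only after its inputs exist, and steps with no dependency on each other (here, registration and transform) can run in either order.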

SageMaker Model Registry

Catalog of trained models with versioning
Approval status: PendingManualApproval → Approved or Rejected; an Approved version can then be deployed
Tracks model lineage (training data, algorithm, metrics)
Integrates with CI/CD for automated deployment upon approval
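An approval step might look like the following sketch, which builds keyword arguments for the UpdateModelPackage API. The ARN is a placeholder and the helper name is invented for illustration.

```python
VALID_APPROVAL_STATUSES = {"PendingManualApproval", "Approved", "Rejected"}


def build_approval_update(model_package_arn, status, description=""):
    """Return keyword arguments for changing a model version's approval status."""
    if status not in VALID_APPROVAL_STATUSES:
        raise ValueError(f"unknown approval status: {status!r}")
    request = {"ModelPackageArn": model_package_arn, "ModelApprovalStatus": status}
    if description:
        request["ApprovalDescription"] = description
    return request


# Placeholder ARN; nothing is sent to AWS here.
request = build_approval_update(
    "arn:aws:sagemaker:us-east-1:111122223333:model-package/fraud-models/3",
    "Approved",
    description="Passed offline evaluation",
)
print(request["ModelApprovalStatus"])

# To apply (requires boto3 and AWS credentials):
#   boto3.client("sagemaker").update_model_package(**request)
```

A CI/CD system can react to the status change and deploy the approved version, which is the automated path described above.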

Auto Scaling

SageMaker endpoints support Auto Scaling with Application Auto Scaling:

Target tracking scaling: Maintain a target metric (e.g., SageMakerVariantInvocationsPerInstance = 100). Application Auto Scaling adjusts the instance count automatically. Most common on exam.
Step scaling: Add/remove instances based on CloudWatch alarm thresholds.
Scheduled scaling: Pre-scale for predictable traffic patterns (e.g., scale up before business hours).
Set minimum and maximum instance counts to control costs
Cool-down period prevents rapid scaling oscillations
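The scaling options above could be combined roughly as follows. This sketch builds keyword arguments for three Application Auto Scaling calls; the endpoint and variant names, cooldown values, and cron expression are assumptions chosen for illustration.

```python
def build_scaling_requests(endpoint_name, variant_name, min_instances, max_instances,
                           target_invocations_per_instance, prescale_min_instances):
    """Return kwargs keyed by the Application Auto Scaling client method they belong to."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    return {
        # Minimum and maximum instance counts bound the cost.
        "register_scalable_target": {
            **common,
            "MinCapacity": min_instances,
            "MaxCapacity": max_instances,
        },
        # Target tracking: hold invocations per instance near the target value.
        "put_scaling_policy": {
            **common,
            "PolicyName": f"{variant_name}-invocations-target",
            "PolicyType": "TargetTrackingScaling",
            "TargetTrackingScalingPolicyConfiguration": {
                "TargetValue": float(target_invocations_per_instance),
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
                },
                "ScaleOutCooldown": 60,   # seconds; illustrative values
                "ScaleInCooldown": 300,
            },
        },
        # Scheduled scaling: raise the floor at 08:45 on weekdays (UTC unless a
        # Timezone is supplied).
        "put_scheduled_action": {
            **common,
            "ScheduledActionName": "weekday-morning-prescale",
            "Schedule": "cron(45 8 ? * MON-FRI *)",
            "ScalableTargetAction": {
                "MinCapacity": prescale_min_instances,
                "MaxCapacity": max_instances,
            },
        },
    }


requests = build_scaling_requests("fraud-endpoint", "model-a", 2, 10, 100, 6)
for method_name in requests:
    print(method_name)

# To apply (requires boto3 and AWS credentials):
#   client = boto3.client("application-autoscaling")
#   for method_name, kwargs in requests.items():
#       getattr(client, method_name)(**kwargs)
```

Registering the scalable target comes first, since both the policy and the scheduled action refer to it.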

Security for ML on AWS

Security is tested throughout the exam, not just in this domain:

Encryption

At rest: S3 server-side encryption (SSE-S3, SSE-KMS, SSE-C). SageMaker training volumes encrypted with KMS.
In transit: All SageMaker API calls over HTTPS/TLS. Inter-container traffic in distributed training can also be encrypted, but this is optional and must be enabled on the training job.
KMS customer managed keys for sensitive data. Specify in training job and endpoint configurations.

Network Isolation

VPC mode: Run SageMaker training and inference inside your VPC. Traffic then follows your VPC's routing, so there is no internet access unless the VPC provides it (e.g., through a NAT gateway)
VPC endpoints (PrivateLink): Access SageMaker API without going through the internet
Network isolation flag: Blocks all outbound network calls from the container, even to other AWS services such as S3. SageMaker still copies input data and model artifacts to and from S3 on the container's behalf.

IAM for ML

SageMaker execution role: IAM role that SageMaker assumes for training and inference. Must have permissions for S3, ECR, CloudWatch, KMS.
Least privilege: Grant only the S3 buckets, KMS keys, and ECR repositories needed.
Resource-based policies: S3 bucket policies to restrict which SageMaker roles can access training data.

💡

Exam pattern: When a question says "data must not traverse the internet," the answer involves VPC configuration + VPC endpoints (PrivateLink). When it says "container must not access any external resources," the answer is the network isolation flag.
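The security settings from this section map onto a handful of CreateTrainingJob fields. The sketch below builds only those fields; a real request also needs a job name, algorithm specification, execution role, and stopping condition. Subnet, security group, key, and bucket values are placeholders, and the instance settings are arbitrary.

```python
def build_secure_training_settings(subnet_ids, security_group_ids, kms_key_id,
                                   output_s3_uri, isolate_container=False):
    """Return the security-related subset of CreateTrainingJob keyword arguments."""
    return {
        # Run inside your VPC so traffic follows your own routing.
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
        # Encrypt model artifacts written to S3 with the customer managed key.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri, "KmsKeyId": kms_key_id},
        # Encrypt the training volume with the same key.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # illustrative assumption
            "InstanceCount": 2,
            "VolumeSizeInGB": 50,
            "VolumeKmsKeyId": kms_key_id,
        },
        # Optional encryption between containers in distributed training.
        "EnableInterContainerTrafficEncryption": True,
        # When True, the container cannot make outbound network calls.
        "EnableNetworkIsolation": isolate_container,
    }


# Placeholder identifiers; nothing is sent to AWS here.
settings = build_secure_training_settings(
    ["subnet-0123456789abcdef0"],
    ["sg-0123456789abcdef0"],
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
    "s3://example-bucket/model-artifacts/",
)
print(sorted(settings))
```

Keeping VPC configuration and network isolation as separate switches reflects the exam pattern above: one controls how traffic is routed, the other whether the container may open connections at all.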

Edge Deployment

SageMaker Neo: Compiles models for specific hardware (ARM, x86, GPU). Reduces model size and improves inference speed.
SageMaker Edge Manager: Deploy and manage models on edge devices (IoT). Monitors model performance on edge.
AWS IoT Greengrass: Run ML inference locally on IoT devices using Lambda or containers.

Containers and Frameworks

Pre-built containers: AWS provides Docker containers for TensorFlow, PyTorch, MXNet, scikit-learn, XGBoost. Use these as the default.
Bring Your Own Container (BYOC): Custom Docker image when pre-built containers do not support your framework or dependencies. Must follow SageMaker container contract (specific directory structure, entry points).
Script Mode: Use a pre-built container but supply your own training/inference script. Easier than BYOC.
ECR (Elastic Container Registry): Store custom containers. SageMaker pulls from ECR during training and deployment.

Practice Questions

A company needs to generate predictions for 10 million customer records every night for a marketing campaign. The predictions are not time-sensitive. Which SageMaker deployment option is most cost-effective?

A) Real-time endpoint with Auto Scaling
B) Batch Transform
C) Serverless Inference
D) Asynchronous Inference

Answer: B — Batch Transform is designed for exactly this scenario: processing a large dataset at once without maintaining a persistent endpoint. Instances spin up, process all 10 million records, and shut down. A real-time endpoint (A) would run 24/7, wasting cost during the ~23 hours of no traffic. Serverless (C) has cold starts and per-request costs that add up for 10 million records. Async (D) is for individual large payloads, not bulk processing.

A team wants to deploy a new version of their fraud detection model alongside the current production model, sending 5% of traffic to the new model. If the new model performs well, they will gradually increase its traffic share. Which SageMaker feature should they use?

A) Multi-model endpoints
B) Production variants with traffic splitting
C) SageMaker Pipelines
D) Batch Transform with two models

Answer: B — Production variants allow deploying multiple model versions behind a single endpoint with configurable traffic splitting (e.g., 95%/5%). Traffic can be gradually shifted using UpdateEndpointWeightsAndCapacities without downtime. Multi-model endpoints (A) host multiple models but do not split traffic. Pipelines (C) orchestrate training, not deployment. Batch Transform (D) is for offline processing.

A deployed ML model starts producing inaccurate predictions after several months. Investigation reveals that the distribution of incoming customer data has shifted significantly from the training data. Which SageMaker feature would have detected this issue earliest?

A) SageMaker Model Monitor — Data Quality monitoring
B) SageMaker Model Monitor — Model Quality monitoring
C) SageMaker Debugger
D) CloudWatch inference latency metrics

Answer: A. Data Quality monitoring detects data drift by comparing incoming data distributions against the training data baseline. It would flag the distribution shift at the next scheduled monitoring run, without waiting for ground truth labels. Model Quality monitoring (B) requires ground truth labels, which often arrive with a delay. Debugger (C) is for training, not production. CloudWatch latency (D) does not measure data quality.

A financial services company requires that ML training data and model artifacts never traverse the public internet, and all data must be encrypted with customer-managed KMS keys. What configuration is needed?

A) Enable SSL on the SageMaker endpoint
B) Configure SageMaker to run in a VPC with no internet gateway, use VPC endpoints for S3 and SageMaker, and specify KMS key IDs in training and endpoint configurations
C) Use S3 bucket policies to restrict access
D) Enable network isolation on the training job

Answer: B. This is the complete solution: VPC without internet gateway ensures no public internet traversal, VPC endpoints (PrivateLink) for S3 and SageMaker API keep all traffic on the AWS private network, and KMS key IDs in job configurations ensure customer-managed encryption. A only covers TLS in transit. C does not prevent internet traversal. D only blocks outbound calls from the container itself; it does not control how data is routed or which keys encrypt it.

A SageMaker real-time endpoint experiences traffic spikes every weekday at 9 AM. The current scaling policy reacts too slowly, causing timeouts during the spike. What should the team do?

A) Increase the maximum instance count
B) Add a scheduled scaling action to pre-scale before 9 AM combined with target tracking for unexpected spikes
C) Switch to Batch Transform
D) Use a larger instance type

Answer: B — Scheduled scaling proactively adds capacity before the predictable 9 AM spike, eliminating the reaction delay. Combined with target tracking scaling for unexpected variations throughout the day, this provides both proactive and reactive scaling. A alone does not help if scaling is too slow. C changes the architecture unnecessarily. D might help but does not address the scaling speed issue.

Key Takeaways for the Exam

Real-time endpoints for immediate predictions. Batch Transform for bulk processing. Serverless for intermittent traffic.
Production variants enable A/B testing with traffic splitting on a single endpoint.
Model Monitor: Data Quality (detects data drift without labels), Model Quality (needs ground truth labels).
VPC + VPC endpoints (PrivateLink) = no internet traversal. Network isolation = no network access at all.
SageMaker Neo for edge deployment optimization. Edge Manager for device fleet management.
Scheduled + target tracking scaling for predictable traffic patterns with unexpected spikes.
KMS customer-managed keys for data encryption at rest. SageMaker API calls are TLS-encrypted in transit; inter-container training traffic is encrypted only if you enable it.
