Advanced

Practice Exam 1

25 exam-style questions covering all 4 domains of the AWS Certified Machine Learning - Specialty exam. Try to answer each question before reading the explanation. Target time: 70 minutes (matching the real exam pace of ~2.75 min/question, i.e. 65 questions in 180 minutes).

Exam simulation: Cover the answer explanations as you go. Write down your answers first, then check. A passing score would be roughly 18/25 correct (72%).

Question 1 — Data Engineering

A company collects customer interaction logs in JSON format from 50 microservices. They need to build a centralized data lake for ML model training. The solution must automatically discover schemas as new services are added. Which approach requires the least operational effort?

A) Write custom Lambda functions to parse each service's JSON schema and create Glue Data Catalog entries
B) Use AWS Glue Crawlers to automatically discover and catalog schemas from S3
C) Create an EMR cluster running Hive to define schemas manually
D) Use Amazon Redshift Spectrum with manual DDL statements

Answer: B — Glue Crawlers automatically discover schemas by scanning data in S3, inferring column types, and populating the Glue Data Catalog. When new services are added, the crawler discovers new schemas automatically on its next run. All other options require manual intervention for each new service.

Question 2 — Data Engineering

A streaming application generates 10,000 events per second. A data engineer needs to store this data in S3 for ML model training while also enabling real-time dashboards. Which architecture best meets both requirements?

A) Kinesis Data Firehose → S3 (storage) + Kinesis Data Analytics (real-time dashboards)
B) Kinesis Data Streams → Lambda consumer (write to S3) + Kinesis Data Analytics consumer (dashboards)
C) SQS → Lambda (write to S3) + separate SQS queue for dashboard service
D) Direct S3 PUT API calls from the application + CloudWatch for dashboards

Answer: B — Kinesis Data Streams supports multiple consumers, allowing one consumer (Lambda or Firehose) to write to S3 for storage while another consumer (Kinesis Data Analytics) powers real-time dashboards. Option A uses Firehose, which is near-real-time (60s buffer), not true real-time. Option C cannot guarantee ordering, and SQS does not natively deliver the same message to multiple consumers. Option D relies on direct S3 PUT calls, which are not designed for high-throughput streaming.

Question 3 — Data Engineering

A data scientist needs to join a 2 TB CSV dataset in S3 with a 50 GB table in Amazon RDS PostgreSQL for ML feature engineering. The joined result should be stored in S3 as Parquet. Which service should they use?

A) Amazon Athena with federated queries
B) AWS Glue ETL job
C) Amazon Redshift COPY command
D) SageMaker Processing job

Answer: B — AWS Glue ETL jobs can connect to both S3 and RDS as data sources, perform joins using Spark, and write the output as Parquet to S3. This is a serverless, managed solution designed for exactly this type of cross-source ETL. Athena federated queries (A) have performance limitations for large joins. Redshift (C) requires loading all data into Redshift first. SageMaker Processing (D) could work but is not designed for ETL.

Question 4 — Exploratory Data Analysis

A dataset has a "zip_code" column with 42,000 unique values. The data scientist needs to include this feature in a gradient boosted tree model (XGBoost). What is the best encoding strategy?

A) One-hot encoding
B) Leave as-is (XGBoost can handle categorical data natively)
C) Target encoding with cross-validation to prevent leakage
D) Hash encoding with 100 hash buckets

Answer: C — With 42,000 unique values, one-hot encoding creates an impractically large sparse matrix. SageMaker's XGBoost implementation expects numerical input. Target encoding replaces each zip code with the mean target value, creating a single informative feature. Cross-validation prevents data leakage. Hash encoding (D) loses too much information with only 100 buckets and creates collisions.
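The out-of-fold idea behind option C can be sketched in plain Python. This is a minimal illustration, not a SageMaker API: the function name `target_encode_oof`, the round-robin fold assignment, and the sample zip codes are all assumptions made for the example. Each row is encoded using target means computed only from rows in other folds, so a row's own label never leaks into its encoded value.

```python
from collections import defaultdict

def target_encode_oof(categories, targets, n_folds=5):
    """Out-of-fold target encoding (hypothetical helper).

    Each row is encoded with the mean target of its category,
    computed only from rows that belong to the other folds.
    """
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate statistics from every row outside the held-out fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode the held-out rows with those statistics.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                # Categories unseen in the other folds fall back to the global mean.
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded

# Illustrative data only.
zips = ["98101", "98101", "10001", "98101", "10001", "60601"]
labels = [1, 0, 1, 1, 0, 1]
encoded = target_encode_oof(zips, labels, n_folds=3)
```

In practice a smoothing term is often added so that rare zip codes are pulled toward the global mean; that refinement is omitted here for brevity.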

Question 5 — Exploratory Data Analysis

A data scientist discovers that 35% of values in a critical feature are missing. Analysis shows the missingness correlates with the target variable (patients with severe conditions are less likely to complete surveys). What type of missing data is this, and how should it be handled?

A) Missing Completely At Random (MCAR) — drop rows with missing values
B) Missing At Random (MAR) — impute with mean values
C) Missing Not At Random (MNAR) — create a missingness indicator variable and use domain-specific imputation
D) Missing Completely At Random (MCAR) — impute with median values

Answer: C — When the probability of missingness depends on the unobserved (missing) value itself, this is MNAR. Simple imputation methods are biased for MNAR data. The best approach is to create a binary indicator variable ("is_missing") since the missingness itself carries information about severity, and apply domain-specific imputation. At 35% missingness, dropping rows would lose too much data.
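As a rough sketch of the indicator-variable approach described in the answer, the snippet below adds an "is_missing" flag alongside a filled-in column. The helper name `add_missing_indicator` and the fill value are assumptions for illustration; the right fill value would come from domain knowledge, not from this example.

```python
def add_missing_indicator(values, fill_value):
    """Return (filled_values, is_missing_flags) for a column with None gaps."""
    # 1 marks a row where the original value was missing.
    indicator = [1 if v is None else 0 for v in values]
    # The fill value stands in for a domain-specific imputation.
    filled = [fill_value if v is None else v for v in values]
    return filled, indicator

# Illustrative survey scores; None marks an incomplete survey.
survey_scores = [7, None, 4, None, 9]
filled, is_missing = add_missing_indicator(survey_scores, fill_value=3)
```

Both columns are then passed to the model, so it can learn from the fact that a value was missing as well as from the imputed value.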

Question 6 — Exploratory Data Analysis

A data scientist needs to visualize clusters in a high-dimensional dataset (200 features) to present findings to business stakeholders. Which technique should they use?

A) PCA to reduce to 2 dimensions and create a scatter plot
B) t-SNE to reduce to 2 dimensions and create a scatter plot
C) Create a 200-dimensional scatter plot
D) Use K-Means clustering labels on a bar chart

Answer: B — t-SNE is designed specifically for visualization of high-dimensional data in 2D/3D. It preserves local cluster structure, making clusters visually apparent in the resulting scatter plot. PCA (A) preserves global variance but often fails to show well-separated clusters when the structure is nonlinear. C is impossible. D only shows cluster sizes, not the data structure.

Question 7 — Modeling

A company wants to classify customer support tickets into 50 predefined categories using the ticket text. They have 500,000 labeled examples. Which SageMaker approach is the fastest to train?

A) Seq2Seq
B) Object2Vec
C) BlazingText in supervised mode
D) Custom TensorFlow BERT model on SageMaker

Answer: C — BlazingText in supervised mode is specifically optimized for text classification and trains very quickly (AWS documents it training on more than a billion words in a couple of minutes on a multi-core CPU or a GPU). With 500,000 labeled examples and 50 categories, it is the ideal SageMaker built-in algorithm. Seq2Seq is for sequence-to-sequence tasks. Object2Vec is for embedding pairs. A custom BERT model would be powerful but much slower to train.

Question 8 — Modeling

A model achieves 92% accuracy on both training and validation sets, but the business team reports it performs poorly in production. Investigation shows the model was trained on data from January to June, but customer behavior shifted significantly in July. What happened?

A) The model is overfitting
B) The model is underfitting
C) Concept drift — the relationship between features and target changed
D) Data leakage during training

Answer: C — Good training and validation performance but poor production performance, combined with a temporal shift in behavior, is classic concept drift. The statistical relationship between input features and the target variable changed over time. This is not overfitting (which would show poor validation scores) or underfitting (poor training scores). Data leakage would typically show inflated validation scores.

Question 9 — Modeling

A SageMaker training job for a deep learning model is taking too long on a single ml.p3.2xlarge instance. The dataset is 1 TB. What is the most effective way to speed up training?

A) Switch to a larger instance type (ml.p3.16xlarge)
B) Use distributed training with data parallelism across multiple ml.p3.2xlarge instances
C) Reduce the dataset to 100 GB by random sampling
D) Convert data from CSV to Parquet

Answer: B — Data parallelism distributes the dataset across multiple instances, each processing a subset and synchronizing gradients. This is the standard approach for scaling deep learning training with large datasets. A single larger instance (A) helps but has limits. Reducing the dataset (C) hurts model quality. Parquet conversion (D) helps data loading but not the actual training computation.
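The gradient-synchronization step behind data parallelism can be illustrated with a toy one-parameter model. This is a single-process sketch under stated assumptions, not SageMaker's distributed training library: the "workers" are just slices of a list, the function names are hypothetical, and the data is assumed to divide evenly across workers.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_step(w, xs, ys, n_workers, lr):
    """One SGD step where each simulated worker sees an equal shard of the data."""
    shard = len(xs) // n_workers  # assumes len(xs) is divisible by n_workers
    grads = []
    for k in range(n_workers):
        lo, hi = k * shard, (k + 1) * shard
        grads.append(mse_gradient(w, xs[lo:hi], ys[lo:hi]))
    # Averaging the per-worker gradients plays the role of the all-reduce.
    avg_grad = sum(grads) / n_workers
    return w - lr * avg_grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_parallel = data_parallel_step(0.0, xs, ys, n_workers=2, lr=0.01)
w_single = 0.0 - 0.01 * mse_gradient(0.0, xs, ys)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the parallel step lands on the same weight as the single-worker step while each worker touches only half the data.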

Question 10 — Modeling

An e-commerce company wants to detect unusual login patterns from IP addresses to identify potential account takeover attacks. Which SageMaker algorithm is specifically designed for this use case?

A) Random Cut Forest
B) K-Means
C) IP Insights
D) XGBoost with engineered IP features

Answer: C — IP Insights is a SageMaker built-in algorithm specifically designed to learn the association between IP addresses and entities (users, accounts). It identifies anomalous IP usage patterns like logins from unusual IP addresses. While Random Cut Forest (A) detects general anomalies, IP Insights is purpose-built for IP-entity association analysis.

Question 11 — Modeling

A data scientist wants to quickly build a sentiment analysis model for product reviews but has no ML expertise. They have 10,000 labeled reviews. Which AWS approach requires the least ML knowledge?

A) Train a BlazingText model on SageMaker
B) Use Amazon Comprehend custom classification
C) Build a custom PyTorch model on SageMaker
D) Use Amazon Lex for sentiment analysis

Answer: B — Amazon Comprehend custom classification allows building custom text classifiers by simply providing labeled training data through the console or API. No ML expertise required. BlazingText (A) requires configuring SageMaker training jobs and hyperparameters. Custom PyTorch (C) requires deep ML knowledge. Lex (D) is for conversational AI, not sentiment analysis.

Question 12 — Modeling

A linear regression model has high training error and high validation error. The residual plot shows a clear curved pattern. What should the data scientist do?

A) Add more regularization to prevent overfitting
B) Add polynomial features or switch to a nonlinear model
C) Collect more training data
D) Remove features to simplify the model

Answer: B — High error on both sets (underfitting) combined with a curved residual pattern indicates the linear model cannot capture the nonlinear relationship in the data. Adding polynomial features allows the linear model to fit curves, or switching to a nonlinear model (XGBoost, neural network) would also work. More regularization (A) or fewer features (D) would make underfitting worse. More data (C) does not help when the model form is wrong.
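The curved-residual symptom and the polynomial-feature fix can be reproduced on a tiny synthetic dataset. The sketch below uses a closed-form least-squares line fit written from scratch; the helper name `fit_line` and the data (an exact quadratic) are assumptions chosen to make the pattern obvious.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # the true relationship is quadratic

# Straight-line fit on the raw feature: residuals trace a U shape.
slope, intercept = fit_line(xs, ys)
residuals_linear = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Same linear fit, but on the engineered feature x**2.
squared = [x ** 2 for x in xs]
slope_sq, intercept_sq = fit_line(squared, ys)
residuals_poly = [y - (slope_sq * s + intercept_sq) for s, y in zip(squared, ys)]
```

The first set of residuals is positive at both ends and negative in the middle, which is the curved pattern the question describes; after adding the squared feature the residuals vanish.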

Question 13 — ML Implementation

A machine learning team needs to deploy a model that will receive image payloads averaging 50 MB. Processing each image takes 30 seconds. The application can tolerate a response delay of up to 2 minutes. Which SageMaker deployment option is best?

A) Real-time endpoint
B) Batch Transform
C) Asynchronous Inference
D) Serverless Inference

Answer: C — Asynchronous Inference supports large payloads (up to 1 GB) and long processing times. It queues requests and notifies when complete via SNS. The 2-minute tolerance makes this a perfect fit. Real-time endpoints (A) have a 60-second timeout and 6 MB payload limit. Batch Transform (B) is for processing entire datasets, not individual requests. Serverless (D) has similar size limits to real-time.

Question 14 — ML Implementation

A company deploys a credit scoring model. Regulations require them to explain why each loan application is approved or denied. Which AWS service provides this capability?

A) SageMaker Debugger
B) SageMaker Clarify
C) SageMaker Model Monitor
D) Amazon CloudWatch

Answer: B — SageMaker Clarify provides feature attribution explanations using SHAP (Shapley Additive Explanations) values. For each prediction, it shows which features contributed most and in which direction, enabling the company to explain individual decisions. Debugger (A) is for training issues. Model Monitor (C) tracks drift, not explanations. CloudWatch (D) tracks operational metrics.

Question 15 — ML Implementation

A SageMaker endpoint hosts a TensorFlow model. The data science team wants to deploy a PyTorch model alongside it on the same endpoint to save costs, routing requests to the appropriate model based on a header. Which feature should they use?

A) Production variants
B) Multi-model endpoints
C) SageMaker Inference Pipelines
D) Multi-container endpoints

Answer: D — Multi-container endpoints allow hosting multiple containers (different frameworks) on the same endpoint, with requests routed to specific containers. This saves costs by sharing infrastructure. Production variants (A) split traffic randomly, not by header. Multi-model endpoints (B) host multiple models of the same framework type. Inference Pipelines (C) chain containers sequentially, not route between them.

Question 16 — Data Engineering

An ML team stores training data in S3. They need to ensure that only specific SageMaker execution roles can access the data, and all access is logged for auditing. Which combination of AWS features achieves this?

A) S3 bucket policies restricting access to specific IAM roles + S3 access logging or CloudTrail
B) S3 public access block + IAM user credentials
C) VPC endpoints only + no bucket policy
D) S3 encryption with KMS + no access restrictions

Answer: A — S3 bucket policies can explicitly allow only specific SageMaker execution roles (by ARN) and deny all others. CloudTrail and S3 access logging provide audit trails of all data access. B uses user credentials (not roles, and less secure). C restricts network path but not who can access. D encrypts data but does not restrict or audit access.
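One way a role-restricting bucket policy might look is sketched below as a Python dictionary serialized to JSON. The bucket name, account ID, and role name are hypothetical placeholders. The sketch assumes the common pattern of an explicit Deny for every principal whose `aws:PrincipalArn` is not in an allow list; the listed roles still need their own Allow permissions through IAM, and a broad Deny like this can lock out administrators, so treat it as an illustration rather than a ready-to-apply policy.

```python
import json

# Hypothetical names for illustration only.
BUCKET = "example-ml-training-data"
ALLOWED_ROLE_ARNS = [
    "arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole",
]

def build_bucket_policy(bucket, allowed_role_arns):
    """Build a policy that denies S3 access to all but the listed roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedRoles",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # The Deny applies when the caller's ARN is NOT in the list.
                "Condition": {
                    "ArnNotEquals": {"aws:PrincipalArn": allowed_role_arns}
                },
            }
        ],
    }

policy_json = json.dumps(build_bucket_policy(BUCKET, ALLOWED_ROLE_ARNS), indent=2)
```

Auditing is configured separately, for example with CloudTrail data events or S3 server access logging on the bucket.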

Question 17 — Modeling

A retail company wants to forecast sales for 1,000 products using 2 years of weekly sales data. Some products were launched recently and have only 3 months of data. Which approach handles this best?

A) Train individual ARIMA models for each product
B) Use SageMaker DeepAR, training on all products simultaneously
C) Use SageMaker Linear Learner with time-based features for each product
D) Use Amazon Forecast with AutoML

Answer: B — DeepAR excels at forecasting multiple related time series simultaneously. Critically, it can learn patterns from products with long history and apply those patterns to new products with limited data (cold-start problem). Individual ARIMA (A) cannot share information across products. Linear Learner (C) requires manual feature engineering. Amazon Forecast (D) could work but the question asks about the best approach, and DeepAR specifically handles the cold-start scenario described.

Question 18 — Exploratory Data Analysis

A data scientist finds that two features in the dataset have a Pearson correlation coefficient of 0.95. They plan to train a logistic regression model. What problem might this cause and how should they address it?

A) No problem — logistic regression handles correlated features well
B) Multicollinearity — unstable coefficients. Remove one feature or apply PCA
C) Data leakage — re-collect the data
D) Class imbalance — apply SMOTE

Answer: B — A correlation of 0.95 indicates severe multicollinearity. In logistic regression, this causes unstable coefficient estimates (small data changes lead to wildly different coefficients), making the model unreliable and difficult to interpret. Solutions: remove one of the correlated features, combine them via PCA, or use regularization (L1/L2). This is not data leakage or class imbalance.
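A quick screen for highly correlated feature pairs can be written with the standard library alone. The sketch below is illustrative: the helper names, the 0.9 threshold, and the sample feature values are assumptions, and the Pearson formula assumes neither feature is constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def highly_correlated_pairs(features, threshold=0.9):
    """List (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Illustrative data only.
features = {
    "income": [30.0, 45.0, 60.0, 80.0, 100.0],
    "spend": [10.0, 16.0, 20.0, 27.0, 33.0],
    "age": [25.0, 60.0, 31.0, 52.0, 40.0],
}
flagged = highly_correlated_pairs(features)
```

Each flagged pair is a candidate for dropping one feature, combining the two, or relying on regularization, as the answer suggests.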

Question 19 — ML Implementation

A company wants to reduce ML inference latency for a model deployed on IoT edge devices. The model was trained in TensorFlow on SageMaker. Which service should they use to optimize the model for edge hardware?

A) SageMaker Debugger
B) SageMaker Neo
C) AWS Lambda
D) SageMaker Autopilot

Answer: B — SageMaker Neo compiles trained models to optimized code for specific target hardware (ARM, x86, NVIDIA, Intel). This reduces model size and improves inference speed on edge devices. Debugger (A) analyzes training issues. Lambda (C) runs in the cloud, not on edge. Autopilot (D) is for AutoML training, not deployment optimization.

Question 20 — Modeling

A data scientist is training an image classification model on SageMaker. The training data consists of 100,000 images stored in S3. They want to minimize training cost while maintaining performance. Which approach is best?

A) Use the largest available GPU instance for fastest completion
B) Use SageMaker Managed Spot Training with checkpointing enabled
C) Use CPU instances since they are cheaper per hour
D) Reduce the image resolution to 32x32 pixels

Answer: B — Managed Spot Training uses EC2 Spot Instances at up to 90% discount. Checkpointing saves progress so training can resume if a spot instance is interrupted. This provides the biggest cost savings without sacrificing model quality. The largest instance (A) is fastest but most expensive. CPU instances (C) are much slower for image training, often costing more total. Reducing resolution (D) would significantly hurt image classification performance.

Question 21 — Data Engineering

A company has duplicate customer records across multiple databases. Before building an ML model, they need to identify and merge duplicate records. The duplicates are not exact matches (different spellings, abbreviations). Which AWS service provides ML-powered deduplication?

A) AWS Glue FindMatches
B) Amazon Comprehend
C) AWS Entity Resolution
D) Amazon Macie

Answer: A — AWS Glue FindMatches uses ML to identify matching records even with inconsistencies (fuzzy matching). You provide examples of matching and non-matching pairs, and it learns to find duplicates across the dataset. Comprehend (B) is for NLP tasks. While AWS Entity Resolution (C) also handles this, FindMatches is the established answer on the ML Specialty exam. Macie (D) is for data security/privacy.

Question 22 — Modeling

A neural network model has high variance (overfitting). Which combination of regularization techniques is most likely to help? (Select TWO)

A) Increase learning rate
B) Add dropout layers (rate 0.3)
C) Add more hidden layers
D) Apply L2 regularization
E) Remove the validation set to use more training data

Answer: B and D — Dropout (B) randomly disables neurons during training, forcing the network to learn robust features and preventing co-adaptation. L2 regularization (D) penalizes large weights, constraining model complexity. Together, they are the standard approach for neural network overfitting. Increasing learning rate (A) may cause training instability. More layers (C) increases complexity, worsening overfitting. Removing validation (E) eliminates the ability to detect overfitting.
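The two techniques in the answer can be sketched without a deep learning framework. The snippet below is a simplified illustration, assuming the common "inverted dropout" formulation and a plain sum-of-squares L2 penalty; the function names, the seed, and the sample values are hypothetical.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout for training time.

    Each activation is zeroed with probability `rate`; survivors are
    scaled by 1 / (1 - rate) so the expected value stays the same.
    """
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def l2_penalty(weights, lam):
    """L2 regularization term that is added to the training loss."""
    return lam * sum(w * w for w in weights)

rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, rate=0.3, rng=rng)
penalty = l2_penalty([3.0, 4.0], lam=0.01)
```

At inference time dropout is switched off, while the L2 term only affects training through its contribution to the loss gradient.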

Question 23 — ML Implementation

A team needs an automated ML workflow that retrains their model weekly with new data, evaluates the model, and only deploys it if accuracy exceeds 95%. Which SageMaker feature supports this?

A) SageMaker Autopilot
B) SageMaker Pipelines with a Condition step
C) SageMaker Experiments
D) SageMaker Studio Notebooks with a cron job

Answer: B — SageMaker Pipelines defines ML workflows as DAGs with steps for processing, training, evaluation, and conditional logic. A Condition step can check if the evaluated accuracy exceeds 95% and only proceed to the model registration and deployment steps if it does. Autopilot (A) explores algorithms but does not provide conditional deployment. Experiments (C) tracks runs but does not orchestrate workflows. Notebooks with cron (D) is fragile and not production-grade.

Question 24 — Exploratory Data Analysis

A time-series dataset for stock price prediction is split randomly into 80% training and 20% validation. The model shows excellent validation performance. However, when deployed, it performs poorly on new data. What went wrong?

A) The model is underfitting
B) Temporal data leakage — random splitting allowed future data to influence predictions of past dates
C) The dataset is too small
D) The features need scaling

Answer: B — Random splitting of time-series data is a critical mistake. It allows the model to "see the future" during training, artificially inflating validation metrics. Time-series data must always be split chronologically (train on past, validate on future). The excellent validation performance was illusory because future information leaked into the training process.
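A chronological split is short to write by hand. The sketch below assumes each row is a `(date, value)` tuple with ISO-formatted date strings, which sort correctly as text; the helper name and the sample prices are illustrative.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows so every training row precedes every validation row."""
    # Sort by the timestamp in position 0 before cutting.
    ordered = sorted(rows, key=lambda row: row[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Illustrative data only, deliberately out of order.
prices = [
    ("2024-01-05", 101.2),
    ("2024-01-02", 100.0),
    ("2024-01-04", 99.5),
    ("2024-01-03", 100.8),
    ("2024-01-08", 102.1),
]
train, validation = chronological_split(prices)
```

Because the cut is made after sorting, the latest training date is always earlier than the earliest validation date, which is the property a random split fails to guarantee.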

Question 25 — ML Implementation

A company wants to use a custom R script for data preprocessing in their SageMaker ML pipeline. SageMaker does not provide a pre-built R container. What is the best approach?

A) Rewrite the script in Python
B) Build a custom Docker container with R and push it to Amazon ECR, then use it in a SageMaker Processing job
C) Run the R script on an EC2 instance outside SageMaker
D) Use AWS Lambda with an R runtime layer

Answer: B — The Bring Your Own Container (BYOC) approach allows using any language or framework with SageMaker. Build a Docker image with R and required libraries, push to ECR, and reference it in SageMaker Processing or Training jobs. This keeps the entire pipeline within SageMaker. Rewriting in Python (A) may not be feasible. EC2 (C) breaks the SageMaker pipeline integration. Lambda (D) has size and time limits unsuitable for data processing.

Score Yourself

Count your correct answers out of 25:

22-25 correct (88-100%) — Excellent! You are very likely to pass the real exam.
18-21 correct (72-84%) — Good. You are at or above the passing threshold. Review the topics you missed.
14-17 correct (56-68%) — Close but risky. Focus study time on your weakest domains before taking the real exam.
Below 14 (under 56%) — More study needed. Review the domain lessons thoroughly and take Practice Exam 2 after studying.
