Advanced
Practice Exam 2
25 additional exam-style questions with different scenarios from Practice Exam 1. Covers all 4 domains with detailed explanations for every answer. Target time: 70 minutes.
Best practice: Take this exam at least 2 days after Practice Exam 1. Use the gap to study topics you got wrong in the first exam. A passing score is roughly 18/25 (72%).
Question 1 — Data Engineering
A company stores 5 TB of training data in Amazon S3 in CSV format. Athena queries for data exploration are slow and expensive. What is the most effective way to improve query performance and reduce costs?
A) Move data to Amazon Redshift
B) Convert CSV to Parquet format and partition by frequently filtered columns
C) Increase Athena query timeout
D) Create multiple smaller S3 buckets
Answer: B — Parquet is columnar, so Athena reads only the columns needed (instead of full rows in CSV). Partitioning means Athena only scans relevant subsets of data. Together, these can reduce data scanned (and cost) by 90%+. Redshift (A) adds infrastructure cost. Timeout (C) does not address performance. Smaller buckets (D) have no effect on query performance.
Question 2 — Data Engineering
A healthcare system needs to collect patient monitoring data from 10,000 IoT devices in real time, detect anomalous vital signs immediately, and store all data in S3 for historical analysis. Which architecture meets all requirements?
A) IoT Core → Kinesis Data Streams → Kinesis Data Analytics (anomaly detection with RANDOM_CUT_FOREST) + Kinesis Firehose (delivery to S3)
B) IoT Core → SQS → Lambda (anomaly check) → S3
C) IoT Core → DynamoDB → DynamoDB Streams → Lambda
D) IoT Core → S3 direct → Athena scheduled query for anomaly detection
Answer: A — This architecture handles all three requirements: Kinesis Data Streams ingests high-throughput IoT data, Kinesis Data Analytics provides real-time anomaly detection using the built-in RANDOM_CUT_FOREST algorithm, and Kinesis Firehose delivers all data to S3 for historical storage. SQS (B) is not designed for streaming analytics. DynamoDB (C) adds unnecessary complexity. Direct S3 + Athena (D) is not real-time.
Question 3 — Data Engineering
A data engineer needs to create a data pipeline that runs every day at midnight: extract data from an RDS database, transform it, and load it into S3 for SageMaker training. The pipeline should handle failures with automatic retries. Which service is best?
A) AWS Glue Workflows with triggers and crawlers
B) Lambda functions triggered by CloudWatch Events
C) Amazon MWAA (Managed Apache Airflow)
D) Step Functions with Glue ETL jobs
Answer: A — Glue Workflows orchestrate Glue crawlers and ETL jobs with triggers (scheduled or event-based), dependency management, and automatic retry logic. This is the most integrated solution for the described ETL pipeline. Lambda (B) has 15-minute timeout limits. MWAA (C) is powerful but more complex to set up for a simple daily pipeline. Step Functions (D) could work but adds orchestration complexity on top of Glue.
Question 4 — Exploratory Data Analysis
A dataset for predicting house prices has features: square footage, number of bedrooms, number of bathrooms, and age of house. The data scientist notices that square footage ranges from 500-10,000 while bedrooms range from 1-6. They plan to use KNN for prediction. What preprocessing step is essential?
A) One-hot encode the number of bedrooms
B) Apply feature scaling (normalization or standardization) to all features
C) Apply log transformation to all features
D) No preprocessing needed — KNN handles different scales
Answer: B — KNN uses distance calculations (Euclidean). Without scaling, square footage (range 9,500) would dominate the distance calculation, making bedrooms (range 5) nearly irrelevant. Feature scaling ensures all features contribute equally to distance. D is incorrect because KNN is one of the algorithms MOST sensitive to feature scales.
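The distance-domination effect described above is easy to verify. A minimal sketch with hypothetical house values (scikit-learn and NumPy assumed available):

```python
# Why scaling matters for KNN: compare Euclidean distances before and after.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses as [square_footage, bedrooms].
a = np.array([2000.0, 3.0])
b = np.array([2000.0, 4.0])  # one extra bedroom
c = np.array([2500.0, 3.0])  # 500 extra square feet

# Unscaled: square footage dominates completely.
print(np.linalg.norm(a - b))  # distance 1.0 (bedroom difference)
print(np.linalg.norm(a - c))  # distance 500.0 (sqft difference)

# Standardized: both differences contribute comparably to distance.
Xs = StandardScaler().fit_transform(np.array([a, b, c]))
print(round(float(np.linalg.norm(Xs[0] - Xs[1])), 2))
print(round(float(np.linalg.norm(Xs[0] - Xs[2])), 2))
```

After standardization the two scaled distances come out equal for this toy data, whereas the raw distances differ by a factor of 500 — exactly why KNN needs scaled inputs.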
Question 5 — Exploratory Data Analysis
A binary classification dataset has 10,000 samples of class A and 50 samples of class B. A data scientist needs to create training and validation sets. Which splitting strategy is correct?
A) Random 80/20 split
B) Stratified split maintaining the class ratio in both sets
C) Put all class B samples in the training set
D) Oversample class B to 10,000 samples first, then random split
Answer: B — Stratified splitting ensures both training and validation sets maintain the original class distribution. With only 50 class B samples, a random split could result in the validation set having very few or no class B examples, making evaluation unreliable. Oversampling (D) should happen AFTER splitting, only on the training set, to prevent data leakage.
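A stratified split with exactly these class counts can be sketched in scikit-learn (synthetic data standing in for the real features):

```python
# Stratified 80/20 split preserving a ~0.5% minority-class ratio.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10_050).reshape(-1, 1)          # placeholder features
y = np.array([0] * 10_000 + [1] * 50)         # 10,000 class A, 50 class B

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# stratify=y guarantees 40 minority samples in train, 10 in validation.
print(int(y_tr.sum()), int(y_val.sum()))
```

Without `stratify=y`, the 10-sample minority allocation in the validation set would be a matter of luck; with it, the split is guaranteed.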
Question 6 — Exploratory Data Analysis
A data scientist notices that a "transaction_amount" feature has values ranging from $0.01 to $5,000,000 with a long right tail. The target variable is binary (fraud/not fraud). They plan to use logistic regression. Which transformation is most appropriate?
A) Min-max scaling to [0,1]
B) Standardization (z-score)
C) Log transformation followed by standardization
D) Binning into 10 equal-width bins
Answer: C — The extreme right skew (range from $0.01 to $5M) means log transformation should be applied first to compress the tail and create a more normal distribution. Then standardization prepares it for logistic regression. Min-max scaling (A) would compress most values near 0 due to the extreme range. Standardization alone (B) does not fix skewness. Equal-width bins (D) would put most values in the first bin.
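A minimal sketch of the log-then-standardize pipeline, using synthetic lognormal values as a stand-in for transaction amounts:

```python
# Log transform compresses the heavy right tail; standardization then
# centers and scales for logistic regression.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=4, sigma=2, size=10_000)  # heavily right-skewed

def skewness(x):
    """Sample skewness (third standardized moment)."""
    return float(np.mean(((x - x.mean()) / x.std()) ** 3))

print(round(skewness(amounts), 1))   # large positive skew

logged = np.log1p(amounts)           # log1p handles values near zero safely
print(round(skewness(logged), 2))    # near-symmetric after the log

z = StandardScaler().fit_transform(logged.reshape(-1, 1))
print(round(float(z.mean()), 6), round(float(z.std()), 6))  # ~0 mean, unit std
```

`np.log1p` is used instead of `np.log` so that amounts close to $0.01 do not blow up to large negative values.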
Question 7 — Modeling
A company wants to extract text from scanned invoices, including table structures and form fields. Which AWS service is most appropriate?
A) Amazon Comprehend
B) Amazon Textract
C) Amazon Rekognition
D) SageMaker Object Detection
Answer: B — Amazon Textract is specifically designed for document analysis. It extracts text, tables, and forms from scanned documents, understanding the structure (which text belongs to which table cell, which value corresponds to which form field). Comprehend (A) analyzes text meaning, not document structure. Rekognition (C) detects objects in images, not document text. Object Detection (D) finds bounding boxes, not text content.
Question 8 — Modeling
A data scientist needs to segment customers into groups for targeted marketing. They have customer demographics and purchase history but no predefined group labels. Which approach is correct?
A) SageMaker XGBoost with customer satisfaction as the label
B) SageMaker K-Means clustering
C) Amazon Personalize
D) SageMaker Linear Learner for regression
Answer: B — Customer segmentation without predefined labels is an unsupervised learning problem. K-Means clustering groups similar customers together based on feature similarity. XGBoost (A) and Linear Learner (D) require labels (supervised). Personalize (C) is for recommendations, not segmentation.
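The clustering idea can be sketched locally with plain scikit-learn K-Means (the SageMaker built-in works the same way conceptually), on synthetic customer features:

```python
# Unsupervised segmentation: no labels, K-Means groups similar customers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Three synthetic segments as [annual_spend, visits_per_month].
seg_a = rng.normal([200, 2], [30, 0.5], size=(100, 2))
seg_b = rng.normal([1500, 8], [200, 1.0], size=(100, 2))
seg_c = rng.normal([600, 20], [80, 2.0], size=(100, 2))
X = StandardScaler().fit_transform(np.vstack([seg_a, seg_b, seg_c]))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Each synthetic segment should map almost entirely to one cluster.
for start in (0, 100, 200):
    counts = np.bincount(labels[start:start + 100], minlength=3)
    print(int(counts.max()))  # dominant cluster size within each segment
```

Note that the features are standardized first — K-Means is distance-based, so the same scaling caveat from Question 4 applies here.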
Question 9 — Modeling
An autonomous driving company needs to identify and locate every object in dashcam images (cars, pedestrians, traffic signs) and classify each pixel. Which SageMaker algorithm should they use?
A) Image Classification
B) Object Detection
C) Semantic Segmentation
D) Random Cut Forest
Answer: C — "Classify each pixel" is the defining characteristic of semantic segmentation. It assigns a label to every pixel in the image (road, car, pedestrian, sky). Image Classification (A) labels the entire image with one category. Object Detection (B) finds bounding boxes but does not classify individual pixels. RCF (D) is for anomaly detection on tabular data.
Question 10 — Modeling
A model for predicting credit default uses 200 features. After training, the data scientist wants to understand which features have the most impact on predictions to explain decisions to regulators. Which technique should they use?
A) PCA to find the most important components
B) SHAP (Shapley Additive Explanations) values via SageMaker Clarify
C) t-SNE visualization
D) Correlation analysis
Answer: B — SHAP values provide per-prediction feature importance, showing which features pushed each individual prediction higher or lower. SageMaker Clarify computes SHAP values and generates explainability reports. PCA (A) transforms features, making them less interpretable. t-SNE (C) is for visualization. Correlation (D) shows feature relationships, not prediction contributions.
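Computing true SHAP values requires Clarify or the open-source shap package; as a rough local illustration of feature attribution, here is permutation importance (a simpler, global technique — not SHAP itself) on synthetic data where only one feature carries signal:

```python
# Permutation importance: shuffle one feature and measure the score drop.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)   # target depends only on feature 0

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

# Feature 0's importance dwarfs the noise features.
print(int(np.argmax(result.importances_mean)))
```

SHAP goes further than this: it attributes each individual prediction to its features, which is what regulators typically want for per-decision explanations.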
Question 11 — Modeling
A company wants to build a personalized product recommendation system for their e-commerce platform. They have user browsing history, purchase history, and product metadata. They want minimal ML expertise required. Which service should they use?
A) SageMaker Factorization Machines
B) SageMaker KNN
C) Amazon Personalize
D) SageMaker Object2Vec
Answer: C — Amazon Personalize is a fully managed recommendation service that requires minimal ML expertise. It automatically selects the best algorithm, handles feature engineering, trains the model, and provides real-time recommendations via API. The "minimal ML expertise" requirement is the key signal. Factorization Machines (A), KNN (B), and Object2Vec (D) all require SageMaker ML expertise to build, train, and tune models yourself.
Question 12 — Modeling
A SageMaker Automatic Model Tuning job for XGBoost has been running for 100 jobs. The data scientist notices the objective metric has plateaued after 60 jobs with no improvement. What should they do for future tuning?
A) Continue running more jobs with the same search space
B) Use warm start from the current tuning job but expand or modify the hyperparameter search ranges
C) Switch from Bayesian to grid search
D) Disable tuning and use default hyperparameters
Answer: B — Warm start carries forward knowledge from previous tuning jobs, avoiding re-exploration of already-tested areas. Expanding or modifying search ranges gives the optimizer new territory to explore. Continuing with the same space (A) wastes compute. Grid search (C) is less efficient. Default parameters (D) ignores the optimization opportunity.
Question 13 — ML Implementation
A SageMaker real-time endpoint serves a fraud detection model. The operations team notices that prediction latency spikes when multiple models in different containers need to be loaded. The endpoint hosts 100 different models (one per merchant). Which SageMaker feature is designed for this pattern?
A) Multi-container endpoints
B) Production variants
C) Multi-model endpoints
D) Inference Pipelines
Answer: C — Multi-model endpoints are designed to host hundreds or thousands of models on a single endpoint, dynamically loading and unloading models as needed. This is the standard pattern for one-model-per-tenant architectures. It shares compute resources and reduces costs compared to one endpoint per model. Multi-container endpoints (A) host containers for different frameworks side by side. Production variants (B) are for A/B testing. Inference Pipelines (D) chain processing steps.
Question 14 — ML Implementation
A machine learning model needs to preprocess input data (feature scaling, encoding) before inference. The preprocessing logic is complex and implemented in a scikit-learn pipeline. How should this be deployed with the model on SageMaker?
A) Implement preprocessing in the client application before calling the endpoint
B) Use a SageMaker Inference Pipeline with a preprocessing container followed by the model container
C) Include preprocessing in the model container's inference script
D) Use a Lambda function for preprocessing, then call the SageMaker endpoint
Answer: B — SageMaker Inference Pipelines chain up to 15 containers sequentially, allowing a preprocessing container (scikit-learn) to transform data before passing it to the model container. This keeps preprocessing logic encapsulated, versioned, and co-deployed with the model. Client-side preprocessing (A) creates training-serving skew. Lambda (D) adds latency and maintenance burden. Including in the model container (C) works but mixes concerns and complicates updates.
Question 15 — ML Implementation
A company requires all ML training data to be encrypted at rest using customer-managed keys, and the training job must not have any internet access. Which SageMaker configuration is needed?
A) Enable SSL and use S3 default encryption
B) Specify a KMS key in the training job configuration and enable VPC mode with no internet gateway and no NAT gateway
C) Use S3 Glacier for encrypted storage
D) Enable network isolation mode on the training job
Answer: B — Customer-managed KMS keys in the training job configuration encrypt training volumes and artifacts. VPC mode without internet/NAT gateways ensures no internet access while still allowing access to S3 and other AWS services through VPC endpoints. Network isolation (D) prevents ALL network access, including to S3, which would prevent the job from accessing training data.
Question 16 — Data Engineering
A company receives data files from partners via SFTP. They need to automatically process each file (clean, transform, validate) and add it to their S3 data lake when it arrives. Which approach is most serverless?
A) AWS Transfer Family (SFTP) → S3 → S3 event notification → Lambda (trigger Glue ETL job)
B) EC2 instance running SFTP server → cron job to process files
C) Partner uploads directly to S3 → manual processing
D) AWS DataSync on a schedule
Answer: A — AWS Transfer Family provides a fully managed SFTP service that stores files directly in S3. S3 event notifications trigger a Lambda function when new files arrive, which starts a Glue ETL job for processing. This is entirely serverless and event-driven. EC2 (B) requires server management. Manual processing (C) is not automated. DataSync (D) is for bulk data migration, not event-driven processing.
Question 17 — Modeling
A data scientist trains a neural network for image classification. Training accuracy reaches 99% but validation accuracy is stuck at 72%. They have already added L2 regularization with no improvement. What should they try next?
A) Increase the learning rate
B) Add dropout layers and apply data augmentation
C) Remove the validation set
D) Add more layers to the network
Answer: B — The 27% gap indicates severe overfitting that L2 alone cannot fix. Dropout is complementary to L2 (regularizes differently by dropping random neurons). Data augmentation artificially increases training diversity (random rotations, flips, crops for images), which is one of the most effective ways to combat overfitting in image models. More layers (D) would increase overfitting. Removing validation (C) hides the problem.
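Dropout's mechanics can be sketched in a few lines of NumPy (this is "inverted" dropout, the variant most frameworks implement):

```python
# Inverted dropout: zero a random fraction of activations during training,
# scale the survivors so the expected activation is unchanged.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop, rng):
    """Drop each unit with probability p_drop; rescale by 1/(1 - p_drop)."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones(100_000)
out = dropout(a, p_drop=0.5, rng=rng)

print(round(float(out.mean()), 2))      # expectation preserved, ~1.0
print(round(float((out == 0).mean()), 2))  # ~half the units dropped
```

Because each forward pass sees a different random sub-network, no single neuron can be relied upon, which is what regularizes the model. At inference time dropout is disabled and no rescaling is needed (the 1/(1-p) factor already handled it during training).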
Question 18 — Exploratory Data Analysis
A data scientist wants to use SageMaker to prepare data for multiple ML projects. They need to ensure that the same feature transformations are applied consistently during both training and real-time inference across teams. Which SageMaker capability should they use?
A) SageMaker Processing jobs
B) SageMaker Data Wrangler
C) SageMaker Feature Store
D) SageMaker Experiments
Answer: C — SageMaker Feature Store provides a centralized repository with both an online store (low-latency lookups for inference) and an offline store (bulk retrieval for training). This ensures training-serving consistency and allows feature reuse across teams and projects. Processing jobs (A) compute features but do not store them centrally. Data Wrangler (B) prepares data but does not provide a shared store. Experiments (D) tracks training runs, not features.
Question 19 — ML Implementation
A model monitoring system detects that the distribution of a key input feature has shifted significantly compared to the training baseline. Model accuracy has not dropped yet. What should the team do?
A) Immediately retrain the model
B) Ignore it since accuracy has not dropped
C) Investigate the data drift, begin collecting new labeled data for potential retraining, and increase monitoring frequency
D) Roll back to the previous model version
Answer: C — Data drift is an early warning sign. Accuracy may not have dropped YET, but concept drift often follows data drift. The prudent approach is to investigate the cause, start collecting labeled data from the new distribution for retraining, and increase monitoring. Ignoring it (B) risks a sudden accuracy drop. Immediate retraining (A) may be premature without new labeled data. Rolling back (D) does not help if the old model faces the same drift.
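The kind of distribution check Model Monitor runs against a baseline can be approximated locally with a two-sample Kolmogorov-Smirnov test (SciPy assumed available; synthetic baseline and "live" data):

```python
# Detecting input-feature drift: compare live traffic to the training baseline.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
baseline = rng.normal(loc=50, scale=10, size=5000)  # training-time distribution
live = rng.normal(loc=65, scale=10, size=5000)      # shifted live traffic

stat, p_value = ks_2samp(baseline, live)
print(p_value < 0.01)  # True: the distributions differ — investigate drift
```

A tiny p-value here says only that the input distribution shifted, not that accuracy dropped — which is exactly the situation in the question: an early warning that justifies investigation and label collection, not necessarily immediate retraining.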
Question 20 — Modeling
A company wants to automatically translate customer reviews from 20 languages into English for analysis. They need a solution with no ML expertise and minimal setup. Which service should they use?
A) SageMaker Seq2Seq
B) Amazon Translate
C) Amazon Comprehend
D) BlazingText
Answer: B — Amazon Translate provides neural machine translation as a managed service with no ML expertise required. It supports 75+ languages, handles batch and real-time translation, and requires no model training. Seq2Seq (A) requires training a custom model. Comprehend (C) analyzes text meaning but does not translate. BlazingText (D) is for text classification and word embeddings.
Question 21 — Data Engineering
A SageMaker training job on a 200 GB dataset stored in S3 takes 30 minutes just to download data before training begins (using File mode). How can the team eliminate this download wait time?
A) Use a larger instance with more network bandwidth
B) Convert data to RecordIO format and use Pipe mode
C) Move data to EBS volume attached to the training instance
D) Compress the data with gzip
Answer: B — Pipe mode streams data directly from S3 to the training algorithm without downloading the entire dataset first. Training starts immediately as data streams in. RecordIO (protobuf) is the optimal format for pipe mode with SageMaker built-in algorithms. Larger instances (A) are faster at downloading but still wait. EBS (C) still requires copying data. Gzip (D) reduces transfer time but adds decompression overhead.
Question 22 — Modeling
A content moderation team needs to automatically detect inappropriate images uploaded to their platform. They need to classify images into categories (explicit, suggestive, violence, safe) with confidence scores. Which approach requires the least effort?
A) Train a custom SageMaker Image Classification model on labeled inappropriate images
B) Use Amazon Rekognition Content Moderation
C) Build a custom CNN on EC2 with TensorFlow
D) Use SageMaker Object Detection to find inappropriate objects
Answer: B — Amazon Rekognition Content Moderation is a pre-built, fully managed API that detects inappropriate content in images with confidence scores across multiple categories. No training required, no ML expertise needed. Custom models (A, C) require labeled training data and ML knowledge. Object Detection (D) finds objects but is not designed for content moderation categories.
Question 23 — ML Implementation
A team has a SageMaker notebook that works well for training models but wants to move to a production system with version control, automated testing, and reproducible training runs. Which combination of SageMaker features should they adopt?
A) SageMaker Studio only
B) SageMaker Pipelines + Model Registry + Experiments
C) SageMaker Autopilot for everything
D) AWS CodePipeline without SageMaker integration
Answer: B — SageMaker Pipelines provides reproducible, automated ML workflows (CI/CD for ML). Model Registry manages model versions with approval workflows. Experiments tracks training runs with parameters, metrics, and artifacts for reproducibility. Together, these form a production MLOps system. Studio (A) is an IDE, not an MLOps solution. Autopilot (C) is for AutoML, not pipeline management. CodePipeline without SageMaker (D) misses ML-specific capabilities.
Question 24 — Exploratory Data Analysis
A data scientist discovers that adding a "future_sales" column (next month's actual sales) as a feature dramatically improves their sales prediction model's validation performance. What is the problem?
A) Overfitting — remove the feature and add regularization
B) Data leakage — the feature contains information from the future that would not be available at prediction time
C) Underfitting — the model was too simple before
D) No problem — more features improve models
Answer: B — This is a textbook example of data leakage (also called target leakage). The "future_sales" column contains the very information the model is trying to predict. It would not be available at inference time when making real predictions. The artificially high validation performance is meaningless and the model would fail in production. Always ask: "Would this feature be available at the time of prediction?"
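The effect is easy to reproduce. The sketch below (pure NumPy, synthetic data) adds a "leaky" feature that is essentially the target itself, the way "future_sales" is for a sales model, and compares validation R² with and without it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(scale=2.0, size=n)  # noisy target

# Leaky feature: nearly identical to the target (like "future_sales")
leak = y + rng.normal(scale=0.01, size=n)

def val_r2(X, y, split=400):
    """Fit OLS on the first `split` rows, return R^2 on the rest."""
    Xtr, Xva, ytr, yva = X[:split], X[split:], y[:split], y[split:]
    A = np.c_[Xtr, np.ones(len(Xtr))]          # add intercept column
    w, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    pred = np.c_[Xva, np.ones(len(Xva))] @ w
    ss_res = np.sum((yva - pred) ** 2)
    ss_tot = np.sum((yva - yva.mean()) ** 2)
    return 1 - ss_res / ss_tot

honest = val_r2(x, y)                  # realistic score, limited by noise
leaky = val_r2(np.c_[x, leak], y)      # near-perfect, but meaningless
print(f"honest R^2: {honest:.3f}, leaky R^2: {leaky:.3f}")
```

The leaky model's near-perfect validation score would evaporate in production, where next month's sales obviously cannot be fed in as a feature.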
Question 25 — ML Implementation
A company running SageMaker training jobs wants to reduce costs by up to 90% without modifying their training code. The jobs can handle occasional interruptions. Which approach should they use?
A) Use smaller instance types
B) SageMaker Managed Spot Training with checkpointing
C) Reduce the training dataset size
D) Use SageMaker Savings Plans
Answer: B — Managed Spot Training uses EC2 Spot Instances at up to 90% discount. SageMaker handles spot interruptions automatically and resumes from checkpoints. No code changes needed — just enable spot training and specify a checkpoint S3 path in the training job configuration. Smaller instances (A) save less than 90%. Reducing data (C) hurts model quality. Savings Plans (D) provide up to 64% savings, not 90%.
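The configuration behind option B can be sketched as the relevant fields of a SageMaker `CreateTrainingJob` request; only these fields change versus an on-demand job, and the S3 URI and timeouts below are placeholders.

```python
# Sketch of the CreateTrainingJob fields that enable Managed Spot
# Training with checkpointing. URI and timeouts are placeholders.
spot_settings = {
    "EnableManagedSpotTraining": True,
    "CheckpointConfig": {
        # SageMaker syncs checkpoints here during training and restores
        # them when an interrupted job resumes on a new Spot instance
        "S3Uri": "s3://example-bucket/checkpoints/",  # placeholder
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,   # cap on actual training time
        # Must be >= MaxRuntimeInSeconds: training time plus time spent
        # waiting for Spot capacity
        "MaxWaitTimeInSeconds": 7200,
    },
}
```

Note that no training code changes: the algorithm just needs to write and load checkpoints, and SageMaker handles the interruption and resume logic.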
Score Yourself
Count your correct answers out of 25:
- 22-25 correct (88-100%) — You are ready for the real exam. Schedule it with confidence.
- 18-21 correct (72-84%) — You are at the passing threshold. Review missed topics and consider one more week of study.
- 14-17 correct (56-68%) — Go back to the domain lessons for your weak areas. Retake this exam after studying.
- Below 14 (under 56%) — Thorough review needed. Work through all domain lessons again before retaking.
Comparing scores: If you improved from Practice Exam 1, you are on the right track. If you scored similarly or lower, identify which domains you are struggling with and review those lessons specifically.
Lilly Tech Systems