Domain 1: Data Engineering (20%)

Master the AWS services and architectural patterns for building ML data pipelines — S3 data lakes, Kinesis streaming, Glue ETL, Athena queries, and EMR clusters.

💡
Exam weight: This domain accounts for 20% of your score (~13 questions). Focus on understanding when to use each service, not just what it does.

Amazon S3 for ML Data

Amazon S3 is the backbone of every ML data pipeline on AWS. For the exam, you need to know:

S3 Storage Classes for ML

  • S3 Standard — Active training data that is accessed frequently. Default choice for ML datasets.
  • S3 Intelligent-Tiering — Data with unpredictable access patterns (experimental datasets).
  • S3 Glacier — Long-term archival of raw data or old model artifacts. Not for active training.

S3 Data Formats for ML

The exam frequently tests which data format to use:

  • CSV — Simple tabular data. Human-readable but slow for large datasets. No schema enforcement.
  • Parquet — Columnar format. Best for analytics queries and when you need only specific columns. Preferred by Athena and Glue.
  • RecordIO (protobuf) — SageMaker's preferred format for built-in algorithms. Supports pipe mode for streaming data directly to training jobs.
  • JSON/JSONL — Semi-structured data. Common for NLP tasks and API data.
  • TFRecord — TensorFlow's native format. Used when running TensorFlow on SageMaker.

Exam trap: When a question asks about the "most efficient" data format for SageMaker built-in algorithms, the answer is almost always RecordIO (protobuf) with pipe mode. This streams data directly from S3 without downloading to disk first.

Amazon Kinesis for Streaming ML

Kinesis is AWS's family of real-time streaming services. The exam tests all four:

Kinesis Data Streams

  • Real-time data ingestion at scale (clickstreams, IoT, logs)
  • You manage shards (each shard = 1 MB/s or 1,000 records/s in, 2 MB/s out)
  • Data retained for 24 hours (default) up to 365 days
  • Consumers: Lambda, KCL applications, Kinesis Data Analytics
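The per-shard limits above make shard sizing simple arithmetic: take the larger of the throughput-based and record-rate-based counts. A back-of-envelope sketch (the numbers in the example are hypothetical):

```python
# Shard sizing for Kinesis Data Streams.
# Each shard accepts up to 1 MB/s (or 1,000 records/s) in and serves 2 MB/s out.
import math

def shards_needed(mb_per_sec_in: float, records_per_sec: int) -> int:
    """Shards required to absorb a given ingest rate."""
    by_throughput = math.ceil(mb_per_sec_in / 1.0)   # 1 MB/s per shard
    by_records = math.ceil(records_per_sec / 1000)   # 1,000 records/s per shard
    return max(by_throughput, by_records, 1)

# e.g. 4.5 MB/s of clickstream data arriving as 3,000 records/s
print(shards_needed(4.5, 3000))  # 5 (throughput is the binding limit)
```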

Kinesis Data Firehose

  • Fully managed delivery stream — no shard management
  • Delivers to S3, Redshift, OpenSearch (formerly Elasticsearch), or Splunk
  • Can transform data with Lambda before delivery
  • Near real-time (minimum 60-second buffer interval)
  • Exam favorite: When the question says "easiest way to load streaming data into S3," the answer is Firehose
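When you do write to Firehose yourself (e.g. from a producer application), the PutRecordBatch API accepts at most 500 records and 4 MiB per call, per AWS's documented limits. A sketch of batching under those limits (payloads are illustrative; the actual delivery call is shown commented since it needs AWS credentials):

```python
# Sketch: group records into batches that each fit one Firehose
# PutRecordBatch call (max 500 records / 4 MiB per call).
def batch_for_firehose(records, max_records=500, max_bytes=4 * 1024 * 1024):
    """Yield lists of byte payloads sized for a single PutRecordBatch call."""
    batch, size = [], 0
    for rec in records:
        if batch and (len(batch) >= max_records or size + len(rec) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += len(rec)
    if batch:
        yield batch

payloads = [b'{"click": %d}\n' % i for i in range(1200)]
batches = list(batch_for_firehose(payloads))
print([len(b) for b in batches])  # [500, 500, 200]

# Delivery would then be, per batch:
# boto3.client("firehose").put_record_batch(
#     DeliveryStreamName="clickstream-to-s3",  # placeholder name
#     Records=[{"Data": r} for r in batch])
```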

Kinesis Data Analytics

  • Run SQL queries or Apache Flink applications on streaming data
  • Real-time anomaly detection using RANDOM_CUT_FOREST SQL function
  • Sliding window aggregations for time-series analysis
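The RANDOM_CUT_FOREST function is invoked from Kinesis Data Analytics SQL roughly as below. This is an illustrative shape only: stream, pump, and column names are placeholders, and the default in-application input stream name follows the service's documented convention.

```python
# Illustrative Kinesis Data Analytics (SQL) anomaly-detection query,
# held as a string so the shape is easy to inspect.
anomaly_sql = """
CREATE OR REPLACE STREAM "ANOMALY_STREAM" (sensor_value DOUBLE, anomaly_score DOUBLE);

CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
  INSERT INTO "ANOMALY_STREAM"
  SELECT STREAM "sensor_value", "ANOMALY_SCORE"
  FROM TABLE(RANDOM_CUT_FOREST(
      CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")
  ));
"""
print("RANDOM_CUT_FOREST" in anomaly_sql)  # True
```

Records flowing through the pump get an anomaly score appended; a downstream Lambda can then alert on high scores.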

Kinesis Video Streams

  • Ingest video for ML processing (Rekognition Video integration)
  • Not commonly tested but know it exists for video-based ML scenarios

💡
Decision tree: Need to manage shards and have multiple consumers? Kinesis Data Streams. Just need to dump streaming data into S3? Kinesis Data Firehose. Need real-time SQL on streams? Kinesis Data Analytics.

AWS Glue for ETL

AWS Glue is a fully managed ETL (Extract, Transform, Load) service that is central to the exam:

Glue Components

  • Glue Data Catalog — Central metadata repository. Stores table definitions, schemas, and partition information. Used by Athena, Redshift Spectrum, and EMR.
  • Glue Crawlers — Automatically discover data schemas by scanning S3, RDS, or DynamoDB. Populate the Data Catalog.
  • Glue ETL Jobs — Serverless Spark jobs for data transformation. Auto-generate PySpark code from visual editor or write custom scripts.
  • Glue FindMatches — ML-powered deduplication. Finds matching records without writing custom matching logic. Important for exam questions about data cleaning.
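For a feel of how a crawler is wired up, here is a sketch of the parameters a boto3 `glue.create_crawler` call takes (bucket, role ARN, and names are placeholders; the actual call is commented out since it needs AWS credentials):

```python
# Sketch: parameters for creating a Glue crawler over an S3 prefix.
params = {
    "Name": "raw-events-crawler",                                  # placeholder
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",      # placeholder
    "DatabaseName": "ml_raw",                                      # catalog database
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/raw/events/"}]},
    "Schedule": "cron(0 2 * * ? *)",  # nightly, so new partitions get cataloged
}
# boto3.client("glue").create_crawler(**params)
print(sorted(params))
```

Once the crawler has run, Athena, Redshift Spectrum, and EMR can all query the resulting table definitions through the shared Data Catalog.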

Glue for ML Data Preparation

  • Convert CSV to Parquet for efficient querying
  • Partition data by date/category for faster Athena queries
  • Join data from multiple sources (S3 + RDS) into a unified dataset
  • Handle schema evolution as data sources change
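Partitioning in practice means writing S3 keys in Hive-style `key=value` segments, which Glue catalogs as partitions and Athena uses to prune scans. A minimal sketch (prefix and file names are illustrative):

```python
# Sketch: build a Hive-style partitioned S3 key like
#   processed/events/year=2024/month=01/day=15/part-0000.parquet
from datetime import date

def partitioned_key(prefix: str, d: date, filename: str) -> str:
    """Build an S3 object key partitioned by year/month/day."""
    return f"{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"

print(partitioned_key("processed/events", date(2024, 1, 15), "part-0000.parquet"))
```

A query filtered on `year = '2024' AND month = '01'` then only reads objects under that prefix instead of the whole table.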

Amazon Athena for Data Exploration

  • Serverless SQL queries directly on S3 data
  • Built on the Presto/Trino engine — pay per query ($5 per TB scanned)
  • Works with Glue Data Catalog for schema information
  • Exam tip: Use Parquet + partitioning to reduce Athena costs (less data scanned)
  • Ideal for ad-hoc exploration before building ML pipelines
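The cost argument is simple arithmetic at $5 per TB scanned (Athena also bills a 10 MB minimum per query, per its published pricing). A back-of-envelope estimator, with hypothetical data sizes:

```python
# Back-of-envelope Athena cost: $5 per TB scanned, 10 MB minimum per query.
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    billed = max(bytes_scanned, 10 * 1024 * 1024)  # 10 MB per-query minimum
    return price_per_tb * billed / 1024**4

# Full scan of a 200 GB CSV table vs. ~20 GB after Parquet + partition pruning
full_scan = athena_query_cost(200 * 1024**3)
pruned = athena_query_cost(20 * 1024**3)
print(round(full_scan, 4), round(pruned, 4))  # 0.9766 0.0977
```

A 10x reduction in scanned bytes is a 10x reduction in cost, which is why "Parquet + partitioning" is the stock exam answer for Athena cost optimization.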

Amazon EMR for Large-Scale Processing

EMR (Elastic MapReduce) runs Hadoop/Spark clusters for big data processing:

  • When to use EMR vs Glue: EMR for complex transformations, custom libraries, or when you need fine-grained cluster control. Glue for simpler, serverless ETL.
  • EMR supports Spark MLlib for distributed ML training
  • Use spot instances for cost savings on training clusters
  • EMRFS lets clusters use S3 as a Hadoop-compatible file system (its legacy "consistent view" option is no longer needed now that S3 is strongly consistent)

Other Services to Know

AWS Data Pipeline

Orchestrates data movement between AWS services. Now deprecated in favor of Step Functions and Glue Workflows, but it may still appear in older exam questions.

AWS Lake Formation

Simplifies building and managing data lakes. Provides fine-grained access control on top of S3 + Glue. Know it for questions about data governance and security.

Amazon Redshift

Data warehouse for structured analytics. Redshift ML can create ML models using SQL (uses SageMaker Autopilot under the hood). Know Redshift Spectrum for querying S3 data.

Data Engineering Architecture Patterns

Batch ML Pipeline

S3 (raw data) → Glue Crawler (catalog) → Glue ETL (transform) → S3 (processed, Parquet) → SageMaker (training)

Streaming ML Pipeline

IoT/Apps → Kinesis Data Streams → Lambda (transform) → Kinesis Firehose → S3 → SageMaker (training on accumulated data)

Real-Time Inference Pipeline

IoT/Apps → Kinesis Data Streams → Lambda (calls SageMaker endpoint) → DynamoDB (store predictions) + SNS (alerts)
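In the real-time inference pipeline, the Lambda receives Kinesis records base64-encoded inside the event payload. A sketch of the consumer side (the event shape follows the documented Kinesis-to-Lambda integration; the SageMaker endpoint call is commented out since it needs credentials and a live endpoint, and the endpoint name is a placeholder):

```python
# Sketch: Lambda handler consuming a Kinesis event batch.
import base64
import json

def handler(event, context=None):
    decoded = []
    for record in event["Records"]:
        # Kinesis delivers each record's data base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))
        # runtime = boto3.client("sagemaker-runtime")
        # runtime.invoke_endpoint(EndpointName="sensor-model",  # placeholder
        #                         ContentType="application/json", Body=payload)
    return decoded

# Local smoke test with a fake event
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"sensor": 7.2}').decode()}}
]}
print(handler(fake_event))  # [{'sensor': 7.2}]
```

The predictions returned by the endpoint would then be written to DynamoDB, with SNS publishing alerts for out-of-range values.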


Practice Questions

Test your knowledge with these exam-style questions. Try to answer before revealing the explanation.

Q1
A company needs to ingest clickstream data from its website in real time and load it into S3 for daily ML model retraining. The solution should require minimal operational overhead. Which approach should they use?

A) Kinesis Data Streams with a custom KCL application writing to S3
B) Kinesis Data Firehose delivering directly to S3
C) Amazon MSK (Managed Kafka) with a Spark Streaming consumer
D) EC2 instances running Flume agents writing to S3

Answer: B — Kinesis Data Firehose is fully managed (no shard management, no custom consumers) and delivers directly to S3. The question asks for "minimal operational overhead," which eliminates A (requires managing KCL app), C (Kafka is more complex), and D (managing EC2 instances). Firehose handles scaling, batching, and delivery automatically.

Q2
A data scientist needs to train a SageMaker built-in algorithm on a 500 GB tabular dataset stored in S3. What data format and input mode combination will result in the fastest training start time?

A) CSV in File mode
B) Parquet in File mode
C) RecordIO (protobuf) in Pipe mode
D) JSON in File mode

Answer: C — RecordIO protobuf with Pipe mode streams data directly from S3 to the training algorithm without downloading the entire dataset first. This gives the fastest start time and lowest disk usage. File mode requires downloading all data before training begins. For 500 GB, this download time is significant.

Q3
A company has raw CSV files in S3 that need to be converted to Parquet, deduplicated, and cataloged for use by both Athena and SageMaker. Which combination of services achieves this with the least custom code?

A) EMR Spark job for conversion + custom dedup logic + manual Glue Catalog entries
B) AWS Glue Crawler (catalog) + Glue ETL (convert to Parquet) + Glue FindMatches (dedup)
C) Lambda functions for conversion + DynamoDB for dedup tracking + Glue Crawler
D) Athena CTAS queries for conversion + manual dedup with SQL + Glue Crawler

Answer: B — This is the fully managed, least-custom-code solution. Glue Crawler discovers schemas automatically, Glue ETL handles format conversion serverlessly, and Glue FindMatches provides ML-powered deduplication without writing matching logic. All other options require significantly more custom code.

Q4
An IoT platform generates sensor data that needs real-time anomaly detection before being stored. Which architecture best supports this requirement?

A) Kinesis Data Streams → Kinesis Data Analytics (RANDOM_CUT_FOREST) → Lambda (alert) + Firehose (store to S3)
B) SQS → Lambda → SageMaker endpoint → S3
C) Kinesis Firehose → S3 → Athena scheduled query for anomalies
D) IoT Core → DynamoDB → DynamoDB Streams → Lambda anomaly check

Answer: A — Kinesis Data Analytics has a built-in RANDOM_CUT_FOREST function specifically designed for real-time anomaly detection on streaming data. This is a managed, real-time solution. B adds latency through SQS. C is not real-time (Athena is batch). D requires custom anomaly detection logic.

Q5
A machine learning team wants to query their S3 data lake using SQL to understand data distributions before model training. They want to minimize costs. What should they do?

A) Load data into Amazon Redshift and query there
B) Use Amazon Athena with data stored in CSV format
C) Use Amazon Athena with data stored in Parquet format and partitioned by date
D) Spin up an EMR cluster with Hive for SQL queries

Answer: C — Athena charges per TB scanned. Parquet is columnar (only reads needed columns) and partitioning limits which files are scanned, dramatically reducing costs. CSV (B) scans all data. Redshift (A) and EMR (D) require provisioned infrastructure, increasing cost for ad-hoc exploration.

Key Takeaways for the Exam

  • S3 is always the data lake foundation. Know storage classes and data formats (especially RecordIO for SageMaker).
  • Kinesis Firehose = easiest way to get streaming data into S3. No shard management.
  • Kinesis Data Analytics = real-time SQL + RANDOM_CUT_FOREST for anomaly detection.
  • Glue = serverless ETL + Data Catalog + FindMatches for dedup.
  • Athena = serverless SQL on S3. Parquet + partitioning = cost optimization.
  • EMR = when you need custom Spark/Hadoop processing beyond what Glue offers.
  • AWS always prefers managed/serverless solutions over self-managed infrastructure.