Domain 1: Data Engineering (20%)
Master the AWS services and architectural patterns for building ML data pipelines — S3 data lakes, Kinesis streaming, Glue ETL, Athena queries, and EMR clusters.
Amazon S3 for ML Data
Amazon S3 is the backbone of every ML data pipeline on AWS. For the exam, you need to know:
S3 Storage Classes for ML
- S3 Standard — Active training data that is accessed frequently. Default choice for ML datasets.
- S3 Intelligent-Tiering — Data with unpredictable access patterns (experimental datasets).
- S3 Glacier — Long-term archival of raw data or old model artifacts. Not for active training.
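Storage-class transitions like the ones above are usually automated with an S3 lifecycle configuration. Below is a minimal sketch of one, assuming a hypothetical bucket layout with raw data under a `raw/` prefix; the day thresholds are illustrative, not prescriptive, and applying it requires AWS credentials.

```python
# Lifecycle rules: move aging training data to cheaper storage classes.
# Prefix and day thresholds are hypothetical examples.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                # Unpredictable access after 30 days -> Intelligent-Tiering
                {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                # Long-term archive after a year -> Glacier
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

def apply_lifecycle(bucket_name: str) -> None:
    """Apply the rules to a bucket (requires AWS credentials)."""
    import boto3  # AWS SDK for Python
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=lifecycle_config
    )
```

The same rules can be expressed in the S3 console; defining them in code keeps them versioned alongside the rest of the pipeline.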
S3 Data Formats for ML
The exam frequently tests which data format to use:
- CSV — Simple tabular data. Human-readable but slow for large datasets. No schema enforcement.
- Parquet — Columnar format. Best for analytics queries and when you need only specific columns. Preferred by Athena and Glue.
- RecordIO (protobuf) — SageMaker's preferred format for built-in algorithms. Supports pipe mode for streaming data directly to training jobs.
- JSON/JSONL — Semi-structured data. Common for NLP tasks and API data.
- TFRecord — TensorFlow's native format. Used when running TensorFlow on SageMaker.
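To make the JSONL entry above concrete, here is a tiny stdlib-only sketch of the format: one JSON object per line, which is why large NLP datasets in JSONL can be streamed record by record instead of parsed as one giant document. The sample records are invented for illustration.

```python
import io
import json

# Hypothetical labeled text samples
samples = [
    {"text": "great product", "label": 1},
    {"text": "arrived broken", "label": 0},
]

# Write: one serialized JSON object per line
buf = io.StringIO()
for sample in samples:
    buf.write(json.dumps(sample) + "\n")

# Read back one record at a time -- no need to load the whole file
buf.seek(0)
parsed = [json.loads(line) for line in buf]
```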
Amazon Kinesis for Streaming ML
Kinesis is the real-time streaming family. The exam tests all four services:
Kinesis Data Streams
- Real-time data ingestion at scale (clickstreams, IoT, logs)
- You manage shards (each shard = 1 MB/s or 1,000 records/s in, 2 MB/s out)
- Data retained for 24 hours (default) up to 365 days
- Consumers: Lambda, KCL applications, Kinesis Data Analytics
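Because you provision shards yourself, a common exam-style calculation is sizing a stream from the per-shard limits above. A small sketch, using the 1 MB/s and 1,000 records/s write quotas (the stream must satisfy both):

```python
import math

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate the Kinesis Data Streams shards for a given ingest rate.

    Each shard accepts up to 1 MB/s and 1,000 PUT records/s, so the
    shard count is driven by whichever limit binds first.
    """
    return max(math.ceil(mb_per_sec / 1.0),
               math.ceil(records_per_sec / 1000.0))

# e.g. 5,000 records/s at ~2 KB each is ~10 MB/s -> 10 shards
```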
Kinesis Data Firehose
- Fully managed delivery stream — no shard management
- Delivers to S3, Redshift, Amazon OpenSearch Service (formerly Elasticsearch), or Splunk
- Can transform data with Lambda before delivery
- Near real-time (minimum 60-second buffer interval)
- Exam favorite: When the question says "easiest way to load streaming data into S3," the answer is Firehose
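The "transform data with Lambda" step above uses a fixed record contract: Firehose invokes the function with base64-encoded records and expects each one returned with its recordId, a result status, and re-encoded data. A minimal sketch (the `processed` field is an invented example transformation):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode, transform, re-encode."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # example transformation
        output.append({
            "recordId": record["recordId"],   # must echo the original id
            "result": "Ok",                   # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```

Records marked "Dropped" are silently discarded, and "ProcessingFailed" records are delivered to an error prefix, so the transform can also act as a filter.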
Kinesis Data Analytics
- Run SQL queries or Apache Flink applications on streaming data
- Real-time anomaly detection using RANDOM_CUT_FOREST SQL function
- Sliding window aggregations for time-series analysis
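To see what a windowed aggregation actually computes, here is a plain-Python sketch of a tumbling (fixed, non-overlapping) window average, the simplest of the window types Kinesis Data Analytics expresses in SQL or Flink. Real streams are unbounded; this batch version only illustrates the grouping logic, and the sample events are invented.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds=60):
    """Group (timestamp, value) events into fixed windows, average each."""
    buckets = defaultdict(list)
    for ts, value in events:
        # Floor each timestamp to the start of its window
        buckets[ts // window_seconds * window_seconds].append(value)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(buckets.items())}

events = [(0, 10.0), (30, 20.0), (70, 40.0)]
# windows: [0, 60) -> 15.0, [60, 120) -> 40.0
```

Sliding windows differ only in that windows overlap, so one event can contribute to several aggregates.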
Kinesis Video Streams
- Ingest video for ML processing (Rekognition Video integration)
- Not commonly tested but know it exists for video-based ML scenarios
AWS Glue for ETL
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that is central to the exam:
Glue Components
- Glue Data Catalog — Central metadata repository. Stores table definitions, schemas, and partition information. Used by Athena, Redshift Spectrum, and EMR.
- Glue Crawlers — Automatically discover data schemas by scanning S3, RDS, or DynamoDB. Populate the Data Catalog.
- Glue ETL Jobs — Serverless Spark jobs for data transformation. Auto-generate PySpark code from visual editor or write custom scripts.
- Glue FindMatches — ML-powered deduplication. Finds matching records without writing custom matching logic. Important for exam questions about data cleaning.
Glue for ML Data Preparation
- Convert CSV to Parquet for efficient querying
- Partition data by date/category for faster Athena queries
- Join data from multiple sources (S3 + RDS) into a unified dataset
- Handle schema evolution as data sources change
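The date partitioning mentioned above is conventionally expressed as Hive-style key prefixes (`year=/month=/day=`), which Glue writes and Athena uses to prune files at query time. A small sketch of the key layout, with an invented prefix and filename:

```python
from datetime import date

def partitioned_key(prefix: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=)."""
    return (f"{prefix}/year={event_date.year}"
            f"/month={event_date.month:02d}"
            f"/day={event_date.day:02d}/{filename}")

key = partitioned_key("processed", date(2024, 3, 7), "part-0000.parquet")
# -> "processed/year=2024/month=03/day=07/part-0000.parquet"
```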
Amazon Athena for Data Exploration
- Serverless SQL queries directly on S3 data
- Uses Presto engine — pay per query ($5 per TB scanned)
- Works with Glue Data Catalog for schema information
- Exam tip: Use Parquet + partitioning to reduce Athena costs (less data scanned)
- Ideal for ad-hoc exploration before building ML pipelines
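The Parquet + partitioning cost tip follows directly from the per-TB pricing: Athena bills only the bytes it scans. A quick arithmetic sketch, with the 2% scan fraction a hypothetical figure for a well-partitioned columnar layout:

```python
def athena_query_cost(gb_scanned: float, price_per_tb: float = 5.0) -> float:
    """Cost of one Athena query at the listed $5 per TB scanned."""
    return gb_scanned / 1024 * price_per_tb

# Hypothetical 1 TB dataset stored as CSV: a full scan reads all 1,024 GB.
full_scan = athena_query_cost(1024)           # -> 5.0 dollars
# Parquet + date partitioning might touch ~2% of the bytes.
pruned_scan = athena_query_cost(1024 * 0.02)  # ~0.10 dollars
```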
Amazon EMR for Large-Scale Processing
EMR (Elastic MapReduce) runs Hadoop/Spark clusters for big data processing:
- When to use EMR vs Glue: EMR for complex transformations, custom libraries, or when you need fine-grained cluster control. Glue for simpler, serverless ETL.
- EMR supports Spark MLlib for distributed ML training
- Use spot instances for cost savings on training clusters
- EMRFS lets clusters use S3 as an HDFS-compatible file system (its "consistent view" option predates S3's strong consistency)
Other Services to Know
AWS Data Pipeline
Orchestrates data movement between AWS services. Being replaced by Step Functions and Glue Workflows in most modern architectures, but still appears on the exam.
AWS Lake Formation
Simplifies building and managing data lakes. Provides fine-grained access control on top of S3 + Glue. Know it for questions about data governance and security.
Amazon Redshift
Data warehouse for structured analytics. Redshift ML can create ML models using SQL (uses SageMaker Autopilot under the hood). Know Redshift Spectrum for querying S3 data.
Data Engineering Architecture Patterns
Batch ML Pipeline
S3 (raw data) → Glue Crawler (catalog) → Glue ETL (transform) → S3 (processed, Parquet) → SageMaker (training)
Streaming ML Pipeline
IoT/Apps → Kinesis Data Streams → Lambda (transform) → Kinesis Firehose → S3 → SageMaker (training on accumulated data)
Real-Time Inference Pipeline
IoT/Apps → Kinesis Data Streams → Lambda (calls SageMaker endpoint) → DynamoDB (store predictions) + SNS (alerts)
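The Lambda step in the real-time inference pipeline above decodes Kinesis records and forwards each payload to a SageMaker endpoint. A sketch under stated assumptions: the endpoint name is hypothetical, and the SageMaker call is injectable so the decoding logic can run without AWS access; the DynamoDB/SNS steps are left as a comment.

```python
import base64
import json

def handler(event, context, invoke=None):
    """Lambda consumer for a Kinesis stream: decode records, get predictions."""
    if invoke is None:
        # Default path: call a (hypothetical) SageMaker endpoint via boto3.
        import boto3
        runtime = boto3.client("sagemaker-runtime")
        def invoke(body):
            response = runtime.invoke_endpoint(
                EndpointName="anomaly-detector",  # hypothetical endpoint name
                ContentType="application/json",
                Body=body,
            )
            return json.loads(response["Body"].read())
    predictions = []
    for record in event["Records"]:
        # Kinesis delivers each record's payload base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"]).decode()
        predictions.append(invoke(payload))
    # A real pipeline would write predictions to DynamoDB and publish SNS alerts
    return predictions
```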
Practice Questions
Test your knowledge with these exam-style questions. Try to answer before revealing the explanation.
Question 1: A company needs to load streaming data into S3 with minimal operational overhead. Which solution should they use?
A) Kinesis Data Streams with a custom KCL application writing to S3
B) Kinesis Data Firehose delivering directly to S3
C) Amazon MSK (Managed Kafka) with a Spark Streaming consumer
D) EC2 instances running Flume agents writing to S3
Answer: B — Kinesis Data Firehose is fully managed (no shard management, no custom consumers) and delivers directly to S3. The question asks for "minimal operational overhead," which eliminates A (requires managing KCL app), C (Kafka is more complex), and D (managing EC2 instances). Firehose handles scaling, batching, and delivery automatically.
Question 2: A data scientist is training a SageMaker built-in algorithm on a 500 GB dataset and wants training to start as quickly as possible with minimal local disk usage. Which data format and input mode should they choose?
A) CSV in File mode
B) Parquet in File mode
C) RecordIO (protobuf) in Pipe mode
D) JSON in File mode
Answer: C — RecordIO protobuf with Pipe mode streams data directly from S3 to the training algorithm without downloading the entire dataset first. This gives the fastest start time and lowest disk usage. File mode requires downloading all data before training begins. For 500 GB, this download time is significant.
Question 3: A team must catalog a raw S3 dataset, convert it to Parquet, and remove duplicate records while writing as little custom code as possible. Which combination of services should they use?
A) EMR Spark job for conversion + custom dedup logic + manual Glue Catalog entries
B) AWS Glue Crawler (catalog) + Glue ETL (convert to Parquet) + Glue FindMatches (dedup)
C) Lambda functions for conversion + DynamoDB for dedup tracking + Glue Crawler
D) Athena CTAS queries for conversion + manual dedup with SQL + Glue Crawler
Answer: B — This is the fully managed, least-custom-code solution. Glue Crawler discovers schemas automatically, Glue ETL handles format conversion serverlessly, and Glue FindMatches provides ML-powered deduplication without writing matching logic. All other options require significantly more custom code.
Question 4: A company wants to detect anomalies in streaming sensor data in real time, trigger alerts, and store the data in S3 using managed services. Which architecture should they choose?
A) Kinesis Data Streams → Kinesis Data Analytics (RANDOM_CUT_FOREST) → Lambda (alert) + Firehose (store to S3)
B) SQS → Lambda → SageMaker endpoint → S3
C) Kinesis Firehose → S3 → Athena scheduled query for anomalies
D) IoT Core → DynamoDB → DynamoDB Streams → Lambda anomaly check
Answer: A — Kinesis Data Analytics has a built-in RANDOM_CUT_FOREST function specifically designed for real-time anomaly detection on streaming data. This is a managed, real-time solution. B adds latency through SQS. C is not real-time (Athena is batch). D requires custom anomaly detection logic.
Question 5: An analyst needs to run ad-hoc SQL queries over a large dataset stored in S3 at the lowest possible cost. Which approach is most cost-effective?
A) Load data into Amazon Redshift and query there
B) Use Amazon Athena with data stored in CSV format
C) Use Amazon Athena with data stored in Parquet format and partitioned by date
D) Spin up an EMR cluster with Hive for SQL queries
Answer: C — Athena charges per TB scanned. Parquet is columnar (only reads needed columns) and partitioning limits which files are scanned, dramatically reducing costs. CSV (B) scans all data. Redshift (A) and EMR (D) require provisioned infrastructure, increasing cost for ad-hoc exploration.
Key Takeaways for the Exam
- S3 is always the data lake foundation. Know storage classes and data formats (especially RecordIO for SageMaker).
- Kinesis Firehose = easiest way to get streaming data into S3. No shard management.
- Kinesis Data Analytics = real-time SQL + RANDOM_CUT_FOREST for anomaly detection.
- Glue = serverless ETL + Data Catalog + FindMatches for dedup.
- Athena = serverless SQL on S3. Parquet + partitioning = cost optimization.
- EMR = when you need custom Spark/Hadoop processing beyond what Glue offers.
- AWS always prefers managed/serverless solutions over self-managed infrastructure.
Lilly Tech Systems