Beginner

AI/ML Fundamentals (20%)

Domain 1 of the AIF-C01 exam — understand the core concepts of artificial intelligence, machine learning, and deep learning, the types of ML, and the complete model lifecycle.

AI vs ML vs Deep Learning

These three terms are often used interchangeably, but they represent different levels of the same field. For the exam, you must know the distinctions clearly.

Artificial Intelligence (AI)

AI is the broadest term. It refers to any system that can perform tasks that normally require human intelligence. This includes rule-based systems, expert systems, and machine learning. A chatbot that follows a scripted decision tree is AI, even though it does not "learn."

Machine Learning (ML)

ML is a subset of AI where systems learn patterns from data rather than being explicitly programmed. Instead of writing rules, you provide training data and the algorithm discovers the rules. Key characteristic: the model's performance generally improves as it is trained on more representative data.
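To make the contrast concrete, here is a minimal sketch in plain Python. The spam scenario, the link-count feature, and both function names are hypothetical and chosen only for illustration. The first function is a hand-written rule; the second picks its threshold from labeled examples, which is the "algorithm discovers the rules" idea in miniature.

```python
# Hand-written rule: the programmer picks the threshold.
def rule_based_is_spam(num_links: int) -> bool:
    return num_links > 5  # hypothetical hard-coded rule


# Learned rule: the threshold is chosen from labeled examples.
def learn_threshold(examples: list[tuple[int, bool]]) -> int:
    """Return the threshold t that makes 'num_links > t' most accurate on the examples."""
    candidates = sorted({x for x, _ in examples})
    best_t, best_correct = candidates[0], -1
    for t in candidates:
        correct = sum((x > t) == label for x, label in examples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


# Toy labeled data: (number of links in the email, is_spam)
training_data = [(0, False), (1, False), (2, False), (3, True), (4, True), (8, True)]
threshold = learn_threshold(training_data)
print(threshold)  # 2
```

With this toy data the hard-coded rule misses the emails with 3 and 4 links, while the learned threshold separates every training example.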

Deep Learning (DL)

DL is a subset of ML that uses neural networks with multiple layers (hence "deep"). Deep learning excels at unstructured data: images, audio, text, and video. Most modern AI breakthroughs (GPT, image generation, speech recognition) use deep learning.
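As a rough illustration of what "multiple layers" means, the sketch below passes an input through two stacked layers in plain Python. The weights are fixed, made-up numbers; a real deep network has many more layers and learns its weights during training.

```python
def relu(values):
    """Activation function: negative values become 0."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


# Hypothetical fixed weights, chosen only so the arithmetic is easy to follow.
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, 2.0]]
output_b = [0.5]


def forward(inputs):
    hidden = relu(dense(inputs, hidden_w, hidden_b))  # layer 1
    return dense(hidden, output_w, output_b)          # layer 2


print(forward([2.0, 1.0]))  # [2.5]
```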

💡

Exam tip: Think of it as nested circles: AI (largest) contains ML, which contains DL. Every DL system is ML, and every ML system is AI, but not the reverse. The exam frequently tests this hierarchy.

Types of Machine Learning

Supervised Learning

The model learns from labeled data — input-output pairs where the correct answer is known. The goal is to learn a mapping from inputs to outputs so the model can predict outputs for new, unseen inputs.

Classification — Predict a category (spam vs. not spam, cat vs. dog, positive vs. negative sentiment)
Regression — Predict a continuous number (house price, temperature, stock price)

AWS examples: Amazon SageMaker built-in algorithms like XGBoost and Linear Learner are supervised. Amazon Comprehend sentiment analysis is supervised (trained on labeled sentiment data).
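The difference between classification and regression can be sketched with a tiny k-nearest-neighbors model in plain Python. This is an illustration only, not one of the SageMaker algorithms named above, and the datasets are invented. The same labeled-data idea produces a category in one case and a number in the other.

```python
from collections import Counter


def nearest_neighbors(train, x, k):
    """Return the labels of the k training points whose input is closest to x."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [label for _, label in ranked[:k]]


def knn_classify(train, x, k=3):
    # Classification: majority vote among the neighbors' categories.
    return Counter(nearest_neighbors(train, x, k)).most_common(1)[0][0]


def knn_regress(train, x, k=3):
    # Regression: average of the neighbors' numeric labels.
    labels = nearest_neighbors(train, x, k)
    return sum(labels) / len(labels)


# Labeled data: (number of links in an email, category)
spam_data = [(0, "not spam"), (1, "not spam"), (2, "not spam"),
             (6, "spam"), (7, "spam"), (9, "spam")]
print(knn_classify(spam_data, 8))  # spam

# Labeled data: (floor area in square meters, price in thousands)
price_data = [(50, 150.0), (60, 180.0), (80, 240.0), (100, 300.0)]
print(knn_regress(price_data, 70, k=2))  # 210.0
```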

Unsupervised Learning

The model learns from unlabeled data — it finds hidden patterns and structures without being told what to look for.

Clustering — Group similar items together (customer segmentation, document grouping)
Dimensionality reduction — Reduce the number of features while keeping important information (PCA)
Anomaly detection — Identify unusual data points (fraud detection, system monitoring)

AWS examples: Amazon SageMaker includes built-in K-Means for clustering, Random Cut Forest (RCF) for anomaly detection, and PCA for dimensionality reduction.
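Clustering can be sketched with a one-dimensional version of the k-means idea in plain Python. This is a teaching sketch, not the SageMaker implementation; the spending figures and starting centroids are made up. Note that no labels are supplied: the groups emerge from the data.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster numbers around the given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


monthly_spend = [10, 12, 14, 95, 100, 105]  # unlabeled customer data
centroids, clusters = k_means_1d(monthly_spend, centroids=[0, 50])
print(centroids)  # [12.0, 100.0]
print(clusters)   # [[10, 12, 14], [95, 100, 105]]
```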

Reinforcement Learning (RL)

The model learns by interacting with an environment and receiving rewards or penalties. It discovers the best strategy through trial and error. Think of it as training a dog with treats — the model learns which actions lead to rewards.

AWS example: AWS DeepRacer uses reinforcement learning to train autonomous racing models.
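The agent, environment, and reward loop can be sketched with tabular Q-learning on a five-cell corridor. This is a toy example in plain Python and is unrelated to how AWS DeepRacer is implemented; the environment, reward, and the values of ALPHA and GAMMA are assumptions made for illustration. The agent acts randomly (trial and error) and still learns that moving right leads to the reward.

```python
import random

N_STATES = 5                 # corridor positions 0..4; position 4 is the goal
ACTIONS = ["left", "right"]
ALPHA, GAMMA = 0.5, 0.9      # learning rate and discount factor (illustrative values)


def step(state, action):
    """Environment: move one cell; the reward is 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # explore by trial and error
            next_state, reward = step(state, action)
            best_next = max(q[next_state].values())
            # Q-learning update: nudge the estimate toward reward + discounted future value.
            q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])
            state = next_state
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)  # ['right', 'right', 'right', 'right']
```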

Semi-Supervised Learning

A practical middle ground: the model trains on a small amount of labeled data combined with a large amount of unlabeled data. This is common in real-world scenarios where labeling all data is expensive.
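One common semi-supervised technique is self-training (pseudo-labeling): train on the few labeled points, use that model to label the unlabeled points, then retrain on everything. The sketch below shows a single round with a nearest-centroid model in plain Python. The data and helper names are hypothetical, and practical versions usually keep only the high-confidence pseudo-labels.

```python
def fit_centroids(examples):
    """Return the mean input value for each label."""
    groups = {}
    for x, label in examples:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


labeled = [(1.0, "low"), (9.0, "high")]      # small, expensive labeled set
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]   # large, cheap unlabeled set

# Step 1: train on the labeled data only.
centroids = fit_centroids(labeled)
# Step 2: let the model label the unlabeled data (pseudo-labels).
pseudo_labeled = [(x, predict(centroids, x)) for x in unlabeled]
# Step 3: retrain on labeled plus pseudo-labeled data.
centroids = fit_centroids(labeled + pseudo_labeled)
print(centroids)  # {'low': 1.25, 'high': 8.75}
```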

⚠

Exam focus: The exam loves asking "which type of learning is this?" Given a scenario, identify whether it is supervised (labeled data, predict output), unsupervised (no labels, find patterns), or reinforcement (agent, environment, rewards).

The Machine Learning Lifecycle

The ML lifecycle describes the end-to-end process of building and maintaining an ML system. For the exam, know each phase and what happens in it.

1. Business Problem Definition

Define what you are trying to solve. Is ML the right approach? What are the success metrics? This step is often overlooked, and many ML projects fail because the problem was not well-defined.

2. Data Collection and Preparation

Gather training data. Clean it (handle missing values, remove duplicates, fix errors). Transform it into a format suitable for ML. This phase is often cited as taking 60-80% of total project time.
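Two of the cleaning steps mentioned above, removing duplicates and handling missing values, can be sketched in plain Python. The rows and column names are invented, and filling a gap with the column mean is only one of several possible strategies.

```python
raw_rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},   # missing value
    {"age": 34, "income": 52000},     # exact duplicate of the first row
    {"age": 46, "income": 58000},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows, column):
    """Replace missing values in a column with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)
    return [{**row, column: mean if row[column] is None else row[column]} for row in rows]


clean_rows = impute_mean(drop_duplicates(raw_rows), "age")
print(clean_rows)
# [{'age': 34, 'income': 52000}, {'age': 40.0, 'income': 61000}, {'age': 46, 'income': 58000}]
```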

3. Feature Engineering

Select, transform, and create the input variables (features) that the model will use. Good features are the difference between a mediocre model and an excellent one.
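Two common feature transformations are sketched below in plain Python: min-max scaling for numeric features and one-hot encoding for categorical ones. The function names and sample values are illustrative assumptions, not part of any AWS service.

```python
def min_max_scale(values):
    """Rescale numbers to the range 0..1 so features share a comparable scale."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Turn each category into a list of 0/1 flags, one flag per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


print(min_max_scale([20, 30, 40]))       # [0.0, 0.5, 1.0]
print(one_hot(["red", "blue", "red"]))   # [[0, 1], [1, 0], [0, 1]]
```

In the one-hot output the columns follow alphabetical order of the categories, so the first flag means "blue" and the second means "red".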

4. Model Training

Choose an algorithm and train the model on your prepared data. Split data into training, validation, and test sets. The model learns patterns from the training data.
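The three-way split can be sketched in plain Python as follows. The 70/15/15 proportions and the fixed seed are assumptions for illustration; the right proportions depend on how much data you have.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle so each set is a random sample
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out for final evaluation
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```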

5. Model Evaluation

Test the model on data it has never seen (the test set). Use appropriate metrics: accuracy, precision, recall, and F1 score for classification; RMSE (root mean squared error) and MAE (mean absolute error) for regression. Determine if the model meets your success criteria.
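For the regression metrics, a small worked example in plain Python shows how MAE and RMSE differ. The prices are invented. Three predictions are exact and one is off by 40, which RMSE penalizes more heavily than MAE because the error is squared before averaging.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: like MAE, but large errors count disproportionately."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [200.0, 250.0, 300.0, 350.0]
predicted = [200.0, 250.0, 300.0, 390.0]  # one large error of 40

print(mae(actual, predicted))   # 10.0
print(rmse(actual, predicted))  # 20.0
```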

6. Model Deployment

Put the model into production where it can make predictions on real data. This could be a real-time API endpoint, a batch processing job, or an edge device.

7. Model Monitoring and Maintenance

Continuously monitor model performance. Data changes over time (data drift), which can degrade model accuracy. Retrain the model periodically with fresh data.

💡

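Key concept (data drift): When the statistical properties of the data the model receives in production differ from the data it was trained on, model performance can degrade. This is called data drift. A related but distinct problem, concept drift, occurs when the relationship between the inputs and the target itself changes. Monitoring and retraining are essential. Amazon SageMaker Model Monitor can detect data drift by comparing production data against a baseline computed from the training data.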
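The idea of comparing production data against a training baseline can be sketched with a very simple check in plain Python. This is not how SageMaker Model Monitor works internally; the function names, the sample ages, and the threshold of 2 standard deviations are all assumptions for illustration.

```python
import statistics


def mean_shift_in_std(baseline, production):
    """How far the production mean has moved, in units of the baseline standard deviation."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(production) - base_mean) / base_std


def has_drifted(baseline, production, threshold=2.0):
    # Hypothetical rule: flag drift when the mean moves more than `threshold` deviations.
    return mean_shift_in_std(baseline, production) > threshold


training_ages = [30, 32, 34, 36, 38]         # feature values seen during training
recent_ages_similar = [31, 33, 35, 36, 37]   # production data that looks like training data
recent_ages_shifted = [50, 52, 55, 58, 60]   # production data from a different population

print(has_drifted(training_ages, recent_ages_similar))  # False
print(has_drifted(training_ages, recent_ages_shifted))  # True
```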

Key Terminology for the Exam

Training data — Data used to teach the model
Validation data — Data used to tune hyperparameters and compare candidate models during training
Test data — Data used for final evaluation (never seen during training)
Overfitting — Model memorizes training data but fails on new data (too complex)
Underfitting — Model is too simple to capture patterns in the data
Hyperparameters — Settings you configure before training (learning rate, number of layers)
Inference — Using a trained model to make predictions on new data
Epoch — One complete pass through the entire training dataset
Label — The correct answer in supervised learning (the "y" in input-output pairs)
Feature — An input variable used by the model to make predictions (the "x")
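Several of these terms can be seen together in one short training loop. The sketch below fits a straight line with gradient descent in plain Python; the data and the hyperparameter values are made up for illustration. The features are the x values, the labels are the y values, learning_rate and epochs are hyperparameters, and the last step is inference.

```python
def train_linear(features, labels, learning_rate=0.05, epochs=1000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):  # one epoch = one full pass over the training data
        grad_w = sum((w * x + b - y) * x for x, y in zip(features, labels)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(features, labels)) * 2 / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


features = [1.0, 2.0, 3.0, 4.0]   # x: inputs
labels = [3.0, 5.0, 7.0, 9.0]     # y: correct answers (here y = 2x + 1)

w, b = train_linear(features, labels)
prediction = w * 5.0 + b          # inference on a new, unseen input
print(round(w, 2), round(b, 2), round(prediction, 2))  # 2.0 1.0 11.0
```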

Common ML Evaluation Metrics

You do not need to calculate these for the exam, but you must know what they measure:

Accuracy — Percentage of correct predictions. Misleading for imbalanced datasets.
Precision — Of all positive predictions, how many were actually positive. Important when false positives are costly (spam filter).
Recall — Of all actual positives, how many did the model catch. Important when false negatives are costly (disease detection).
F1 Score — Harmonic mean of precision and recall. Use when you need a balance.
AUC-ROC — Measures the model's ability to distinguish between classes across all thresholds.
RMSE — Root Mean Squared Error for regression. Lower is better.
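Although the exam does not ask for calculations, a small worked example can make the classification metrics easier to remember. The sketch below computes them in plain Python from made-up predictions (1 = positive, 0 = negative): 3 true positives, 3 false negatives, 1 false positive, and 13 true negatives.

```python
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
    return tp, fp, fn, tn


def metrics(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


actual = [1] * 6 + [0] * 14
predicted = [1] * 3 + [0] * 3 + [1] * 1 + [0] * 13

print(metrics(actual, predicted))
# {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.5, 'f1': 0.6}
```

Accuracy looks healthy at 0.8 even though the model catches only half of the actual positives, which is why recall and F1 matter when positives are rare.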

Practice Questions

📝

Q1: A company wants to group its customers into segments based on purchasing behavior, but does not have predefined categories. Which type of machine learning should they use?

A) Supervised learning - classification
B) Supervised learning - regression
C) Unsupervised learning - clustering
D) Reinforcement learning

Answer:

C) Unsupervised learning - clustering. The company has no predefined categories (no labels), so supervised learning is out. They want to group similar customers, which is a clustering task. K-Means is a common clustering algorithm available in SageMaker.

📝

Q2: Which term describes when a machine learning model performs well on training data but poorly on new, unseen data?

A) Underfitting
B) Overfitting
C) Data drift
D) Feature engineering

Answer:

B) Overfitting. Overfitting occurs when the model memorizes training data (including noise) instead of learning general patterns. It performs well on training data but poorly on new data. Remedies include regularization, more training data, simpler models, or dropout (for neural networks).

📝

Q3: A hospital wants to predict whether patients have a rare disease. Only 1% of patients in the dataset are positive. The model predicts "no disease" for every patient and achieves 99% accuracy. What is the problem?

A) The model is overfitting
B) Accuracy is misleading for imbalanced datasets
C) The model needs more epochs
D) The model is underfitting

Answer:

B) Accuracy is misleading for imbalanced datasets. When only 1% of cases are positive, always predicting negative gives 99% accuracy but catches zero actual cases. Better metrics for imbalanced data are recall (catches positives), precision, F1 score, or AUC-ROC.

📝

Q4: What is the relationship between AI, ML, and deep learning?

A) They are three completely separate fields
B) ML is the broadest field, containing both AI and DL
C) AI is the broadest field, ML is a subset of AI, and DL is a subset of ML
D) DL is the broadest field, containing both AI and ML

Answer:

C) AI is the broadest field, ML is a subset of AI, and DL is a subset of ML. AI encompasses all intelligent systems. ML is a specific approach within AI that learns from data. Deep learning is a specific approach within ML that uses multi-layered neural networks.

📝

Q5: Which phase of the ML lifecycle typically consumes the most time?

A) Model training
B) Model deployment
C) Data collection and preparation
D) Model evaluation

Answer:

C) Data collection and preparation. Data preparation (cleaning, transforming, handling missing values, feature engineering) is often cited as taking 60-80% of total project time. This is a widely repeated rule of thumb in the ML industry and a common exam topic.
