AI/ML Fundamentals (20%)
Domain 1 of the AIF-C01 exam — understand the core concepts of artificial intelligence, machine learning, and deep learning, the types of ML, and the complete model lifecycle.
AI vs ML vs Deep Learning
These three terms are often used interchangeably, but they describe nested scopes within the same field: each one is a subset of the one before it. For the exam, you must know the distinctions clearly.
Artificial Intelligence (AI)
AI is the broadest term. It refers to any system that can perform tasks that normally require human intelligence. This includes rule-based systems, expert systems, and machine learning. A chatbot that follows a scripted decision tree is AI, even though it does not "learn."
Machine Learning (ML)
ML is a subset of AI where systems learn patterns from data rather than being explicitly programmed. Instead of writing rules, you provide training data and the algorithm discovers the rules. Key characteristic: the model's performance generally improves as it sees more relevant, good-quality data.
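To make the contrast between hand-written rules and learned rules concrete, here is a minimal Python sketch. The spam scenario, the training data, and the function names are hypothetical illustrations chosen for this guide; the "learner" is deliberately simplistic and is not how any AWS service works.

```python
# Rule-based AI: a human writes the rule by hand. Nothing is learned.
def rule_based_is_spam(message: str) -> bool:
    return "free money" in message.lower()


# Machine learning: the rule (here, a threshold) is derived from labeled data.
def learn_threshold(examples: list[tuple[int, bool]]) -> float:
    """Return the midpoint between the average exclamation-mark count
    of spam messages and of non-spam messages."""
    spam = [count for count, is_spam in examples if is_spam]
    ham = [count for count, is_spam in examples if not is_spam]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2


# Hypothetical training data: (number of "!" in the message, is it spam?)
training = [(0, False), (1, False), (5, True), (7, True)]
threshold = learn_threshold(training)  # 3.25 for this data


def learned_is_spam(exclamation_count: int) -> bool:
    return exclamation_count > threshold
```

If the training data changes, the learned threshold changes with it, while the hand-written rule stays the same until someone edits the code.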
Deep Learning (DL)
DL is a subset of ML that uses neural networks with multiple layers (hence "deep"). Deep learning excels at unstructured data: images, audio, text, and video. Most modern AI breakthroughs (GPT, image generation, speech recognition) use deep learning.
Types of Machine Learning
Supervised Learning
The model learns from labeled data — input-output pairs where the correct answer is known. The goal is to learn a mapping from inputs to outputs so the model can predict outputs for new, unseen inputs.
- Classification — Predict a category (spam vs. not spam, cat vs. dog, positive vs. negative sentiment)
- Regression — Predict a continuous number (house price, temperature, stock price)
AWS examples: Amazon SageMaker built-in algorithms like XGBoost and Linear Learner are supervised. Amazon Comprehend sentiment analysis is supervised (trained on labeled sentiment data).
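The two supervised tasks can be sketched in plain Python. This is an illustrative toy under stated assumptions: the datasets are invented, and the 1-nearest-neighbor classifier and least-squares line fit were chosen for brevity. They are not the algorithms behind XGBoost, Linear Learner, or Amazon Comprehend.

```python
# Classification: predict a category from labeled (feature, label) pairs
# using the label of the single closest training example.
def predict_label(train: list[tuple[float, str]], x: float) -> str:
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


# Regression: predict a continuous number by fitting
# y = slope * x + intercept with ordinary least squares.
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Hypothetical labeled data: (number of links in an email, label)
labeled_emails = [(0, "not spam"), (1, "not spam"), (8, "spam"), (9, "spam")]
category = predict_label(labeled_emails, 7)

# Hypothetical labeled data: house size in square meters, price in thousands
sizes = [50.0, 100.0, 150.0]
prices = [100.0, 200.0, 300.0]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 120.0 + intercept
```

In both cases the correct answers (the labels and the prices) are part of the training data, which is what makes the task supervised.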
Unsupervised Learning
The model learns from unlabeled data — it finds hidden patterns and structures without being told what to look for.
- Clustering — Group similar items together (customer segmentation, document grouping)
- Dimensionality reduction — Reduce the number of features while keeping important information (PCA)
- Anomaly detection — Identify unusual data points (fraud detection, system monitoring)
AWS examples: the Amazon SageMaker built-in K-Means algorithm for clustering and Random Cut Forest (RCF) for anomaly detection.
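Clustering can be illustrated with a small one-dimensional k-means sketch. The customer spend values and starting centroids are hypothetical, and this toy version is only meant to show the assign-then-update loop; it is not the SageMaker K-Means implementation.

```python
def k_means_1d(points: list[float], centroids: list[float], iterations: int = 10):
    """Group points around centroids. No labels are used at any step."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer, with no labels attached
spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = k_means_1d(spend, centroids=[0.0, 50.0])
```

For this data the algorithm settles on two groups, low spenders and high spenders, without ever being told that those segments exist.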
Reinforcement Learning (RL)
The model (called an agent in RL) learns by interacting with an environment and receiving rewards or penalties. It discovers the best strategy (its policy) through trial and error. Think of it as training a dog with treats: the agent learns which actions lead to rewards.
AWS example: AWS DeepRacer uses reinforcement learning to train autonomous racing models.
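The trial-and-error idea can be sketched with an epsilon-greedy agent on a two-action problem (a "multi-armed bandit"). This is a simplified stand-in chosen for brevity and is not how AWS DeepRacer trains its models; the reward probabilities, step count, and function name are assumptions made for illustration.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn which action pays off best purely from observed rewards."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # estimated reward per action
    counts = [0] * len(reward_probs)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        # The environment returns a reward of 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental average: nudge the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical environment: action 1 is rewarded far more often than action 0.
estimates, counts = run_bandit([0.2, 0.8])
```

The agent is never told which action is better. It ends up choosing the higher-reward action most of the time because that action's estimated reward grows as rewards accumulate.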
Semi-Supervised Learning
A practical middle ground: the model trains on a small amount of labeled data combined with a large amount of unlabeled data. This is common in real-world scenarios where labeling all data is expensive.
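One common way to combine the two kinds of data is self-training (pseudo-labeling): train on the labeled examples, label the unlabeled ones with the model's own predictions, then retrain on everything. The sketch below assumes a toy nearest-centroid classifier and invented data. Practical pipelines usually keep only high-confidence pseudo-labels, which this sketch omits.

```python
def centroids_from(labeled):
    """Train: compute the mean feature value for each label."""
    groups = {}
    for x, label in labeled:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}


def nearest_centroid(centroids, x):
    """Predict: return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Hypothetical data: two expensive human labels, six cheap unlabeled points
labeled = [(1.0, "low"), (9.0, "high")]
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]

# Step 1: train on the small labeled set.
initial = centroids_from(labeled)
# Step 2: use that model to assign pseudo-labels to the unlabeled data.
pseudo_labeled = [(x, nearest_centroid(initial, x)) for x in unlabeled]
# Step 3: retrain on labeled and pseudo-labeled data together.
refined = centroids_from(labeled + pseudo_labeled)
```

The refined centroids reflect all eight data points even though only two of them were labeled by hand.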
The Machine Learning Lifecycle
The ML lifecycle describes the end-to-end process of building and maintaining an ML system. For the exam, know each phase and what happens in it.
1. Business Problem Definition
Define what you are trying to solve. Is ML the right approach? What are the success metrics? This step is often overlooked, and many ML projects fail because the problem was not well defined.
2. Data Collection and Preparation
Gather training data. Clean it (handle missing values, remove duplicates, fix errors). Transform it into a format suitable for ML. This phase is often cited as taking 60-80% of total project time.
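The three cleaning tasks named above can be sketched in a few lines of Python. The rows, the column names, and the rule that a negative income is invalid are hypothetical, and filling gaps with the mean is only one of several reasonable strategies.

```python
# Hypothetical raw data with typical quality problems
raw_rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},  # missing value
    {"age": 34, "income": 52000},    # exact duplicate of the first row
    {"age": 29, "income": -1},       # invalid entry
    {"age": 40, "income": 70000},
]


def clean(rows):
    # 1. Remove exact duplicates while preserving order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not modified
    # 2. Fix errors: drop rows with an impossible (negative) income.
    valid = [row for row in unique if row["income"] >= 0]
    # 3. Handle missing values: fill a missing age with the mean known age.
    known_ages = [row["age"] for row in valid if row["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)
    for row in valid:
        if row["age"] is None:
            row["age"] = mean_age
    return valid


cleaned = clean(raw_rows)
```

Five raw rows become three usable ones, which hints at why this phase consumes so much project time on real datasets.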
3. Feature Engineering
Select, transform, and create the input variables (features) that the model will use. Good features are the difference between a mediocre model and an excellent one.
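As a rough illustration of transforming and creating features, the sketch below scales a numeric column, derives a new ratio feature, and one-hot encodes a category. The housing records and field names are invented for this example.

```python
def min_max_scale(values):
    """Transform: rescale numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """Transform: turn a category into a list of 0/1 flags."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical raw records
houses = [
    {"sqft": 1000, "rooms": 2, "type": "condo"},
    {"sqft": 2000, "rooms": 4, "type": "house"},
    {"sqft": 3000, "rooms": 5, "type": "house"},
]

scaled_sqft = min_max_scale([h["sqft"] for h in houses])

# Each feature row: [scaled size, sqft per room (created), is_condo, is_house]
features = [
    [scaled, h["sqft"] / h["rooms"]] + one_hot(h["type"], ["condo", "house"])
    for scaled, h in zip(scaled_sqft, houses)
]
```

The "sqft per room" column did not exist in the raw data. Creating it is a small example of encoding domain knowledge as a feature.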
4. Model Training
Split the data into training, validation, and test sets, then choose an algorithm and train the model on the training set. The model learns patterns from the training data, the validation set guides tuning, and the test set is held back for the final evaluation.
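A three-way split can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for illustration; the source does not prescribe specific ratios.

```python
import random


def split_dataset(rows, train_pct=70, val_pct=15, seed=42):
    """Shuffle, then cut the data into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test


data = list(range(100))  # stand-in for 100 prepared examples
train, validation, test = split_dataset(data)
```

Shuffling before cutting avoids accidental ordering effects, and every example lands in exactly one of the three sets, so the test set stays unseen during training.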
5. Model Evaluation
Test the model on data it has never seen (the test set). Use appropriate metrics: accuracy, precision, recall, F1 score for classification; RMSE, MAE for regression. Determine if the model meets your success criteria.
6. Model Deployment
Put the model into production where it can make predictions on real data. This could be a real-time API endpoint, a batch processing job, or an edge device.
7. Model Monitoring and Maintenance
Continuously monitor model performance. Data changes over time (data drift), which can degrade model accuracy. Retrain the model periodically with fresh data.
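A very simple drift check compares live input data against the training data. The sketch below measures how far the live mean has moved, in units of the training standard deviation. The threshold of 2 and the sample values are hypothetical, and production monitoring typically uses richer statistics than a single mean comparison.

```python
from statistics import mean, stdev


def drift_score(training_values, live_values):
    """Shift of the live mean, measured in training standard deviations."""
    shift = abs(mean(live_values) - mean(training_values))
    return shift / stdev(training_values)


def needs_retraining(training_values, live_values, threshold=2.0):
    # The threshold is an assumed cut-off chosen for this illustration.
    return drift_score(training_values, live_values) > threshold


# Hypothetical feature: customer age at training time vs. in production
training_ages = [30, 32, 34, 36, 38]
live_ages_stable = [31, 33, 35, 37]
live_ages_shifted = [50, 52, 54, 56]

stable_flag = needs_retraining(training_ages, live_ages_stable)
shifted_flag = needs_retraining(training_ages, live_ages_shifted)
```

Only the shifted sample triggers the retraining flag, because its average age has moved far outside the range the model was trained on.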
Key Terminology for the Exam
- Training data — Data used to teach the model
- Validation data — Data used to tune hyperparameters and compare candidate models during training
- Test data — Data used for final evaluation (never seen during training)
- Overfitting — Model memorizes training data but fails on new data (too complex)
- Underfitting — Model is too simple to capture patterns in the data
- Hyperparameters — Settings you configure before training (learning rate, number of layers)
- Inference — Using a trained model to make predictions on new data
- Epoch — One complete pass through the entire training dataset
- Label — The correct answer in supervised learning (the "y" in input-output pairs)
- Feature — An input variable used by the model to make predictions (the "x")
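Several of these terms can be seen together in one small training loop. The sketch below fits a one-weight model with stochastic gradient descent; the data, the learning rate, and the epoch count are hypothetical values chosen so the example converges quickly.

```python
def train(features, labels, learning_rate=0.05, epochs=200):
    """learning_rate and epochs are hyperparameters: set before training.
    weight is a model parameter: learned during training."""
    weight = 0.0
    for _ in range(epochs):  # one epoch = one full pass over the training data
        for x, y in zip(features, labels):
            error = weight * x - y
            # Gradient step for the loss 0.5 * error**2 with respect to weight
            weight -= learning_rate * error * x
    return weight


# Hypothetical training data: features (the "x") and labels (the "y")
features = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]

weight = train(features, labels)

# Inference: use the trained model on a new, unseen input.
prediction = weight * 4.0
```

The labels here are exactly twice the features, so the learned weight approaches 2 and the prediction for an input of 4 approaches 8.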
Common ML Evaluation Metrics
You do not need to calculate these for the exam, but you must know what they measure:
- Accuracy — Percentage of correct predictions. Misleading for imbalanced datasets.
- Precision — Of all positive predictions, how many were actually positive. Important when false positives are costly (for example, a spam filter that blocks legitimate email).
- Recall — Of all actual positives, how many the model caught. Important when false negatives are costly (for example, a disease screening that misses a sick patient).
- F1 Score — Harmonic mean of precision and recall. Use when you need a balance.
- AUC-ROC — Measures the model's ability to distinguish between classes across all thresholds.
- RMSE — Root Mean Squared Error for regression. Lower is better.
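Although the exam does not require calculations, seeing the metrics computed once can make the definitions easier to remember. The sketch below uses invented data, including an imbalanced case with 1 positive in 100 examples. Treating precision and recall as 0 when their denominator is 0 is a convention assumed here.

```python
import math


def classification_metrics(actual, predicted):
    """Compute metrics for binary labels where 1 = positive, 0 = negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def rmse(actual, predicted):
    """Root Mean Squared Error for regression. Lower is better."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Imbalanced case: 1 positive out of 100, and a model that always says "negative"
actual = [1] + [0] * 99
always_negative = [0] * 100
imbalanced = classification_metrics(actual, always_negative)

# A more balanced case with one false negative and one false positive
balanced = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                                  [1, 1, 0, 1, 0, 0, 0, 0])

regression_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
```

In the imbalanced case accuracy is 0.99 while recall is 0, which is why accuracy alone is misleading when one class is rare.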
Practice Questions
A) Supervised learning - classification
B) Supervised learning - regression
C) Unsupervised learning - clustering
D) Reinforcement learning
Answer:
C) Unsupervised learning - clustering. The company has no predefined categories (no labels), so supervised learning is out. They want to group similar customers, which is a clustering task. K-Means is a common clustering algorithm available in SageMaker.
A) Underfitting
B) Overfitting
C) Data drift
D) Feature engineering
Answer:
B) Overfitting. Overfitting occurs when the model memorizes training data (including noise) instead of learning general patterns. It performs well on training data but poorly on new data. Solutions include regularization, more training data, simpler models, or dropout (for neural networks).
A) The model is overfitting
B) Accuracy is misleading for imbalanced datasets
C) The model needs more epochs
D) The model is underfitting
Answer:
B) Accuracy is misleading for imbalanced datasets. When only 1% of cases are positive, always predicting negative gives 99% accuracy but catches zero actual cases. Better metrics for imbalanced data are recall (catches positives), precision, F1 score, or AUC-ROC.
A) They are three completely separate fields
B) ML is the broadest field, containing both AI and DL
C) AI is the broadest field, ML is a subset of AI, and DL is a subset of ML
D) DL is the broadest field, containing both AI and ML
Answer:
C) AI is the broadest field, ML is a subset of AI, and DL is a subset of ML. AI encompasses all intelligent systems. ML is a specific approach within AI that learns from data. Deep learning is a specific approach within ML that uses multi-layered neural networks.
A) Model training
B) Model deployment
C) Data collection and preparation
D) Model evaluation
Answer:
C) Data collection and preparation. Data preparation (cleaning, transforming, handling missing values, feature engineering) is commonly estimated to take 60-80% of total project time. This estimate is widely repeated in the ML industry and is a common exam topic.