AI Development (25%)
Domain 2 of the CompTIA AI+ exam — data preparation, feature engineering, model training, evaluation metrics, and the complete ML pipeline.
Data Preparation
Data preparation is typically the most time-consuming phase of an ML project, often accounting for 60-80% of total project time.
Data Collection
- Structured data — Organized in tables with rows and columns (databases, spreadsheets)
- Unstructured data — No predefined format (text, images, audio, video)
- Semi-structured data — Has some organization but not rigid (JSON, XML, emails)
Data Cleaning
- Missing values — Remove rows, fill with mean/median/mode, or use imputation techniques
- Duplicates — Identify and remove duplicate records
- Outliers — Detect and handle extreme values that may skew the model
- Inconsistent formats — Standardize date formats, units, text casing
- Noise — Remove irrelevant or erroneous data points
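The cleaning steps above can be sketched in plain Python. This is a hypothetical example on a small list of records (no pandas assumed); the field names and values are made up for illustration.

```python
from statistics import median

records = [
    {"age": 34, "city": "NYC"},
    {"age": None, "city": "boston"},   # missing value, inconsistent casing
    {"age": 34, "city": "NYC"},        # exact duplicate of the first record
    {"age": 250, "city": "Boston"},    # implausible outlier
]

# Fill missing ages with the median of known values (robust to the outlier)
known = [r["age"] for r in records if r["age"] is not None]
fill = median(known)
for r in records:
    if r["age"] is None:
        r["age"] = fill

# Standardize inconsistent text casing
for r in records:
    r["city"] = r["city"].upper()

# Remove exact duplicates while preserving order
seen, cleaned = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(r)

# Drop rows with implausible ages (outlier handling)
cleaned = [r for r in cleaned if 0 < r["age"] < 120]
```

Real projects usually do this with pandas, but the underlying operations are the same: impute, standardize, deduplicate, filter.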
Data Transformation
- Normalization — Scale features to a range (0 to 1). Use when features have different scales.
- Standardization — Transform to mean=0, std=1. Use when data follows a normal distribution.
- Encoding — Convert categorical data to numbers (one-hot encoding, label encoding)
- Data augmentation — Create additional training data by transforming existing data (rotate images, add noise to audio)
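A minimal sketch of three of these transformations on toy data: min-max normalization, z-score standardization, and one-hot encoding. The values are arbitrary; population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0]

# Normalization: scale to the [0, 1] range
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: shift and scale to mean 0, std 1
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# One-hot encoding: one binary column per category
colors = ["red", "blue", "green"]
categories = sorted(set(colors))            # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
```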
Feature Engineering
Feature engineering is the process of selecting, transforming, and creating input variables that help the model learn better.
- Feature selection — Choose the most relevant features. Remove redundant or irrelevant ones.
- Feature extraction — Create new features from existing ones (e.g., extract "day of week" from a date)
- Feature scaling — Ensure all features are on a similar scale to prevent some from dominating
- Dimensionality reduction — Reduce the number of features (PCA) to improve performance and reduce overfitting
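Feature extraction can be as simple as deriving new fields from an existing one. A sketch using the date example above, with a made-up timestamp:

```python
from datetime import datetime

raw = "2024-03-15 14:30:00"                      # hypothetical raw field
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Derive several features from the single timestamp column
features = {
    "day_of_week": ts.weekday(),   # 0 = Monday ... 6 = Sunday
    "hour": ts.hour,
    "is_weekend": ts.weekday() >= 5,
    "month": ts.month,
}
```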
Model Training
Data Splitting
- Training set (70-80%) — Used to train the model
- Validation set (10-15%) — Used to tune hyperparameters during training
- Test set (10-15%) — Used for final evaluation. Never seen during training.
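The split above can be sketched as follows, using illustrative 70/15/15 ratios and a fixed seed so the result is reproducible:

```python
import random

data = list(range(100))      # stand-in for 100 samples
random.seed(42)
random.shuffle(data)         # shuffle first so each split is representative

n = len(data)
n_train, n_val = round(0.70 * n), round(0.15 * n)

train = data[:n_train]
val = data[n_train : n_train + n_val]
test = data[n_train + n_val :]   # held out; never seen during training
```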
Training Concepts
- Epoch — One complete pass through the entire training dataset
- Batch size — Number of samples processed before updating model weights
- Learning rate — How much to adjust weights on each update. Too high = unstable. Too low = slow.
- Loss function — Measures how wrong the model's predictions are. Training minimizes this.
- Optimizer — Algorithm that adjusts weights to minimize the loss (SGD, Adam)
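The concepts above fit together in a training loop. A toy sketch: mini-batch gradient descent fitting a single weight in y = w·x, on synthetic data generated with w = 3, so every term (epoch, batch size, learning rate, loss, update step) is visible.

```python
data = [(x, 3.0 * x) for x in range(1, 9)]   # 8 synthetic samples of y = 3x

w = 0.0            # the single weight to learn
lr = 0.01          # learning rate
batch_size = 4     # samples per weight update

for epoch in range(50):                      # one epoch = full pass over data
    for i in range(0, len(data), batch_size):
        batch = data[i : i + batch_size]
        # Gradient of mean squared error wrt w: mean of 2 * (w*x - y) * x
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                       # SGD-style update step

# Final loss: mean squared error over the whole dataset
loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
```

After training, w converges to roughly 3 and the loss approaches zero; raising the learning rate too far makes the updates diverge instead.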
Overfitting vs Underfitting
- Overfitting — Model memorizes training data, performs poorly on new data. Solutions: more data, regularization, dropout, simpler model.
- Underfitting — Model is too simple, performs poorly on both training and new data. Solutions: more features, a more complex model, longer training.
Model Evaluation Metrics
Classification Metrics
- Accuracy — Percentage of correct predictions. Misleading for imbalanced datasets.
- Precision — Of all positive predictions, how many were actually positive? (Important when false positives are costly)
- Recall (Sensitivity) — Of all actual positives, how many did we catch? (Important when false negatives are costly)
- F1 Score — Harmonic mean of precision and recall. Use when you need a balance.
- Confusion matrix — Table showing true positives, true negatives, false positives, false negatives
- AUC-ROC — Measures model's ability to distinguish classes across all thresholds
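These metrics all derive from the four confusion-matrix counts. A sketch with hypothetical counts:

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)     # of predicted positives, how many were right
recall = tp / (tp + fn)        # of actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
```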
Regression Metrics
- MAE (Mean Absolute Error) — Average of absolute differences. Easy to interpret.
- MSE (Mean Squared Error) — Average of squared differences. Penalizes large errors more.
- RMSE (Root Mean Squared Error) — Square root of MSE. Same units as the target variable.
- R-squared — Proportion of variance explained by the model (0 to 1, higher is better)
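The four regression metrics computed on toy predictions (values are made up):

```python
from math import sqrt
from statistics import mean

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

errors = [p - t for p, t in zip(y_pred, y_true)]
mae = mean(abs(e) for e in errors)       # average absolute difference
mse = mean(e * e for e in errors)        # penalizes large errors more
rmse = sqrt(mse)                         # same units as the target

ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean(y_true)) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot                 # proportion of variance explained
```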
The ML Pipeline
- Problem definition — Define what you are solving and success criteria
- Data collection — Gather relevant data from various sources
- Data preparation — Clean, transform, and prepare data
- Feature engineering — Select and create input features
- Model selection — Choose appropriate algorithm(s)
- Training — Train the model on prepared data
- Evaluation — Assess performance with appropriate metrics
- Deployment — Put the model into production
- Monitoring — Track performance and detect drift
- Retraining — Update the model with new data periodically
Practice Questions
A hospital deploys a model to screen patients for cancer, where missing a real case is far more dangerous than a false alarm. Which metric should be prioritized?
A) Accuracy
B) Precision
C) Recall
D) R-squared
Show Answer
C) Recall. Recall measures the percentage of actual cancer cases that the model correctly identifies. High recall means fewer missed cases (false negatives). In medical diagnosis, missing a cancer case is life-threatening, so recall is the priority metric even if it means more false positives.
A model achieves 99% accuracy on the training data but only 60% accuracy on the test set. What is the most likely problem?
A) Underfitting
B) Overfitting
C) Data drift
D) Class imbalance
Show Answer
B) Overfitting. A large gap between training accuracy (99%) and test accuracy (60%) is the classic sign of overfitting. The model has memorized the training data instead of learning generalizable patterns. Solutions include regularization, more training data, dropout, and using a simpler model.
A dataset contains a "color" column with the values red, blue, and green. Which technique converts this categorical feature into numerical input without implying any order between the categories?
A) Normalization
B) Standardization
C) One-hot encoding
D) Data augmentation
Show Answer
C) One-hot encoding. One-hot encoding creates a binary column for each category. "red" becomes [1,0,0], "blue" becomes [0,1,0], "green" becomes [0,0,1]. This allows ML models to process categorical data as numerical input without implying ordinal relationships.
What is the primary purpose of the validation set?
A) Train the model on data
B) Tune hyperparameters and prevent overfitting
C) Provide final evaluation metrics
D) Augment the training data
Show Answer
B) Tune hyperparameters and prevent overfitting. The validation set is used during training to monitor performance and adjust hyperparameters (learning rate, model complexity). If the training loss decreases but validation loss increases, the model is overfitting. The test set (not validation) provides final evaluation.
A team with a small dataset wants to build a model quickly by starting from a model already trained on a large dataset and fine-tuning it for their task. Which technique does this describe?
A) Data augmentation
B) Ensemble learning
C) Transfer learning
D) Reinforcement learning
Show Answer
C) Transfer learning. Transfer learning leverages knowledge from a pre-trained model (trained on a large dataset) and fine-tunes it for a specific task with a smaller dataset. This is faster, requires less data, and often produces better results than training from scratch.