AI Development (25%)
Domain 2 of the CompTIA AI+ exam — data preparation, feature engineering, model training, evaluation metrics, and the complete ML pipeline.
Data Preparation
Data preparation is typically the most time-consuming phase of an ML project, often accounting for 60-80% of total project time.
Data Collection
- Structured data — Organized in tables with rows and columns (databases, spreadsheets)
- Unstructured data — No predefined format (text, images, audio, video)
- Semi-structured data — Has some organization but not rigid (JSON, XML, emails)
Data Cleaning
- Missing values — Remove rows, fill with mean/median/mode, or use imputation techniques
- Duplicates — Identify and remove duplicate records
- Outliers — Detect and handle extreme values that may skew the model
- Inconsistent formats — Standardize date formats, units, text casing
- Noise — Remove irrelevant or erroneous data points
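The cleaning steps above can be sketched in plain Python. This is a hypothetical example on a small list of records (no pandas assumed); the field names and values are made up for illustration.

```python
from statistics import median

records = [
    {"age": 34, "city": "NYC"},
    {"age": None, "city": "boston"},   # missing value, inconsistent casing
    {"age": 34, "city": "NYC"},        # exact duplicate of the first record
    {"age": 250, "city": "Boston"},    # implausible outlier
]

# Fill missing ages with the median of known values (robust to the outlier)
known = [r["age"] for r in records if r["age"] is not None]
fill = median(known)
for r in records:
    if r["age"] is None:
        r["age"] = fill

# Standardize inconsistent text casing
for r in records:
    r["city"] = r["city"].upper()

# Remove exact duplicates while preserving order
seen, cleaned = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(r)

# Drop rows with implausible ages (outlier handling)
cleaned = [r for r in cleaned if 0 < r["age"] < 120]
```

Real projects usually do this with pandas, but the underlying operations are the same: impute, standardize, deduplicate, filter.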
Data Transformation
- Normalization — Scale features to a range (0 to 1). Use when features have different scales.
- Standardization — Transform to mean=0, std=1. Use when data follows a normal distribution.
- Encoding — Convert categorical data to numbers (one-hot encoding, label encoding)
- Data augmentation — Create additional training data by transforming existing data (rotate images, add noise to audio)
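A minimal sketch of three of these transformations on toy data: min-max normalization, z-score standardization, and one-hot encoding. The values are arbitrary; population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0]

# Normalization: scale to the [0, 1] range
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: shift and scale to mean 0, std 1
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# One-hot encoding: one binary column per category
colors = ["red", "blue", "green"]
categories = sorted(set(colors))            # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
```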
Feature Engineering
Feature engineering is the process of selecting, transforming, and creating input variables that help the model learn better.
- Feature selection — Choose the most relevant features. Remove redundant or irrelevant ones.
- Feature extraction — Create new features from existing ones (e.g., extract "day of week" from a date)
- Feature scaling — Ensure all features are on a similar scale to prevent some from dominating
- Dimensionality reduction — Reduce the number of features (PCA) to improve performance and reduce overfitting
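Feature extraction can be as simple as deriving new fields from an existing one. A sketch using the date example above, with a made-up timestamp:

```python
from datetime import datetime

raw = "2024-03-15 14:30:00"                      # hypothetical raw field
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Derive several features from the single timestamp column
features = {
    "day_of_week": ts.weekday(),   # 0 = Monday ... 6 = Sunday
    "hour": ts.hour,
    "is_weekend": ts.weekday() >= 5,
    "month": ts.month,
}
```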
Model Training
Data Splitting
- Training set (70-80%) — Used to train the model
- Validation set (10-15%) — Used to tune hyperparameters during training
- Test set (10-15%) — Used for final evaluation. Never seen during training.
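The split above can be sketched as follows, using illustrative 70/15/15 ratios and a fixed seed so the result is reproducible:

```python
import random

data = list(range(100))      # stand-in for 100 samples
random.seed(42)
random.shuffle(data)         # shuffle first so each split is representative

n = len(data)
n_train, n_val = round(0.70 * n), round(0.15 * n)

train = data[:n_train]
val = data[n_train : n_train + n_val]
test = data[n_train + n_val :]   # held out; never seen during training
```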
Training Concepts
- Epoch — One complete pass through the entire training dataset
- Batch size — Number of samples processed before updating model weights
- Learning rate — How much to adjust weights on each update. Too high = unstable. Too low = slow.
- Loss function — Measures how wrong the model's predictions are. Training minimizes this.
- Optimizer — Algorithm that adjusts weights to minimize the loss (SGD, Adam)
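The concepts above fit together in a training loop. A toy sketch: mini-batch gradient descent fitting a single weight in y = w·x, on synthetic data generated with w = 3, so every term (epoch, batch size, learning rate, loss, update step) is visible.

```python
data = [(x, 3.0 * x) for x in range(1, 9)]   # 8 synthetic samples of y = 3x

w = 0.0            # the single weight to learn
lr = 0.01          # learning rate
batch_size = 4     # samples per weight update

for epoch in range(50):                      # one epoch = full pass over data
    for i in range(0, len(data), batch_size):
        batch = data[i : i + batch_size]
        # Gradient of mean squared error wrt w: mean of 2 * (w*x - y) * x
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                       # SGD-style update step

# Final loss: mean squared error over the whole dataset
loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
```

After training, w converges to roughly 3 and the loss approaches zero; raising the learning rate too far makes the updates diverge instead.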
Overfitting vs Underfitting
- Overfitting — Model memorizes training data, performs poorly on new data. Solutions: more data, regularization, dropout, simpler model.
- Underfitting — Model is too simple, performs poorly on both training and new data. Solutions: more features, a more complex model, longer training.
Model Evaluation Metrics
Classification Metrics
- Accuracy — Percentage of correct predictions. Misleading for imbalanced datasets.
- Precision — Of all positive predictions, how many were actually positive? (Important when false positives are costly)
- Recall (Sensitivity) — Of all actual positives, how many did we catch? (Important when false negatives are costly)
- F1 Score — Harmonic mean of precision and recall. Use when you need a balance.
- Confusion matrix — Table showing true positives, true negatives, false positives, false negatives
- AUC-ROC — Measures model's ability to distinguish classes across all thresholds
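These metrics all derive from the four confusion-matrix counts. A sketch with hypothetical counts:

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)     # of predicted positives, how many were right
recall = tp / (tp + fn)        # of actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
```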
Regression Metrics
- MAE (Mean Absolute Error) — Average of absolute differences. Easy to interpret.
- MSE (Mean Squared Error) — Average of squared differences. Penalizes large errors more.
- RMSE (Root Mean Squared Error) — Square root of MSE. Same units as the target variable.
- R-squared — Proportion of variance explained by the model (0 to 1, higher is better)
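The four regression metrics computed on toy predictions (values are made up):

```python
from math import sqrt
from statistics import mean

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

errors = [p - t for p, t in zip(y_pred, y_true)]
mae = mean(abs(e) for e in errors)       # average absolute difference
mse = mean(e * e for e in errors)        # penalizes large errors more
rmse = sqrt(mse)                         # same units as the target

ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean(y_true)) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot                 # proportion of variance explained
```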
The ML Pipeline
- Problem definition — Define what you are solving and success criteria
- Data collection — Gather relevant data from various sources
- Data preparation — Clean, transform, and prepare data
- Feature engineering — Select and create input features
- Model selection — Choose appropriate algorithm(s)
- Training — Train the model on prepared data
- Evaluation — Assess performance with appropriate metrics
- Deployment — Put the model into production
- Monitoring — Track performance and detect drift
- Retraining — Update the model with new data periodically
Practice Questions
A hospital deploys a model to screen patients for cancer, where missing a real case is far more dangerous than a false alarm. Which metric should be prioritized?
A) Accuracy
B) Precision
C) Recall
D) R-squared
Show Answer
C) Recall. Recall measures the percentage of actual cancer cases that the model correctly identifies. High recall means fewer missed cases (false negatives). In medical diagnosis, missing a cancer case is life-threatening, so recall is the priority metric even if it means more false positives.
A model achieves 99% accuracy on the training data but only 60% accuracy on the test set. What is the most likely problem?
A) Underfitting
B) Overfitting
C) Data drift
D) Class imbalance
Show Answer
B) Overfitting. A large gap between training accuracy (99%) and test accuracy (60%) is the classic sign of overfitting. The model has memorized the training data instead of learning generalizable patterns. Solutions include regularization, more training data, dropout, and using a simpler model.
A dataset contains a "color" column with the values red, blue, and green. Which technique converts this categorical feature into numerical input without implying any order between the categories?
A) Normalization
B) Standardization
C) One-hot encoding
D) Data augmentation
Show Answer
C) One-hot encoding. One-hot encoding creates a binary column for each category. "red" becomes [1,0,0], "blue" becomes [0,1,0], "green" becomes [0,0,1]. This allows ML models to process categorical data as numerical input without implying ordinal relationships.
What is the primary purpose of the validation set?
A) Train the model on data
B) Tune hyperparameters and prevent overfitting
C) Provide final evaluation metrics
D) Augment the training data
Show Answer
B) Tune hyperparameters and prevent overfitting. The validation set is used during training to monitor performance and adjust hyperparameters (learning rate, model complexity). If the training loss decreases but validation loss increases, the model is overfitting. The test set (not validation) provides final evaluation.
A team with a small dataset wants to build a model quickly by starting from a model already trained on a large dataset and fine-tuning it for their task. Which technique does this describe?
A) Data augmentation
B) Ensemble learning
C) Transfer learning
D) Reinforcement learning
Show Answer
C) Transfer learning. Transfer learning leverages knowledge from a pre-trained model (trained on a large dataset) and fine-tunes it for a specific task with a smaller dataset. This is faster, requires less data, and often produces better results than training from scratch.