Explore Data & Train Models (25-30%) Intermediate
This domain tests your ability to perform exploratory data analysis, engineer features, run AutoML experiments, and write custom training scripts. It is the second-highest weighted domain and requires both conceptual understanding and hands-on SDK knowledge.
Exploratory Data Analysis (EDA)
Before training any model, you need to understand your data. The exam tests your knowledge of EDA techniques and how to perform them in Azure ML notebooks.
Key EDA Tasks for the Exam
- Summary statistics — Mean, median, standard deviation, percentiles using df.describe()
- Missing value analysis — Identify nulls with df.isnull().sum(), decide imputation vs. removal
- Distribution analysis — Histograms, box plots, skewness detection
- Correlation analysis — Pearson/Spearman correlation matrices, multicollinearity detection
- Outlier detection — IQR method, Z-score, isolation forests
- Class imbalance — Check target variable distribution for classification tasks
```python
# EDA in Azure ML notebook (exam-relevant patterns)
import pandas as pd
import matplotlib.pyplot as plt
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Load data from registered data asset
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="...",
    resource_group_name="dp100-rg",
    workspace_name="dp100-workspace",
)

# Get data asset URI and load
data_asset = ml_client.data.get("customer-churn-data", version="1")
df = pd.read_csv(data_asset.path)

# Summary statistics
print(df.describe())
print(f"\nShape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# Check class balance for classification
print(f"\nTarget distribution:\n{df['churn'].value_counts(normalize=True)}")

# Correlation matrix
correlation = df.select_dtypes(include='number').corr()
print("\nHigh correlations (>0.8):")
for col in correlation.columns:
    for idx in correlation.index:
        if col != idx and abs(correlation.loc[idx, col]) > 0.8:
            print(f"  {col} <-> {idx}: {correlation.loc[idx, col]:.3f}")
```
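The outlier-detection techniques listed above (IQR method, Z-score) are not shown in the notebook snippet, so here is a minimal local sketch of the IQR rule with pandas. The series values are synthetic and purely illustrative:

```python
import pandas as pd

# Synthetic numeric column with one obvious outlier (illustrative data)
s = pd.Series([10, 12, 11, 13, 12, 11, 95, 10, 13, 12])

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```

The same fences can be applied per column of a DataFrame; Z-score filtering (`(s - s.mean()) / s.std()`) works similarly but is less robust when extreme values inflate the standard deviation.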
Feature Engineering
The exam tests your understanding of common feature engineering techniques and when to apply them.
| Technique | When to Use | Azure ML Approach |
|---|---|---|
| One-hot encoding | Categorical features with few unique values | pd.get_dummies() or sklearn OneHotEncoder |
| Label encoding | Ordinal categorical features | sklearn LabelEncoder |
| Normalization / Scaling | Features with different scales | StandardScaler, MinMaxScaler |
| Log transform | Highly skewed numerical features | np.log1p() |
| Binning | Converting continuous to categorical | pd.cut() or pd.qcut() |
| Feature selection | Reducing dimensionality, removing noise | Variance threshold, mutual information, RFE |
| Polynomial features | Capturing non-linear relationships | sklearn PolynomialFeatures |
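Several of the techniques in the table can be sketched together with pandas and scikit-learn. The DataFrame and column names below (plan, income, age) are hypothetical, chosen only to illustrate each transform:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical customer frame (illustrative column names and values)
df = pd.DataFrame({
    "plan": ["basic", "premium", "basic", "standard"],  # categorical
    "income": [30_000, 250_000, 45_000, 60_000],        # skewed numeric
    "age": [22, 35, 58, 41],                            # continuous
})

# One-hot encoding for low-cardinality categoricals
df = pd.get_dummies(df, columns=["plan"])

# Log transform for skewed features (log1p handles zeros safely)
df["income_log"] = np.log1p(df["income"])

# Scaling: zero mean / unit variance
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# Binning: continuous -> categorical (explicit edges with pd.cut)
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "senior"])
print(df.columns.tolist())
```

In real pipelines the sklearn transformers (fit on training data only, then applied to test data) are preferred over ad hoc column assignments, because they prevent data leakage.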
AutoML in Azure ML
AutoML automates model selection, hyperparameter tuning, and feature engineering. The exam heavily tests AutoML configuration and interpretation of its output.
```python
# Configure and run AutoML with SDK v2
from azure.ai.ml import automl, Input

# Classification task
classification_job = automl.classification(
    compute="dp100-cluster",
    experiment_name="churn-automl",
    training_data=Input(
        type="mltable",
        path="azureml://datastores/workspaceblobstore/paths/data/churn-mltable/",
    ),
    target_column_name="churn",
    primary_metric="AUC_weighted",
    n_cross_validations=5,             # K-fold cross validation
    enable_model_explainability=True,  # Generate feature importance
)

# Limits: budget and parallelism (key configuration options for the exam)
classification_job.set_limits(
    max_trials=20,                  # Max models to try
    max_concurrent_trials=4,        # Parallel trials
    timeout_minutes=60,             # Total time budget
    enable_early_termination=True,  # Stop poor-performing runs
)

# Allowed/blocked models
classification_job.set_training(
    allowed_training_algorithms=[
        "LogisticRegression",
        "LightGBM",
        "RandomForest",
        "GradientBoosting",
        "XGBoostClassifier",
    ]
)

# Featurization settings: "auto", "off", or custom
classification_job.set_featurization(mode="auto")

# Submit the job
returned_job = ml_client.jobs.create_or_update(classification_job)
print(f"Job URL: {returned_job.studio_url}")
```
AutoML Task Types
| Task | Function | Common Metrics |
|---|---|---|
| Classification | automl.classification() | AUC_weighted, accuracy, F1_score_weighted |
| Regression | automl.regression() | normalized_RMSE, R2_score, MAE |
| Forecasting | automl.forecasting() | normalized_RMSE, MAPE |
| Image Classification | automl.image_classification() | accuracy, AUC |
| Object Detection | automl.image_object_detection() | mAP (mean Average Precision) |
| NLP Text Classification | automl.text_classification() | accuracy, AUC_weighted |
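Why AUC_weighted is usually preferred over plain accuracy for imbalanced classification can be demonstrated locally with scikit-learn. The labels below are synthetic, built only to mimic a 98/2 class split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic imbalanced target: 98% negative, 2% positive (illustrative)
y_true = np.array([0] * 98 + [1] * 2)

# A useless model that always predicts the majority class
y_pred = np.zeros(100, dtype=int)
y_score = np.zeros(100)  # constant scores carry no ranking signal

print(accuracy_score(y_true, y_pred))  # 0.98 — looks great, but is meaningless
print(roc_auc_score(y_true, y_score))  # 0.5  — no better than chance
```

Threshold-free ranking metrics like AUC expose the model's inability to separate the classes, which accuracy hides entirely on skewed data.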
Custom Training Scripts
When AutoML is not sufficient, you write custom training scripts and submit them as command jobs.
```python
# Custom training script: train.py
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report


def main():
    # Parse arguments (passed from pipeline/command job)
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str, required=True)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--max-depth", type=int, default=3)
    args = parser.parse_args()

    # Enable MLflow autologging
    mlflow.sklearn.autolog()

    # Load data
    df = pd.read_csv(args.input_data)

    # Split features and target
    X = df.drop("churn", axis=1)
    y = df["churn"]

    # Train/test split (stratified to preserve class balance)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train model
    model = GradientBoostingClassifier(
        learning_rate=args.learning_rate,
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Evaluate
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    # Log metrics (MLflow)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("auc", auc)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"AUC: {auc:.4f}")
    print(classification_report(y_test, predictions))


if __name__ == "__main__":
    main()
```
```python
# Submit custom training as a command job
from azure.ai.ml import command, Input

# Define command job
job = command(
    code="./src",  # Local folder with train.py
    command="python train.py --input-data ${{inputs.data}} --learning-rate ${{inputs.lr}} --n-estimators ${{inputs.n_est}}",
    inputs={
        "data": Input(
            type="uri_file",
            path="azureml://datastores/workspaceblobstore/paths/data/churn.csv",
        ),
        "lr": 0.05,
        "n_est": 200,
    },
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="dp100-cluster",
    experiment_name="churn-custom-training",
    display_name="gradient-boosting-v1",
)

returned_job = ml_client.jobs.create_or_update(job)
print(f"Job URL: {returned_job.studio_url}")
```
Hyperparameter Tuning (Sweep Jobs)
Azure ML sweep jobs automate hyperparameter search. Know the sampling methods and early termination policies for the exam.
```python
# Hyperparameter sweep job
from azure.ai.ml.sweep import Choice, Uniform, BanditPolicy

# Define the search space by overriding the command job's inputs
job_for_sweep = job(
    lr=Uniform(min_value=0.001, max_value=0.3),
    n_est=Choice(values=[50, 100, 200, 500]),
)

# Convert the command to a sweep job
sweep_job = job_for_sweep.sweep(
    sampling_algorithm="random",  # random, grid, or bayesian
                                  # (bayesian does not support early termination)
    primary_metric="auc",         # Must match a metric logged by train.py
    goal="maximize",
)

# Limits: trial budget and parallelism
sweep_job.set_limits(
    max_total_trials=20,
    max_concurrent_trials=4,
    timeout=3600,  # Total sweep timeout in seconds
)

# Early termination policy
sweep_job.early_termination = BanditPolicy(
    slack_factor=0.1,       # Terminate runs >10% worse than the best
    evaluation_interval=2,  # Check every 2 metric-reporting intervals
    delay_evaluation=5,     # Delay the first check for 5 intervals
)

returned_sweep = ml_client.jobs.create_or_update(sweep_job)
print(f"Sweep job URL: {returned_sweep.studio_url}")
```
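Grid sampling deserves a note: it only accepts discrete Choice values and enumerates their full Cartesian product, so the trial budget must cover every combination. A quick local check of the combination count for a hypothetical 3x3x3 search space:

```python
from itertools import product

# Hypothetical discrete search space: 3 values per hyperparameter
search_space = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200, 500],
}

# Grid sampling tries every combination in the Cartesian product
combinations = list(product(*search_space.values()))
print(len(combinations))  # 27 trials for an exhaustive grid
```

Random and Bayesian sampling, by contrast, also accept continuous distributions such as Uniform and do not need to visit every point.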
Practice Questions
1. You are training an AutoML classification model to detect fraud, and only about 2% of the records in the dataset are fraudulent. Which primary metric should you choose?

A. accuracy
B. AUC_weighted
C. norm_macro_recall
D. precision_score_weighted
Answer: B. AUC_weighted. For imbalanced datasets, accuracy is misleading (a model predicting all non-fraud gets 98% accuracy). AUC_weighted evaluates the model's ability to distinguish between classes regardless of threshold and handles class imbalance well. norm_macro_recall could also work, but AUC_weighted is the recommended default for imbalanced classification in Azure AutoML.
2. You want an AutoML experiment to train only a specific set of algorithms. Which configuration parameter do you use?

A. featurization
B. allowed_training_algorithms
C. primary_metric
D. max_trials
Answer: B. allowed_training_algorithms. This parameter restricts AutoML to only the specified algorithms (e.g., LightGBM, RandomForest, XGBoostRegressor). The featurization parameter controls data preprocessing, primary_metric sets the optimization target, and max_trials caps the number of trials.
3. Your sweep search space has three hyperparameters with three discrete values each (27 combinations), and your budget allows 27 trials. Which sampling method should you use?

A. Random
B. Grid
C. Bayesian
D. Sobol
Answer: B. Grid. With only 27 total combinations and a budget of 27 trials, grid sampling exhaustively tests every combination, guaranteeing you find the best configuration in the space. Random sampling might miss some combinations. Bayesian is more efficient for large search spaces but unnecessary when you can afford an exhaustive search. Note that grid sampling requires all hyperparameters to be discrete Choice values.
4. Which experiment tracking framework is natively integrated into Azure ML for logging metrics, parameters, and models from training scripts?

A. TensorBoard
B. Weights & Biases
C. MLflow
D. Neptune.ai
Answer: C. MLflow. Azure ML natively integrates with MLflow for experiment tracking, model logging, and the model registry. You can call mlflow.log_metric(), mlflow.log_artifact(), and mlflow.sklearn.autolog() directly in Azure ML training scripts. TensorBoard is supported for visualization, but MLflow is the primary tracking framework.
5. A sweep job uses an early termination policy of BanditPolicy(slack_factor=0.1, evaluation_interval=2). What does this policy do?

A. Terminates runs that are 10% slower than the fastest run
B. Terminates runs whose primary metric is more than 10% worse than the best run, checked every 2 intervals
C. Allocates 10% more resources to the top 2 runs
D. Randomly terminates 10% of runs every 2 intervals
Answer: B. Terminates runs whose primary metric is more than 10% worse than the best run, checked every 2 intervals. The Bandit policy compares each run to the best-performing run: if a run's metric falls outside the slack factor (here 10%) relative to the best, the run is terminated. The check happens at each evaluation_interval (every 2 metric-reporting intervals), which saves compute by stopping unpromising trials early.
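The slack-factor arithmetic can be checked by hand. For a maximized metric, Azure ML's documented Bandit rule terminates a run when its metric falls below best / (1 + slack_factor); the helper below is a hypothetical sketch of that check, not SDK code:

```python
def bandit_should_terminate(metric: float, best: float, slack_factor: float) -> bool:
    """Bandit early-termination check for a maximized metric (illustrative):
    terminate when the run falls outside the slack of the current best."""
    return metric < best / (1 + slack_factor)

# Best run so far has AUC 0.90; slack_factor 0.1 -> threshold ~0.818
print(bandit_should_terminate(0.85, best=0.90, slack_factor=0.1))  # False: keep running
print(bandit_should_terminate(0.75, best=0.90, slack_factor=0.1))  # True: terminated
```

The related slack_amount variant uses an absolute difference (terminate when metric < best - slack_amount) instead of a ratio.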