Explore Data & Train Models (25-30%) Intermediate

This domain tests your ability to perform exploratory data analysis, engineer features, run AutoML experiments, and write custom training scripts. It is the second-highest weighted domain and requires both conceptual understanding and hands-on SDK knowledge.

Exploratory Data Analysis (EDA)

Before training any model, you need to understand your data. The exam tests your knowledge of EDA techniques and how to perform them in Azure ML notebooks.

Key EDA Tasks for the Exam

  • Summary statistics — Mean, median, standard deviation, percentiles using df.describe()
  • Missing value analysis — Identify nulls with df.isnull().sum(), decide imputation vs. removal
  • Distribution analysis — Histograms, box plots, skewness detection
  • Correlation analysis — Pearson/Spearman correlation matrices, multicollinearity detection
  • Outlier detection — IQR method, Z-score, isolation forests
  • Class imbalance — Check target variable distribution for classification tasks
# EDA in Azure ML notebook (exam-relevant patterns)
import pandas as pd
import matplotlib.pyplot as plt
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Load data from registered data asset
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="...",
    resource_group_name="dp100-rg",
    workspace_name="dp100-workspace"
)

# Get data asset URI and load (reading azureml:// paths with pandas requires the azureml-fsspec package)
data_asset = ml_client.data.get("customer-churn-data", version="1")
df = pd.read_csv(data_asset.path)

# Summary statistics
print(df.describe())
print(f"\nShape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# Check class balance for classification
print(f"\nTarget distribution:\n{df['churn'].value_counts(normalize=True)}")

# Correlation matrix (numeric columns only); walk the upper triangle so each pair prints once
correlation = df.select_dtypes(include='number').corr()
print("\nHigh correlations (>0.8):")
cols = correlation.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(correlation.iloc[i, j]) > 0.8:
            print(f"  {cols[i]} <-> {cols[j]}: {correlation.iloc[i, j]:.3f}")
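The outlier-detection bullet above (IQR method) can be sketched in isolation; this snippet uses toy values in place of the real churn data, with a made-up column name:

```python
import pandas as pd

# Toy data standing in for a numeric feature (hypothetical values)
sample = pd.Series([22, 25, 27, 28, 29, 30, 31, 400], name="monthly_charges")

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = sample.quantile(0.25), sample.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sample[(sample < lower) | (sample > upper)]
print(outliers)  # only the 400 value is flagged
```

Whether you then drop or cap flagged rows is the imputation-vs-removal decision the exam expects you to justify from context.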

Feature Engineering

The exam tests your understanding of common feature engineering techniques and when to apply them.

| Technique | When to Use | Azure ML Approach |
| --- | --- | --- |
| One-hot encoding | Categorical features with few unique values | pd.get_dummies() or sklearn OneHotEncoder |
| Label encoding | Ordinal categorical features | sklearn LabelEncoder |
| Normalization / Scaling | Features with different scales | StandardScaler, MinMaxScaler |
| Log transform | Highly skewed numerical features | np.log1p() |
| Binning | Converting continuous to categorical | pd.cut() or pd.qcut() |
| Feature selection | Reducing dimensionality, removing noise | Variance threshold, mutual information, RFE |
| Polynomial features | Capturing non-linear relationships | sklearn PolynomialFeatures |
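Several of the techniques in the table can be sketched together on a toy frame; the column names (contract, tenure, total_charges) are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for real churn features (hypothetical columns)
df = pd.DataFrame({
    "contract": ["month", "year", "month", "two-year"],
    "tenure": [1, 24, 6, 60],
    "total_charges": [50.0, 1200.0, 300.0, 5000.0],
})

# One-hot encoding for a low-cardinality categorical feature
encoded = pd.get_dummies(df, columns=["contract"])

# Min-max scaling for a numeric feature on its own range
encoded["tenure_scaled"] = MinMaxScaler().fit_transform(encoded[["tenure"]])

# Log transform for a right-skewed numeric feature
encoded["log_charges"] = np.log1p(encoded["total_charges"])

# Binning a continuous feature into 3 equal-width buckets
encoded["tenure_bin"] = pd.cut(encoded["tenure"], bins=3, labels=["low", "mid", "high"])

print(encoded.columns.tolist())
```

With featurization="auto", AutoML applies similar transformations for you; custom training scripts apply them explicitly like this.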

AutoML in Azure ML

AutoML automates model selection, hyperparameter tuning, and feature engineering. The exam heavily tests AutoML configuration and how to interpret its output.

# Configure and run AutoML with SDK v2
from azure.ai.ml import automl, Input

# Classification task
classification_job = automl.classification(
    compute="dp100-cluster",
    experiment_name="churn-automl",
    training_data=Input(
        type="mltable",
        path="azureml://datastores/workspaceblobstore/paths/data/churn-mltable/"
    ),
    target_column_name="churn",
    primary_metric="AUC_weighted",
    enable_model_explainability=True,    # Generate feature importance
    n_cross_validations=5                # K-fold cross validation
)

# Key limit options for the exam (set via set_limits in SDK v2)
classification_job.set_limits(
    enable_early_termination=True,       # Stop poor-performing trials
    max_trials=20,                       # Max models to try
    max_concurrent_trials=4,             # Parallel trials
    timeout_minutes=60                   # Total time budget
)

# Allowed/blocked models (set via set_training)
classification_job.set_training(
    allowed_training_algorithms=[
        "LogisticRegression",
        "LightGBM",
        "RandomForest",
        "GradientBoosting",
        "XGBoostClassifier"
    ]
)

# Featurization settings: auto, off, or custom
classification_job.set_featurization(mode="auto")

# Submit the job
returned_job = ml_client.jobs.create_or_update(classification_job)
print(f"Job URL: {returned_job.studio_url}")
Exam Tip: Know the primary metrics for each task type. Classification: AUC_weighted, accuracy, precision_score_weighted. Regression: normalized_root_mean_squared_error, r2_score, spearman_correlation. Forecasting: normalized_root_mean_squared_error, normalized_mean_absolute_error. The exam often asks which metric is most appropriate for a given scenario.

AutoML Task Types

| Task | Function | Common Metrics |
| --- | --- | --- |
| Classification | automl.classification() | AUC_weighted, accuracy, F1_score_weighted |
| Regression | automl.regression() | normalized_RMSE, R2_score, MAE |
| Forecasting | automl.forecasting() | normalized_RMSE, MAPE |
| Image Classification | automl.image_classification() | accuracy, AUC |
| Object Detection | automl.image_object_detection() | mAP (mean Average Precision) |
| NLP Text Classification | automl.text_classification() | accuracy, AUC_weighted |

Custom Training Scripts

When AutoML is not sufficient, you write custom training scripts and submit them as command jobs.

# Custom training script: train.py
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

def main():
    # Parse arguments (passed from pipeline/command job)
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str, required=True)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--max-depth", type=int, default=3)
    args = parser.parse_args()

    # Enable MLflow autologging
    mlflow.sklearn.autolog()

    # Load data
    df = pd.read_csv(args.input_data)

    # Split features and target
    X = df.drop("churn", axis=1)
    y = df["churn"]

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train model
    model = GradientBoostingClassifier(
        learning_rate=args.learning_rate,
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Evaluate
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    # Log metrics (MLflow)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("auc", auc)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"AUC: {auc:.4f}")
    print(classification_report(y_test, predictions))

if __name__ == "__main__":
    main()
# Submit custom training as a command job
from azure.ai.ml import command, Input

# Define command job
job = command(
    code="./src",                          # Local folder with train.py
    command="python train.py --input-data ${{inputs.data}} --learning-rate ${{inputs.lr}} --n-estimators ${{inputs.n_est}}",
    inputs={
        "data": Input(
            type="uri_file",
            path="azureml://datastores/workspaceblobstore/paths/data/churn.csv"
        ),
        "lr": 0.05,
        "n_est": 200
    },
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="dp100-cluster",
    experiment_name="churn-custom-training",
    display_name="gradient-boosting-v1"
)

returned_job = ml_client.jobs.create_or_update(job)
print(f"Job URL: {returned_job.studio_url}")

Hyperparameter Tuning (Sweep Jobs)

Azure ML sweep jobs automate hyperparameter search. Know the sampling methods and early termination policies for the exam.

# Hyperparameter sweep job
from azure.ai.ml.sweep import Choice, Uniform, BanditPolicy

# Define the search space by overriding the command job's inputs
command_job_for_sweep = job(
    lr=Uniform(min_value=0.001, max_value=0.3),
    n_est=Choice(values=[50, 100, 200, 500])
)

# Convert command to sweep
sweep_job = command_job_for_sweep.sweep(
    sampling_algorithm="random",            # random, grid, or bayesian
    primary_metric="auc",                   # Must match a metric logged by train.py
    goal="maximize"
)

# Limits
sweep_job.set_limits(
    max_total_trials=20,
    max_concurrent_trials=4,
    timeout=3600                            # seconds
)

# Early termination policy (not supported with bayesian sampling)
sweep_job.early_termination = BanditPolicy(
    slack_factor=0.1,                       # Allow 10% slack from the best run
    evaluation_interval=2,                  # Check every 2 metric-reporting intervals
    delay_evaluation=5                      # Skip the first 5 evaluation intervals
)

returned_sweep = ml_client.jobs.create_or_update(sweep_job)
Exam Tip: Know the three sampling algorithms. Grid: tries every combination (exhaustive; use for small, discrete search spaces). Random: samples randomly (good balance of exploration and speed). Bayesian: uses results from earlier trials to pick the next ones (most sample-efficient, but does not support early termination policies and only works with a limited set of distributions such as choice and uniform). The exam often asks which sampling method to choose given constraints.

Practice Questions

Question 1: You are building an AutoML classification experiment for fraud detection. The dataset is highly imbalanced (2% fraud, 98% non-fraud). Which primary metric should you select?

A. accuracy
B. AUC_weighted
C. norm_macro_recall
D. precision_score_weighted

Show Answer

B. AUC_weighted. For imbalanced datasets, accuracy is misleading (a model that predicts all non-fraud gets 98% accuracy). AUC_weighted evaluates the model's ability to distinguish between classes regardless of threshold and handles class imbalance well. norm_macro_recall could also work, but AUC_weighted is the recommended default for imbalanced classification in Azure AutoML.
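The "accuracy is misleading" point can be demonstrated with a tiny deterministic sketch: a model that always predicts non-fraud scores 98% accuracy on 2%-fraud data, while AUC exposes it as useless:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Deterministic imbalanced labels: 2% fraud (1), 98% non-fraud (0)
y_true = np.array([1] * 20 + [0] * 980)

# A useless model that always predicts "non-fraud" with a constant score
y_pred = np.zeros_like(y_true)
scores = np.full(len(y_true), 0.5)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.98 -- looks great
print(f"AUC: {roc_auc_score(y_true, scores):.2f}")        # 0.50 -- no better than chance
```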

Question 2: You want AutoML to try only tree-based algorithms for a regression task. Which parameter do you configure?

A. featurization
B. allowed_training_algorithms
C. primary_metric
D. max_trials

Show Answer

B. allowed_training_algorithms. This parameter restricts AutoML to only the specified algorithms (e.g., LightGBM, RandomForest, XGBoostRegressor). Featurization controls data preprocessing. Primary_metric sets the optimization target. Max_trials limits the number of experiments.

Question 3: You are configuring a hyperparameter sweep job. You have a small search space with 3 hyperparameters, each with 3 possible values (27 total combinations). The training budget allows 27 trials. Which sampling algorithm should you use?

A. Random
B. Grid
C. Bayesian
D. Sobol

Show Answer

B. Grid. With only 27 total combinations and a budget for 27 trials, grid search will exhaustively test every combination, guaranteeing you find the best configuration. Random sampling might miss some combinations. Bayesian is more efficient for large search spaces but unnecessary when you can afford exhaustive search.
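The 27-combination arithmetic is just a Cartesian product; a quick sketch with hypothetical hyperparameter values:

```python
from itertools import product

# Hypothetical search space: 3 hyperparameters x 3 values each
space = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200, 500],
}

# Grid search enumerates the full Cartesian product of all values
grid = list(product(*space.values()))
print(len(grid))  # 27 combinations -- matches the 27-trial budget exactly
```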

Question 4: You need to log custom metrics and model artifacts during training. Which framework is natively integrated with Azure ML for experiment tracking?

A. TensorBoard
B. Weights & Biases
C. MLflow
D. Neptune.ai

Show Answer

C. MLflow. Azure ML natively integrates with MLflow for experiment tracking, model logging, and model registry. You can use mlflow.log_metric(), mlflow.log_artifact(), and mlflow.sklearn.autolog() directly in Azure ML training scripts. TensorBoard is supported for visualization but MLflow is the primary tracking framework.

Question 5: You configure a BanditPolicy with slack_factor=0.1 and evaluation_interval=2. What does this policy do?

A. Terminates runs that are 10% slower than the fastest run
B. Terminates runs whose primary metric is more than 10% worse than the best run, checked every 2 intervals
C. Allocates 10% more resources to the top 2 runs
D. Randomly terminates 10% of runs every 2 intervals

Show Answer

B. Terminates runs whose primary metric is more than 10% worse than the best run, checked every 2 intervals. The Bandit policy compares each run to the best performing run. If a run's metric falls outside the slack factor (10%) from the best, it is terminated. This is checked at each evaluation_interval (every 2 reporting intervals). This saves compute by stopping unpromising trials early.
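The cutoff arithmetic can be sketched directly, assuming the documented best / (1 + slack_factor) rule for maximized metrics:

```python
# Bandit policy cutoff for a maximized metric: a trial is terminated
# when its metric falls below best / (1 + slack_factor)
def bandit_cutoff(best_metric: float, slack_factor: float) -> float:
    return best_metric / (1 + slack_factor)

# Example: best AUC so far is 0.90 with slack_factor=0.1
cutoff = bandit_cutoff(0.90, slack_factor=0.1)
print(f"Trials below AUC {cutoff:.3f} are terminated")  # 0.818
```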