Intermediate

Model Training & Tuning

Master MLflow experiment tracking, Databricks AutoML, Hyperopt for hyperparameter tuning, and distributed training patterns — covering approximately 30% of the exam (Experimentation domain).

MLflow Experiment Tracking

MLflow is the backbone of ML experimentation on Databricks. The exam heavily tests your knowledge of the MLflow Tracking API, including runs, parameters, metrics, artifacts, and experiment organization.

Core MLflow Concepts

  • Experiment — A named container for related runs (e.g., one experiment per ML project or model)
  • Run — A single execution of ML code that logs parameters, metrics, and artifacts
  • Parameters — Input settings (e.g., learning rate, number of trees)
  • Metrics — Output measurements (e.g., accuracy, RMSE, F1 score)
  • Artifacts — Output files (e.g., model files, plots, data samples)
  • Tags — Key-value metadata for organizing and searching runs

Logging an Experiment Run

import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Set the experiment (creates if not exists)
mlflow.set_experiment("/Users/team/churn-prediction")

with mlflow.start_run(run_name="rf-baseline") as run:
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("min_samples_split", 5)

    # Train model
    rf = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=5)
    rf.fit(X_train, y_train)

    # Log metrics
    y_pred = rf.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log the model
    mlflow.sklearn.log_model(rf, "model")

    # Log artifacts (the file must already exist on local disk)
    mlflow.log_artifact("confusion_matrix.png")

    print(f"Run ID: {run.info.run_id}")
💡 Exam tip: Know the difference between mlflow.log_param() (single value) and mlflow.log_params() (dictionary of values). Same for mlflow.log_metric() vs mlflow.log_metrics(). Also know that log_metric() accepts an optional step parameter for tracking metrics over training iterations.

MLflow Autologging

Databricks supports automatic logging for popular frameworks:

# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Enable autologging for all supported frameworks
mlflow.autolog()

# Now any model training automatically logs params, metrics, and artifacts
rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train)  # automatically logged!
Autolog scope: When mlflow.autolog() is enabled and no run is active, MLflow creates a new run for each fit() call. To log additional custom metrics or artifacts alongside the autologged values, wrap the training in mlflow.start_run() so that everything lands in the same run. The exam tests this interaction.

Databricks AutoML

Databricks AutoML automatically prepares data, trains multiple models, and generates notebooks with the best-performing approaches. It is a glass-box solution — all code is visible and editable.

Using AutoML

from databricks import automl

# Classification
summary = automl.classify(
    dataset=train_df,
    target_col="churned",
    primary_metric="f1",
    timeout_minutes=30  # max_trials is not supported on Databricks Runtime 11.0 ML and above
)

# Access the best trial
best_trial = summary.best_trial
print(f"Best F1: {best_trial.metrics['test_f1_score']}")
print(f"Best run: {best_trial.mlflow_run_id}")

AutoML Key Features

  • Glass-box approach — Generates editable notebooks for the best trials, not black-box models
  • Automatic feature engineering — Handles missing values, one-hot encoding, and feature selection
  • Multiple algorithms — Tests XGBoost, LightGBM, sklearn, and other frameworks automatically
  • MLflow integration — All trials are logged to MLflow for comparison
  • Data exploration notebook — Generates a summary statistics and visualization notebook
💡
Exam tip: AutoML supports three problem types: automl.classify(), automl.regress(), and automl.forecast(). Know which primary_metric options are available for each (e.g., "f1" for classification, "rmse" for regression).

Hyperopt for Hyperparameter Tuning

Hyperopt is the recommended library for distributed hyperparameter optimization on Databricks. It uses Bayesian optimization (Tree of Parzen Estimators) to efficiently search the parameter space.

Core Hyperopt Components

  • fmin() — The main function that minimizes an objective function
  • hp.choice() — Choose from a list of discrete options
  • hp.uniform() — Uniform distribution over a continuous range
  • hp.loguniform() — Log-uniform distribution (for learning rates)
  • hp.quniform() — Quantized uniform (for integer parameters)
  • SparkTrials — Distributes trials across Spark workers
  • Trials — Tracks results on a single machine
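
What each search-space expression samples can be sketched in plain Python. The helper functions below are illustrative stand-ins that follow the documented definitions of the hp.* expressions; they are not Hyperopt's implementation, and the parameter names in the sample are hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def uniform(low, high):
    """Like hp.uniform: any float between low and high."""
    return rng.uniform(low, high)

def loguniform(low, high):
    """Like hp.loguniform: exp(uniform(low, high)); bounds are in log space."""
    return math.exp(rng.uniform(low, high))

def quniform(low, high, q):
    """Like hp.quniform: round(uniform(low, high) / q) * q, as a float."""
    return float(round(rng.uniform(low, high) / q) * q)

def choice(options):
    """Like hp.choice: pick one item from a list of options."""
    return rng.choice(options)

# Hypothetical search space drawn once with the stand-in helpers.
sample = {
    "n_estimators": quniform(50, 500, 50),
    "learning_rate": loguniform(math.log(1e-4), math.log(1e-1)),
    "subsample": uniform(0.5, 1.0),
    "criterion": choice(["gini", "entropy"]),
}
print(sample)
```

Two details follow from these definitions. hp.quniform yields floats such as 150.0, so integer hyperparameters need an int() cast inside the objective function. hp.loguniform takes its bounds in log space, so a learning-rate range of 1e-4 to 1e-1 is written with math.log (or np.log) around each bound.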

Distributed Hyperopt Example

import mlflow
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def objective(params):
    with mlflow.start_run(nested=True):
        rf = RandomForestClassifier(
            n_estimators=int(params["n_estimators"]),
            max_depth=int(params["max_depth"]),
            min_samples_split=int(params["min_samples_split"])
        )
        rf.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, rf.predict(X_test))

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", accuracy)

        # Hyperopt MINIMIZES, so return negative accuracy
        return {"loss": -accuracy, "status": STATUS_OK}

search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth": hp.quniform("max_depth", 3, 20, 1),
    "min_samples_split": hp.quniform("min_samples_split", 2, 10, 1)
}

# Distribute across Spark workers
spark_trials = SparkTrials(parallelism=4)

with mlflow.start_run(run_name="hyperopt-tuning"):
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=50,
        trials=spark_trials
    )
Critical exam concept: Hyperopt fmin() always minimizes the loss. If you want to maximize accuracy, return -accuracy as the loss. This is one of the most commonly tested Hyperopt concepts. Also note that SparkTrials distributes single-machine models across workers, while Trials runs everything on the driver.
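
The sign convention can be seen in a toy example that needs no Hyperopt at all. Here toy_fmin is a hypothetical exhaustive stand-in for fmin(), and accuracy_for is a made-up accuracy curve; only the minimize-the-loss behavior mirrors the real library.

```python
def accuracy_for(max_depth):
    """Hypothetical accuracy curve that peaks at max_depth = 7."""
    return 0.90 - 0.01 * abs(max_depth - 7)

def objective(max_depth):
    # A minimizer looks for the SMALLEST loss, so negate the accuracy.
    return {"loss": -accuracy_for(max_depth), "status": "ok"}

def forgot_the_sign(max_depth):
    # Bug: returning accuracy directly makes the minimizer hunt for the worst model.
    return {"loss": accuracy_for(max_depth), "status": "ok"}

def toy_fmin(fn, candidates):
    """Exhaustive stand-in for fmin(): return the candidate with the lowest loss."""
    return min(candidates, key=lambda c: fn(c)["loss"])

candidates = range(3, 21)
best_depth = toy_fmin(objective, candidates)
wrong_depth = toy_fmin(forgot_the_sign, candidates)

print(best_depth)   # the depth with the highest accuracy
print(wrong_depth)  # the depth with the lowest accuracy
```

For metrics that are already "lower is better", such as RMSE or log loss, the value is returned as the loss without negation.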

Distributed Training

Databricks supports several approaches to distributed model training:

Spark ML (Native Distributed)

Spark ML algorithms are natively distributed. They use the Spark cluster for both data processing and training:

from pyspark.ml.classification import RandomForestClassifier as SparkRF

spark_rf = SparkRF(
    featuresCol="features",
    labelCol="label",
    numTrees=100,
    maxDepth=10
)
model = spark_rf.fit(train_spark_df)  # distributed across workers

pandas UDFs

Pandas UDFs apply a single-node model to partitions of a Spark DataFrame in parallel, which suits distributed inference:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(features: pd.Series) -> pd.Series:
    # model: a fitted single-node model (e.g., scikit-learn), serialized and sent to each worker with the UDF
    return pd.Series(model.predict(features.tolist()))

# Apply predictions in parallel across partitions
predictions_df = spark_df.withColumn("prediction", predict_udf("features"))

Horovod / TorchDistributor

For deep learning, Databricks integrates with distributed training frameworks:

from pyspark.ml.torch.distributor import TorchDistributor

def train_fn():
    import torch
    # Placeholder model; replace with your own architecture
    model = torch.nn.Linear(10, 1)
    # Training logic here
    return model

distributor = TorchDistributor(
    num_processes=4,
    local_mode=False,
    use_gpu=True
)
trained_model = distributor.run(train_fn)
💡
Exam tip: Know when to use each approach. SparkTrials = distribute many small model trainings (hyperparameter tuning). Spark ML = natively distributed algorithms on Spark DataFrames. TorchDistributor/Horovod = distribute a single large model training across GPUs.

Practice Questions


Question 1 — MLflow Tracking

Q1
A data scientist enables mlflow.autolog() and then calls model.fit(X_train, y_train) inside a with mlflow.start_run(): block. They also want to log a custom metric called "business_value". What happens?

A) Autolog creates a separate run; the custom metric is logged to the parent run
B) Autolog logs to the active run context; the custom metric can be logged in the same run
C) An error occurs because autolog and manual logging cannot be combined
D) The custom metric overwrites the autolog metrics

Answer: B — When mlflow.start_run() is active, autolog logs to that active run rather than creating a new one. The data scientist can then also call mlflow.log_metric("business_value", value) within the same run context. Autolog and manual logging are fully compatible within the same run.

Question 2 — Hyperopt

Q2
A team is using Hyperopt with SparkTrials(parallelism=8) on a cluster with 4 workers. They run fmin(..., max_evals=100). How are the 100 trials distributed?

A) 25 trials per worker, all running simultaneously
B) Up to 8 trials run in parallel across the 4 workers, with new trials starting as previous ones complete
C) 100 trials run sequentially on the driver node
D) An error because parallelism exceeds worker count

Answer: B — SparkTrials distributes individual trial executions as Spark tasks. With parallelism=8, up to 8 trials run concurrently (each worker can run multiple tasks). As each trial completes, a new one is scheduled. The TPE algorithm uses completed trial results to inform the next trial's parameters, so higher parallelism can slightly reduce optimization effectiveness: each new trial is proposed with fewer finished results to learn from.
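
The scheduling behavior in this answer can be imitated with a local thread pool. This is only an analogy under stated assumptions: threads stand in for Spark tasks, time.sleep stands in for model training, and the constant names are hypothetical. It does not use Spark or Hyperopt.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

PARALLELISM = 8   # plays the role of SparkTrials(parallelism=8)
MAX_EVALS = 100   # plays the role of fmin(..., max_evals=100)

lock = threading.Lock()
state = {"running": 0, "peak": 0}

def run_trial(trial_id):
    """Pretend to train one model while tracking how many trials overlap."""
    with lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.01)  # stand-in for model training time
    with lock:
        state["running"] -= 1
    return trial_id

# The pool starts a new trial whenever one of its 8 slots frees up.
with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
    completed = list(pool.map(run_trial, range(MAX_EVALS)))

print(f"{len(completed)} trials finished, at most {state['peak']} at once")
```

All 100 trials complete, but never more than 8 overlap. The real SparkTrials adds one thing this sketch omits: the parameters for each new trial are proposed by TPE from whichever results have finished so far.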

Question 3 — AutoML

Q3
A team runs automl.classify(dataset=df, target_col="label", timeout_minutes=60). Which of the following is TRUE about the output?

A) A single optimized model is returned with no visibility into the training process
B) Editable source notebooks are generated for the best trials, and all runs are logged to MLflow
C) The model is automatically deployed to a serving endpoint
D) Only one algorithm (XGBoost) is tested with different hyperparameters

Answer: B — Databricks AutoML is a glass-box solution. It generates editable Python notebooks for each of the best trials, logs all experiments to MLflow, and creates a data exploration notebook. Multiple algorithms (XGBoost, LightGBM, sklearn) are tested. The model is NOT automatically deployed.

Question 4 — Distributed Training

Q4
A data scientist needs to tune 200 hyperparameter combinations of a scikit-learn model on a Databricks cluster with 10 workers. Which approach is most efficient?

A) Use a for loop to train all 200 models on the driver
B) Use Hyperopt with SparkTrials to distribute trials across workers
C) Convert the scikit-learn model to Spark ML and use CrossValidator
D) Use TorchDistributor to distribute the training

Answer: B — SparkTrials distributes single-machine model training (like scikit-learn) across Spark workers. Each trial trains one model independently as a Spark task. Option A wastes the cluster. Option C requires rewriting the model in Spark ML. Option D is for distributed deep learning, not hyperparameter search of scikit-learn models.

Question 5 — MLflow Logging

Q5
What is the correct way to log training loss at each epoch in MLflow?

A) mlflow.log_metric("loss", loss_value) called once after training
B) mlflow.log_metric("loss", loss_value, step=epoch) called at each epoch
C) mlflow.log_param("loss_epoch_" + str(epoch), loss_value) at each epoch
D) mlflow.log_metrics({"loss": [loss_1, loss_2, ...]}) after training

Answer: B — The step parameter in mlflow.log_metric() enables tracking metric values over iterations (epochs, steps, etc.). This creates a metric history that can be visualized as a line chart in the MLflow UI. Option A only stores the final value. Option C misuses parameters, which are meant for input settings rather than measured results. Option D fails because mlflow.log_metrics() expects one numeric value per metric name, not a list.