Model Training & Tuning
Master MLflow experiment tracking, Databricks AutoML, Hyperopt for hyperparameter tuning, and distributed training patterns — covering approximately 30% of the exam (Experimentation domain).
MLflow Experiment Tracking
MLflow is the backbone of ML experimentation on Databricks. The exam heavily tests your knowledge of the MLflow Tracking API, including runs, parameters, metrics, artifacts, and experiment organization.
Core MLflow Concepts
- Experiment — A named container for related runs (e.g., one experiment per ML project or model)
- Run — A single execution of ML code that logs parameters, metrics, and artifacts
- Parameters — Input settings (e.g., learning rate, number of trees)
- Metrics — Output measurements (e.g., accuracy, RMSE, F1 score)
- Artifacts — Output files (e.g., model files, plots, data samples)
- Tags — Key-value metadata for organizing and searching runs
Logging an Experiment Run
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Set the experiment (created if it does not exist)
mlflow.set_experiment("/Users/team/churn-prediction")

# X_train, X_test, y_train, y_test are assumed to be defined already
with mlflow.start_run(run_name="rf-baseline") as run:
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("min_samples_split", 5)

    # Train model with the same values that were logged
    rf = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=5)
    rf.fit(X_train, y_train)

    # Log metrics
    y_pred = rf.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log the model
    mlflow.sklearn.log_model(rf, "model")

    # Log artifacts (the file must already exist on local disk)
    mlflow.log_artifact("confusion_matrix.png")

    print(f"Run ID: {run.info.run_id}")
Exam tip: Know the difference between mlflow.log_param() (single value) and mlflow.log_params() (dictionary of values). The same applies to mlflow.log_metric() vs mlflow.log_metrics(). Also know that log_metric() accepts an optional step parameter for tracking metrics over training iterations.
MLflow Autologging
Databricks supports automatic logging for popular frameworks:
# Enable autologging for scikit-learn
mlflow.sklearn.autolog()
# Enable autologging for all supported frameworks
mlflow.autolog()
# Now any model training automatically logs params, metrics, and artifacts
rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train) # automatically logged!
Exam tip: When mlflow.autolog() is enabled and no run is active, it creates a new run for each fit() call. If you need to log additional custom metrics or artifacts, wrap the training in mlflow.start_run() so that autologging and your manual calls write to the same run. The exam tests this interaction.
Databricks AutoML
Databricks AutoML automatically prepares data, trains multiple models, and generates notebooks with the best-performing approaches. It is a glass-box solution — all code is visible and editable.
Using AutoML
from databricks import automl

# Classification
summary = automl.classify(
    dataset=train_df,
    target_col="churned",
    primary_metric="f1",
    timeout_minutes=30,
    max_trials=50  # accepted only on older ML runtimes; omit it if your runtime rejects the argument
)

# Access the best trial
best_trial = summary.best_trial
print(f"Best F1: {best_trial.metrics['test_f1_score']}")
print(f"Best run: {best_trial.mlflow_run_id}")
AutoML Key Features
- Glass-box approach — Generates editable notebooks for the best trials, not black-box models
- Automatic feature engineering — Handles missing values, one-hot encoding, and feature selection
- Multiple algorithms — Tests XGBoost, LightGBM, sklearn, and other frameworks automatically
- MLflow integration — All trials are logged to MLflow for comparison
- Data exploration notebook — Generates a summary statistics and visualization notebook
Exam tip: AutoML exposes three functions: automl.classify(), automl.regress(), and automl.forecast(). Know which primary_metric options are available for each (e.g., "f1" for classification, "rmse" for regression).
Hyperopt for Hyperparameter Tuning
Hyperopt is the recommended library for distributed hyperparameter optimization on Databricks. It uses Bayesian optimization (Tree of Parzen Estimators) to efficiently search the parameter space.
Core Hyperopt Components
- fmin() — The main function that minimizes an objective function
- hp.choice() — Choose from a list of discrete options
- hp.uniform() — Uniform distribution over a continuous range
- hp.loguniform() — Log-uniform distribution (for learning rates)
- hp.quniform() — Quantized uniform (for integer parameters)
- SparkTrials — Distributes trials across Spark workers
- Trials — Tracks results on a single machine
Distributed Hyperopt Example
import mlflow
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test are assumed to be defined already
def objective(params):
    # Each trial is logged as a nested run under the parent run below
    with mlflow.start_run(nested=True):
        rf = RandomForestClassifier(
            # quniform returns floats, so cast to int
            n_estimators=int(params["n_estimators"]),
            max_depth=int(params["max_depth"]),
            min_samples_split=int(params["min_samples_split"])
        )
        rf.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, rf.predict(X_test))
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", accuracy)
    # Hyperopt MINIMIZES, so return negative accuracy
    return {"loss": -accuracy, "status": STATUS_OK}

search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth": hp.quniform("max_depth", 3, 20, 1),
    "min_samples_split": hp.quniform("min_samples_split", 2, 10, 1)
}

# Distribute across Spark workers
spark_trials = SparkTrials(parallelism=4)

with mlflow.start_run(run_name="hyperopt-tuning"):
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=50,
        trials=spark_trials
    )
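The sign flip in the objective matters because the optimizer always looks for the lowest loss. The sketch below illustrates this with an exhaustive search as a simple stand-in for TPE; minimize() and toy_accuracy() are hypothetical helpers, not Hyperopt APIs.

```python
def toy_accuracy(max_depth):
    # Made-up score that peaks at max_depth == 8.
    return 0.95 - 0.01 * abs(max_depth - 8)

def objective(max_depth):
    # Return a loss: the negated accuracy.
    return -toy_accuracy(max_depth)

def minimize(fn, candidates):
    """Exhaustive stand-in for fmin(): return the candidate with the lowest loss."""
    best_candidate = min(candidates, key=fn)
    return best_candidate, fn(best_candidate)

depths = list(range(3, 21))

# Correct: minimizing the negated accuracy finds the most accurate setting.
best_depth, best_loss = minimize(objective, depths)
best_accuracy = -best_loss

# Mistake: minimizing raw accuracy finds the least accurate setting.
worst_depth, _ = minimize(toy_accuracy, depths)

print(best_depth, best_accuracy, worst_depth)
```

Returning the raw accuracy as the loss would steer the search toward the worst configuration, which is why the objective negates it.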
Exam tip: fmin() always minimizes the loss. If you want to maximize accuracy, return -accuracy as the loss. This is one of the most commonly tested Hyperopt concepts. Also note that SparkTrials distributes trials of single-machine models across workers, while Trials runs every trial on the driver.
Distributed Training
Databricks supports several approaches to distributed model training:
Spark ML (Native Distributed)
Spark ML algorithms are natively distributed. They use the Spark cluster for both data processing and training:
from pyspark.ml.classification import RandomForestClassifier as SparkRF

spark_rf = SparkRF(
    featuresCol="features",
    labelCol="label",
    numTrees=100,
    maxDepth=10
)
model = spark_rf.fit(train_spark_df)  # distributed across workers
pandas UDFs
Run single-node model code on partitions of data in parallel. Here a trained model scores each batch of rows:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# `model` is assumed to be a fitted single-node model (e.g., scikit-learn),
# not the Spark ML model from the previous example
@pandas_udf("double")
def predict_udf(features: pd.Series) -> pd.Series:
    # The model is captured in the UDF's closure and serialized to each worker
    return pd.Series(model.predict(features.tolist()))

# Apply predictions in parallel across partitions
predictions_df = spark_df.withColumn("prediction", predict_udf("features"))
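The UDF above scores rows with an already trained model. Training one small model per group is commonly expressed with groupBy(...).applyInPandas(...) instead. The sketch below assumes hypothetical columns store, x, and y with made-up data, and it exercises the per-group function on a local pandas DataFrame so that it runs without a cluster.

```python
import numpy as np
import pandas as pd

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit y = slope * x + intercept for one group and return a one-row summary."""
    slope, intercept = np.polyfit(pdf["x"], pdf["y"], deg=1)
    return pd.DataFrame(
        {"store": [pdf["store"].iloc[0]], "slope": [slope], "intercept": [intercept]}
    )

# Made-up data: store "a" follows y = 2x + 1, store "b" follows y = -x.
sales = pd.DataFrame({
    "store": ["a"] * 4 + ["b"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0] * 2,
    "y": [1.0, 3.0, 5.0, 7.0, 0.0, -1.0, -2.0, -3.0],
})

# Local stand-in for what Spark would do on each group.
local_result = pd.concat(
    [fit_group(group) for _, group in sales.groupby("store")], ignore_index=True
)
print(local_result)

# On a Spark DataFrame with the same columns, the equivalent call would be:
# spark_df.groupBy("store").applyInPandas(
#     fit_group, schema="store string, slope double, intercept double"
# )
```

Each group is handed to the function as an ordinary pandas DataFrame, so any single-node library can be used inside it.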
Horovod / TorchDistributor
For deep learning, Databricks integrates with distributed training frameworks:
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn():
    import torch
    # Placeholder model for illustration; build your own model here
    model = torch.nn.Linear(10, 1)
    # Training logic here
    return model

distributor = TorchDistributor(
    num_processes=4,
    local_mode=False,
    use_gpu=True
)
trained_model = distributor.run(train_fn)
Practice Questions
Question 1 — MLflow Tracking
A data scientist enables mlflow.autolog() and then calls model.fit(X_train, y_train) inside a with mlflow.start_run(): block. They also want to log a custom metric called "business_value". What happens?
A) Autolog creates a separate run; the custom metric is logged to the parent run
B) Autolog logs to the active run context; the custom metric can be logged in the same run
C) An error occurs because autolog and manual logging cannot be combined
D) The custom metric overwrites the autolog metrics
Answer: B — When mlflow.start_run() is active, autolog logs to that active run rather than creating a new one. The data scientist can then also call mlflow.log_metric("business_value", value) within the same run context. Autolog and manual logging are fully compatible within the same run.
Question 2 — Hyperopt
A data scientist uses SparkTrials(parallelism=8) on a cluster with 4 workers. They run fmin(..., max_evals=100). How are the 100 trials distributed?
A) 25 trials per worker, all running simultaneously
B) Up to 8 trials run in parallel across the 4 workers, with new trials starting as previous ones complete
C) 100 trials run sequentially on the driver node
D) An error because parallelism exceeds worker count
Answer: B — SparkTrials distributes individual trial executions as Spark tasks. With parallelism=8, up to 8 trials run concurrently (each worker can run multiple tasks). As each trial completes, a new one is scheduled. The TPE algorithm uses completed trial results to inform the next trial's parameters, so higher parallelism can slightly reduce optimization effectiveness.
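The scheduling behavior in this answer can be pictured with a local thread pool. This is only an analogy: SparkTrials schedules Spark tasks rather than threads, and the trial function and sleep time here are made up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

PARALLELISM = 8   # mirrors SparkTrials(parallelism=8)
MAX_EVALS = 100   # mirrors fmin(..., max_evals=100)

lock = threading.Lock()
stats = {"in_flight": 0, "peak": 0}

def run_trial(trial_id):
    """Pretend to train one model while tracking how many trials overlap."""
    with lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.005)  # stand-in for model training
    with lock:
        stats["in_flight"] -= 1
    return trial_id

# A new trial starts as soon as one of the 8 slots frees up.
with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
    completed = list(pool.map(run_trial, range(MAX_EVALS)))

print(len(completed), stats["peak"])
```

All 100 trials finish, but never more than 8 are in flight at once, regardless of how many machines host the slots.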
Question 3 — AutoML
A data scientist runs automl.classify(dataset=df, target_col="label", timeout_minutes=60). Which of the following is TRUE about the output?
A) A single optimized model is returned with no visibility into the training process
B) Editable source notebooks are generated for the best trials, and all runs are logged to MLflow
C) The model is automatically deployed to a serving endpoint
D) Only one algorithm (XGBoost) is tested with different hyperparameters
Answer: B — Databricks AutoML is a glass-box solution. It generates editable Python notebooks for each of the best trials, logs all experiments to MLflow, and creates a data exploration notebook. Multiple algorithms (XGBoost, LightGBM, sklearn) are tested. The model is NOT automatically deployed.
Question 4 — Distributed Training
A) Use a for loop to train all 200 models on the driver
B) Use Hyperopt with SparkTrials to distribute trials across workers
C) Convert the scikit-learn model to Spark ML and use CrossValidator
D) Use TorchDistributor to distribute the training
Answer: B — SparkTrials distributes single-machine model training (like scikit-learn) across Spark workers. Each trial trains one model independently on a worker. Option A wastes the cluster. Option C requires rewriting the model in Spark ML. Option D is for distributed deep learning, not hyperparameter search of scikit-learn models.
Question 5 — MLflow Logging
A) mlflow.log_metric("loss", loss_value) called once after training
B) mlflow.log_metric("loss", loss_value, step=epoch) called at each epoch
C) mlflow.log_param("loss_epoch_" + str(epoch), loss_value) at each epoch
D) mlflow.log_metrics({"loss": [loss_1, loss_2, ...]}) after training
Answer: B — The step parameter in mlflow.log_metric() enables tracking metric values over iterations (epochs, steps, etc.). This creates a metric history that can be visualized as a line chart in the MLflow UI. Option A only stores the final value. Option C misuses parameters, which are meant for input settings rather than results. Option D fails because log_metrics() expects a scalar value for each key, not a list.