Intermediate

MLflow Projects & Recipes

MLflow Projects provide a standard format for packaging reproducible ML code, while MLflow Recipes (formerly Pipelines) offer opinionated templates for common ML tasks. Together these topics make up ~20% of the certification exam.

MLproject File Structure

An MLflow Project is a directory or Git repository with an MLproject file that defines the project name, environment, and entry points. This YAML file is the key to reproducibility.

# MLproject file (YAML format - must be named exactly "MLproject")

name: my-ml-project

# Environment specification (choose ONE)
conda_env: conda.yaml          # Option 1: Conda environment
# docker_env:                   # Option 2: Docker environment
#   image: my-ml-image:latest
#   volumes: ["/data:/data"]
#   environment: ["MLFLOW_TRACKING_URI"]
# python_env: python_env.yaml   # Option 3: virtualenv (lightweight)

# Entry points define the commands that can be run
entry_points:
  main:
    parameters:
      n_estimators: {type: int, default: 100}
      max_depth: {type: int, default: 5}
      data_path: {type: str, default: "data/train.csv"}
    command: "python train.py --n_estimators {n_estimators} --max_depth {max_depth} --data {data_path}"

  validate:
    parameters:
      model_uri: {type: str}
      test_data: {type: str, default: "data/test.csv"}
    command: "python validate.py --model-uri {model_uri} --test-data {test_data}"

  preprocess:
    command: "python preprocess.py"
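MLflow substitutes each {param} placeholder in the command string with the value passed at run time, falling back to the declared default when a parameter is omitted. Conceptually the expansion behaves like Python's str.format — a simplified sketch for intuition, not MLflow's actual implementation (which also validates parameter types and resolves paths):

```python
# Simplified sketch of entry-point command expansion (illustrative only).
def expand_command(command_template, declared_params, passed_params):
    values = {}
    for name, spec in declared_params.items():
        if name in passed_params:
            values[name] = passed_params[name]          # caller-supplied value
        elif "default" in spec:
            values[name] = spec["default"]              # fall back to default
        else:
            raise ValueError(f"Missing required parameter: {name}")
    return command_template.format(**values)

declared = {
    "n_estimators": {"type": "int", "default": 100},
    "max_depth": {"type": "int", "default": 5},
}
cmd = expand_command(
    "python train.py --n_estimators {n_estimators} --max_depth {max_depth}",
    declared,
    {"n_estimators": 200},  # max_depth falls back to its default of 5
)
print(cmd)  # python train.py --n_estimators 200 --max_depth 5
```

Note that a parameter without a default (like model_uri in the validate entry point above) becomes required: the run fails if it is not supplied.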
💡
Exam tip: The MLproject file must be named exactly MLproject (case-sensitive, no extension). It uses YAML format. Know the three environment options: conda_env, docker_env, and python_env.
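The python_env option points at a python_env.yaml file, which MLflow uses to build a virtualenv without requiring conda. A minimal example (the package versions shown are illustrative):

```yaml
# python_env.yaml - virtualenv-based environment specification
python: "3.10"
# Packages needed to build/install the dependencies (installed first)
build_dependencies:
  - pip
  - setuptools
  - wheel
# Project dependencies, in pip requirements syntax
dependencies:
  - mlflow==2.9
  - scikit-learn==1.3
```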

Conda Environment Specification

The conda.yaml file defines the exact dependencies for reproducing the project environment.

# conda.yaml - Conda environment specification

name: my-ml-env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pip
  - numpy=1.24
  - scikit-learn=1.3
  - pandas=2.0
  - pip:
    - mlflow==2.9
    - xgboost==2.0

# This file is referenced in the MLproject file:
# conda_env: conda.yaml
#
# When running the project, MLflow creates this conda
# environment automatically before executing the entry point
#
# To skip environment creation and run in your current
# environment instead, pass --env-manager local to `mlflow run`

Docker Environment

For more complex environments or when you need system-level dependencies, use a Docker environment specification.

# MLproject file with Docker environment

name: my-docker-project

docker_env:
  image: my-registry/ml-image:v1.0
  volumes:
    - "/local/data:/container/data"
  environment:
    - "MLFLOW_TRACKING_URI"
    - ["AWS_ACCESS_KEY_ID", "my-key-id"]

entry_points:
  main:
    parameters:
      epochs: {type: int, default: 10}
    command: "python train.py --epochs {epochs}"

# Docker environment fields:
# image (required) - Docker image name with optional tag
# volumes (optional) - list of volume mounts
# environment (optional) - env vars to pass to container
#   - "VAR_NAME" passes the host's value
#   - ["VAR_NAME", "value"] sets a specific value
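The two environment entry forms trip people up on the exam. Conceptually they resolve to container environment variables like this — an illustrative sketch of the rule, not MLflow's actual code:

```python
# Illustrative sketch: resolve docker_env "environment" entries into the
# variables passed to the container.
def resolve_env_entries(entries, host_env):
    resolved = {}
    for entry in entries:
        if isinstance(entry, str):
            # "VAR_NAME": copy the host's current value into the container
            resolved[entry] = host_env[entry]
        else:
            # ["VAR_NAME", "value"]: set an explicit value
            name, value = entry
            resolved[name] = value
    return resolved

host = {"MLFLOW_TRACKING_URI": "http://tracking:5000"}
entries = ["MLFLOW_TRACKING_URI", ["AWS_ACCESS_KEY_ID", "my-key-id"]]
print(resolve_env_entries(entries, host))
# {'MLFLOW_TRACKING_URI': 'http://tracking:5000', 'AWS_ACCESS_KEY_ID': 'my-key-id'}
```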

Running Projects

You can run MLflow Projects from local directories, Git repos, or remote URIs using the CLI or Python API.

# CLI: Run a local project
# mlflow run . -P n_estimators=200 -P max_depth=10

# CLI: Run from a Git repo
# mlflow run https://github.com/user/ml-project -P n_estimators=200

# CLI: Run a specific entry point
# mlflow run . -e validate -P model_uri=runs:/abc123/model

# CLI: Run a specific Git branch or commit
# mlflow run https://github.com/user/project --version branch-name

# Python API: Run a local project
import mlflow

mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={"n_estimators": 200, "max_depth": 10},
    experiment_name="my-experiment"
)

# Python API: Run from a Git repo
mlflow.projects.run(
    uri="https://github.com/user/ml-project",
    entry_point="main",
    parameters={"n_estimators": 200},
    version="v1.0"  # Git tag, branch, or commit hash
)

# Python API: Run with a specific backend
mlflow.projects.run(
    uri=".",
    backend="local",       # or "databricks", "kubernetes"
    parameters={"epochs": 50}
)

# EXAM TIP: Know both CLI and Python API for running projects
# -P flag sets parameters in CLI
# --version flag specifies Git branch/tag/commit
# -e flag specifies entry point (defaults to "main")
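As a mnemonic for how those flags compose, here is a small hypothetical helper (not part of MLflow) that assembles the equivalent CLI invocation from Python-API-style options:

```python
# Hypothetical helper: build an `mlflow run` CLI string from run options.
def build_mlflow_run_cmd(uri, entry_point="main", parameters=None, version=None):
    cmd = ["mlflow", "run", uri]
    if entry_point != "main":                # -e defaults to "main"
        cmd += ["-e", entry_point]
    for key, value in (parameters or {}).items():
        cmd += ["-P", f"{key}={value}"]      # -P sets one parameter
    if version is not None:
        cmd += ["--version", version]        # Git branch, tag, or commit
    return " ".join(cmd)

print(build_mlflow_run_cmd(".", parameters={"n_estimators": 200, "max_depth": 10}))
# mlflow run . -P n_estimators=200 -P max_depth=10
print(build_mlflow_run_cmd("https://github.com/user/project", version="v1.0"))
# mlflow run https://github.com/user/project --version v1.0
```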

MLflow Recipes (formerly Pipelines)

MLflow Recipes provide opinionated, modular templates for common ML workflows. They reduce boilerplate and enforce best practices.

# MLflow Recipes - Key Concepts

# Recipes are pre-built ML workflow templates
# Two templates are available: "regression/v1" and "classification/v1";
# the regression recipe is the most common

# Directory structure for an MLflow Recipe:
# my-recipe/
# ├── recipe.yaml          # Recipe configuration
# ├── profiles/
# │   ├── local.yaml       # Local profile settings
# │   └── databricks.yaml  # Databricks profile settings
# ├── steps/
# │   ├── ingest.py        # Data ingestion step
# │   ├── split.py         # Train/test split step
# │   ├── transform.py     # Feature engineering step
# │   ├── train.py         # Model training step
# │   ├── evaluate.py      # Model evaluation step
# │   └── register.py      # Model registration step
# └── notebooks/
#     └── jupyter.ipynb     # Interactive development

# recipe.yaml - Recipe configuration
# {{INGEST_DATA_LOCATION}} is a placeholder filled in from the active profile

recipe: "regression/v1"
target_col: "price"
primary_metric: "root_mean_squared_error"
steps:
  ingest:
    using: "custom"
    location: {{INGEST_DATA_LOCATION}}
    loader_method: load_file_as_dataframe
  split:
    split_ratios: [0.75, 0.125, 0.125]
  transform:
    using: "custom"
    transformer_method: transformer_fn
  train:
    using: "custom"
    estimator_method: estimator_fn
  evaluate:
    validation_criteria:
      - metric: root_mean_squared_error
        threshold: 10000
  register:
    model_name: "housing-price-model"
    allow_non_validated_model: false

# profiles/local.yaml - Profile configuration
# Values here fill the {{...}} placeholders in recipe.yaml

experiment:
  name: "housing-price-experiment"
  tracking_uri: "sqlite:///mlflow.db"
INGEST_DATA_LOCATION: "./data/housing.csv"

# Running a recipe
# from mlflow.recipes import Recipe
# r = Recipe(profile="local")
# r.run()                    # Run all steps
# r.run("train")             # Run up to and including train step
# r.inspect()                # View the recipe DAG
# r.inspect("train")         # Inspect a specific step

# EXAM TIP: Recipes were formerly called "Pipelines"
# Know the standard steps: ingest, split, transform, train, evaluate, register
# Profiles allow different configs for local vs. Databricks
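The evaluate step's validation_criteria act as a quality gate: if any criterion fails, the register step refuses the model (unless allow_non_validated_model is true). A simplified sketch of that gating logic — illustrative only, not MLflow's implementation, and assuming a lower-is-better metric such as RMSE:

```python
# Illustrative sketch of the evaluate -> register gate in a recipe.
# Each criterion passes when the metric is at or below its threshold
# (appropriate for lower-is-better metrics like root_mean_squared_error).
def model_is_validated(metrics, validation_criteria):
    return all(
        metrics[c["metric"]] <= c["threshold"] for c in validation_criteria
    )

criteria = [{"metric": "root_mean_squared_error", "threshold": 10000}]

print(model_is_validated({"root_mean_squared_error": 8500}, criteria))   # True
print(model_is_validated({"root_mean_squared_error": 12000}, criteria))  # False
```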

Practice Questions

Test your understanding of MLflow Projects and Recipes with these exam-style questions.

Question 1

What must the MLflow project configuration file be named?

A) mlflow.yaml

B) MLproject

C) project.yaml

D) MLproject.yaml

Show Answer

B) MLproject — The file must be named exactly MLproject (case-sensitive, no file extension). It uses YAML format despite not having a .yaml extension.

Question 2

Which three environment types are supported in an MLproject file?

A) conda_env, docker_env, pip_env

B) conda_env, docker_env, python_env

C) conda_env, container_env, venv_env

D) pip_env, docker_env, system_env

Show Answer

B) conda_env, docker_env, python_env — These are the three supported environment types. conda_env references a conda.yaml file, docker_env specifies a Docker image, and python_env uses virtualenv for a lightweight alternative.

Question 3

How do you run a specific entry point named "validate" from the CLI?

A) mlflow run . --entry validate

B) mlflow run . -e validate

C) mlflow run validate .

D) mlflow projects run . validate

Show Answer

B) mlflow run . -e validate — The -e flag specifies the entry point. If omitted, it defaults to the "main" entry point. Parameters are passed with -P key=value.

Question 4

In MLflow Recipes, what is the purpose of a "profile"?

A) To define the model architecture

B) To provide environment-specific configuration (local vs. Databricks)

C) To store user authentication credentials

D) To define the recipe steps and their order

Show Answer

B) — Profiles provide environment-specific configuration. For example, profiles/local.yaml might use a SQLite backend and local file paths, while profiles/databricks.yaml would use Databricks-specific settings. The recipe steps are defined in recipe.yaml.

Question 5

What are the standard steps in an MLflow Recipe (in order)?

A) load, preprocess, train, test, deploy

B) ingest, split, transform, train, evaluate, register

C) fetch, clean, feature, model, score, publish

D) read, prepare, fit, predict, save

Show Answer

B) ingest, split, transform, train, evaluate, register — These are the six standard steps in an MLflow Recipe. Each step has a corresponding Python file in the steps/ directory.

Question 6

How do you run an MLflow Project from a specific Git tag using the Python API?

A) mlflow.projects.run(uri="https://github.com/...", tag="v1.0")

B) mlflow.projects.run(uri="https://github.com/...", version="v1.0")

C) mlflow.projects.run(uri="https://github.com/...#v1.0")

D) mlflow.projects.run(uri="https://github.com/.../v1.0")

Show Answer

B) — The version parameter specifies a Git tag, branch name, or commit hash. In the CLI, the equivalent is the --version flag.

Key Takeaways

💡
  • MLproject files must be named exactly "MLproject" (case-sensitive, no extension) and use YAML format
  • Three environment types: conda_env, docker_env, and python_env — choose one per project
  • Entry points define runnable commands with typed parameters and default values
  • Projects can be run from local directories, Git repos, or remote URIs using CLI or Python API
  • MLflow Recipes (formerly Pipelines) are opinionated templates with 6 standard steps
  • Profiles allow different configurations for local vs. cloud environments