MLflow Projects & Recipes
MLflow Projects provide a standard format for packaging reproducible ML code, while MLflow Recipes (formerly Pipelines) offer opinionated templates for common ML tasks. Together these topics make up ~20% of the certification exam.
MLproject File Structure
An MLflow Project is a directory or Git repository with an MLproject file that defines the project name, environment, and entry points. This YAML file is the key to reproducibility.
# MLproject file (YAML format - must be named exactly "MLproject")
name: my-ml-project

# Environment specification (choose ONE)
conda_env: conda.yaml          # Option 1: Conda environment
# docker_env:                  # Option 2: Docker environment
#   image: my-ml-image:latest
#   volumes: ["/data:/data"]
#   environment: ["MLFLOW_TRACKING_URI"]
# python_env: python_env.yaml  # Option 3: virtualenv (lightweight)

# Entry points define the commands that can be run
# (supported parameter types: string, float, path, uri - there is no int type)
entry_points:
  main:
    parameters:
      n_estimators: {type: float, default: 100}
      max_depth: {type: float, default: 5}
      data_path: {type: string, default: "data/train.csv"}
    command: "python train.py --n_estimators {n_estimators} --max_depth {max_depth} --data {data_path}"
  validate:
    parameters:
      model_uri: {type: string}
      test_data: {type: string, default: "data/test.csv"}
    command: "python validate.py --model-uri {model_uri} --test-data {test_data}"
  preprocess:
    command: "python preprocess.py"
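Conceptually, MLflow fills the `{param}` placeholders in an entry point's command with the supplied parameter values, falling back to the declared defaults, much like Python's `str.format`. A minimal sketch of that substitution (the `render_command` helper below is illustrative, not MLflow's internal API):

```python
# Illustrative sketch: how {param} placeholders in an entry-point
# command are filled in from defaults plus user-supplied overrides.
def render_command(command: str, defaults: dict, overrides: dict) -> str:
    params = {**defaults, **overrides}  # overrides win over defaults
    return command.format(**params)

cmd = render_command(
    "python train.py --n_estimators {n_estimators} --max_depth {max_depth}",
    defaults={"n_estimators": 100, "max_depth": 5},
    overrides={"n_estimators": 200},
)
print(cmd)  # python train.py --n_estimators 200 --max_depth 5
```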
Exam tip: the file must be named exactly MLproject (case-sensitive, no extension), even though its contents are YAML. Know the three environment options: conda_env, docker_env, and python_env.
Conda Environment Specification
The conda.yaml file defines the exact dependencies for reproducing the project environment.
# conda.yaml - Conda environment specification
name: my-ml-env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pip
  - numpy=1.24
  - scikit-learn=1.3
  - pandas=2.0
  - pip:
      - mlflow==2.9
      - xgboost==2.0

# This file is referenced in the MLproject file:
#   conda_env: conda.yaml
#
# When running the project, MLflow creates this conda
# environment automatically before executing the entry point
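The third option, python_env, is never shown above; it points to a small YAML file of its own. A minimal sketch following the documented python_env schema (the package pins are illustrative):

```yaml
# python_env.yaml - virtualenv-based environment specification
python: "3.10"         # Python version to use
build_dependencies:    # installed first (build tooling)
  - pip
dependencies:          # project requirements
  - mlflow==2.9
  - scikit-learn==1.3
```

Referenced from the MLproject file as `python_env: python_env.yaml`, this avoids the overhead of a full conda environment.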
Docker Environment
For more complex environments or when you need system-level dependencies, use a Docker environment specification.
# MLproject file with Docker environment
name: my-docker-project

docker_env:
  image: my-registry/ml-image:v1.0
  volumes:
    - "/local/data:/container/data"
  environment:
    - "MLFLOW_TRACKING_URI"
    - ["AWS_ACCESS_KEY_ID", "my-key-id"]

entry_points:
  main:
    parameters:
      epochs: {type: float, default: 10}
    command: "python train.py --epochs {epochs}"

# Docker environment fields:
#   image (required) - Docker image name with optional tag
#   volumes (optional) - list of volume mounts
#   environment (optional) - env vars to pass to container
#     - "VAR_NAME" passes the host's value
#     - ["VAR_NAME", "value"] sets a specific value
Running Projects
You can run MLflow Projects from local directories, Git repos, or remote URIs using the CLI or Python API.
# CLI: Run a local project
# mlflow run . -P n_estimators=200 -P max_depth=10
# CLI: Run from a Git repo
# mlflow run https://github.com/user/ml-project -P n_estimators=200
# CLI: Run a specific entry point
# mlflow run . -e validate -P model_uri=runs:/abc123/model
# CLI: Run a specific Git branch or commit
# mlflow run https://github.com/user/project --version branch-name
# Python API: Run a local project
import mlflow
mlflow.projects.run(
uri=".",
entry_point="main",
parameters={"n_estimators": 200, "max_depth": 10},
experiment_name="my-experiment"
)
# Python API: Run from a Git repo
mlflow.projects.run(
uri="https://github.com/user/ml-project",
entry_point="main",
parameters={"n_estimators": 200},
version="v1.0" # Git tag, branch, or commit hash
)
# Python API: Run with a specific backend
mlflow.projects.run(
uri=".",
backend="local", # or "databricks", "kubernetes"
parameters={"epochs": 50}
)
# EXAM TIP: Know both CLI and Python API for running projects
# -P flag sets parameters in CLI
# --version flag specifies Git branch/tag/commit
# -e flag specifies entry point (defaults to "main")
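The CLI invocations above follow a regular shape, which makes them easy to assemble programmatically. A stdlib-only sketch (the `build_mlflow_run_cmd` helper is hypothetical, not part of MLflow):

```python
import shlex

def build_mlflow_run_cmd(uri, entry_point="main", params=None, version=None):
    """Assemble an `mlflow run` CLI invocation (illustrative helper)."""
    cmd = ["mlflow", "run", uri, "-e", entry_point]
    if version:
        cmd += ["--version", version]    # Git branch/tag/commit
    for key, value in (params or {}).items():
        cmd += ["-P", f"{key}={value}"]  # -P sets one parameter
    return shlex.join(cmd)

print(build_mlflow_run_cmd(".", params={"n_estimators": 200, "max_depth": 10}))
# mlflow run . -e main -P n_estimators=200 -P max_depth=10
```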
MLflow Recipes (formerly Pipelines)
MLflow Recipes provide opinionated, modular templates for common ML workflows. They reduce boilerplate and enforce best practices.
# MLflow Recipes - Key Concepts
# Recipes are pre-built ML workflow templates
# The most common is the "regression" recipe
# Directory structure for an MLflow Recipe:
# my-recipe/
# ├── recipe.yaml            # Recipe configuration
# ├── profiles/
# │   ├── local.yaml         # Local profile settings
# │   └── databricks.yaml    # Databricks profile settings
# ├── steps/
# │   ├── ingest.py          # Data ingestion step
# │   ├── split.py           # Train/test split step
# │   ├── transform.py       # Feature engineering step
# │   ├── train.py           # Model training step
# │   ├── evaluate.py        # Model evaluation step
# │   └── register.py        # Model registration step
# └── notebooks/
#     └── jupyter.ipynb      # Interactive development
# recipe.yaml example
recipe_config = """
recipe: "regression/v1"
target_col: "price"
positive_class: null
primary_metric: "root_mean_squared_error"
steps:
  ingest:
    using: "custom"
    location: "data/housing.csv"
  split:
    split_ratios: [0.75, 0.125, 0.125]
  transform:
    using: "custom"
  train:
    using: "custom"
  evaluate:
    validation_criteria:
      - metric: root_mean_squared_error
        threshold: 10000
  register:
    model_name: "housing-price-model"
    allow_non_validated_model: false
"""

# Profile (profiles/local.yaml) example
profile_config = """
experiment:
  name: "housing-price-experiment"
  tracking_uri: "sqlite:///mlflow.db"

INGEST_DATA_LOCATION: "./data/housing.csv"
"""
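A recipe.yaml can also reference profile values through Jinja-style placeholders such as `{{INGEST_DATA_LOCATION}}`; conceptually, MLflow renders the template from the active profile before parsing the recipe. A stdlib-only sketch of that substitution (illustrative, not MLflow's actual Jinja2 rendering):

```python
import re

# Active profile values (mirrors profiles/local.yaml above)
profile = {"INGEST_DATA_LOCATION": "./data/housing.csv"}

# A recipe.yaml fragment referencing a profile variable
template = 'location: "{{INGEST_DATA_LOCATION}}"'

# Replace each {{VAR}} placeholder with the profile's value
rendered = re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: profile[m.group(1)], template)
print(rendered)  # location: "./data/housing.csv"
```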
# Running a recipe
# from mlflow.recipes import Recipe
# r = Recipe(profile="local")
# r.run() # Run all steps
# r.run("train") # Run up to and including train step
# r.inspect() # View the recipe DAG
# r.inspect("train") # Inspect a specific step
# EXAM TIP: Recipes are formerly called "Pipelines"
# Know the standard steps: ingest, split, transform, train, evaluate, register
# Profiles allow different configs for local vs. Databricks
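A handy mental model for `r.run(step)`: the recipe executes the standard steps in order and stops after the named step. A small sketch of that behavior (illustrative only, not the Recipe API):

```python
# The six standard steps of the regression recipe, in execution order
STEPS = ["ingest", "split", "transform", "train", "evaluate", "register"]

def steps_to_run(target=None):
    """Steps executed by r.run(target): everything up to and including it."""
    if target is None:
        return list(STEPS)  # r.run() with no argument runs the full recipe
    return STEPS[: STEPS.index(target) + 1]

print(steps_to_run("train"))  # ['ingest', 'split', 'transform', 'train']
```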
Practice Questions
Test your understanding of MLflow Projects and Recipes with these exam-style questions.
Question 1
What must the MLflow project configuration file be named?
A) mlflow.yaml
B) MLproject
C) project.yaml
D) MLproject.yaml
Answer: B) MLproject — The file must be named exactly MLproject (case-sensitive, no file extension). It uses YAML format despite not having a .yaml extension.
Question 2
Which three environment types are supported in an MLproject file?
A) conda_env, docker_env, pip_env
B) conda_env, docker_env, python_env
C) conda_env, container_env, venv_env
D) pip_env, docker_env, system_env
Answer: B) conda_env, docker_env, python_env — These are the three supported environment types. conda_env references a conda.yaml file, docker_env specifies a Docker image, and python_env uses virtualenv for a lightweight alternative.
Question 3
How do you run a specific entry point named "validate" from the CLI?
A) mlflow run . --entry validate
B) mlflow run . -e validate
C) mlflow run validate .
D) mlflow projects run . validate
Answer: B) mlflow run . -e validate — The -e flag specifies the entry point. If omitted, it defaults to the "main" entry point. Parameters are passed with -P key=value.
Question 4
In MLflow Recipes, what is the purpose of a "profile"?
A) To define the model architecture
B) To provide environment-specific configuration (local vs. Databricks)
C) To store user authentication credentials
D) To define the recipe steps and their order
Answer: B) — Profiles provide environment-specific configuration. For example, profiles/local.yaml might use a SQLite backend and local file paths, while profiles/databricks.yaml would use Databricks-specific settings. The recipe steps themselves are defined in recipe.yaml.
Question 5
What are the standard steps in an MLflow Recipe (in order)?
A) load, preprocess, train, test, deploy
B) ingest, split, transform, train, evaluate, register
C) fetch, clean, feature, model, score, publish
D) read, prepare, fit, predict, save
Answer: B) ingest, split, transform, train, evaluate, register — These are the six standard steps in an MLflow Recipe. Each step has a corresponding Python file in the steps/ directory.
Question 6
How do you run an MLflow Project from a specific Git tag using the Python API?
A) mlflow.projects.run(uri="https://github.com/...", tag="v1.0")
B) mlflow.projects.run(uri="https://github.com/...", version="v1.0")
C) mlflow.projects.run(uri="https://github.com/...#v1.0")
D) mlflow.projects.run(uri="https://github.com/.../v1.0")
Answer: B) — The version parameter specifies a Git tag, branch name, or commit hash. In the CLI, the equivalent is the --version flag.
Key Takeaways
- MLproject files must be named exactly "MLproject" (case-sensitive, no extension) and use YAML format
- Three environment types: conda_env, docker_env, and python_env — choose one per project
- Entry points define runnable commands with typed parameters and default values
- Projects can be run from local directories, Git repos, or remote URIs using CLI or Python API
- MLflow Recipes (formerly Pipelines) are opinionated templates with 6 standard steps
- Profiles allow different configurations for local vs. cloud environments
Lilly Tech Systems