Intermediate

MLflow Projects

Package ML code for reproducible, reusable experiments with MLproject files, environment specs, and entry points.

What are MLflow Projects?

An MLflow Project is a format for packaging ML code in a reusable and reproducible way. It specifies the code, its dependencies, and entry points with parameters, so anyone can run the exact same experiment.
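Concretely, a project is just a directory (or Git repository) with an MLproject file at its root. A typical layout might look like this (illustrative — only the MLproject file name is fixed; the script and data names are up to you):

```
customer-churn-prediction/
├── MLproject          # entry points and environment reference
├── conda.yaml         # dependency specification
├── train.py           # code invoked by the main entry point
├── validate.py
├── preprocess.py
└── data/
    └── train.csv
```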

The MLproject File

YAML — MLproject
name: customer-churn-prediction

conda_env: conda.yaml
# OR: docker_env:
#       image: my-ml-image:latest

entry_points:
  main:
    parameters:
      # MLflow supports four parameter types: string, float, path, uri
      data_path: {type: string, default: "data/train.csv"}
      n_estimators: {type: float, default: 100}
      max_depth: {type: float, default: 10}
      learning_rate: {type: float, default: 0.1}
    command: "python train.py --data-path {data_path} --n-estimators {n_estimators} --max-depth {max_depth} --lr {learning_rate}"

  validate:
    parameters:
      model_uri: {type: string}
      test_data: {type: string, default: "data/test.csv"}
    command: "python validate.py --model-uri {model_uri} --test-data {test_data}"

  preprocess:
    parameters:
      raw_data: {type: string}
      output_path: {type: string, default: "data/processed"}
    command: "python preprocess.py --raw-data {raw_data} --output {output_path}"
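Conceptually, MLflow validates each parameter against its declared type, merges defaults with any values you pass in, and substitutes the results into the command template — much like Python's str.format. A simplified sketch (this mimics the behavior; it is not MLflow's actual implementation):

```python
# Simplified sketch of how an entry-point command is assembled.
# Parameter values always reach your script as command-line strings.

command_template = (
    "python train.py --data-path {data_path} --n-estimators {n_estimators} "
    "--max-depth {max_depth} --lr {learning_rate}"
)

# Defaults from the MLproject file...
params = {"data_path": "data/train.csv", "n_estimators": 100,
          "max_depth": 10, "learning_rate": 0.1}
# ...overridden by values passed on the command line, e.g. -P n_estimators=200
params.update({"n_estimators": 200, "max_depth": 15})

command = command_template.format(**params)
print(command)
# python train.py --data-path data/train.csv --n-estimators 200 --max-depth 15 --lr 0.1
```

Because substitution produces a plain shell command, your script (train.py here) is responsible for parsing each value back into the type it needs, e.g. with argparse.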

Environment Specification

Conda Environment

YAML — conda.yaml
name: churn-prediction
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
    - mlflow>=2.10
    - scikit-learn>=1.4
    - pandas>=2.2
    - numpy>=1.26
    - xgboost>=2.0
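If you prefer virtualenv over conda, MLflow also accepts a python_env.yaml, referenced from the MLproject file with a python_env: python_env.yaml key instead of conda_env. An illustrative fragment (field names per MLflow's virtualenv support):

```yaml
# python_env.yaml
python: "3.11"
build_dependencies:
  - pip
dependencies:
  - mlflow>=2.10
  - scikit-learn>=1.4
  - pandas>=2.2
```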

Docker Environment

YAML — MLproject with Docker
name: deep-learning-project

docker_env:
  image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
  volumes: ["/data/datasets:/datasets"]
  environment: [["CUDA_VISIBLE_DEVICES", "0,1"]]

entry_points:
  main:
    parameters:
      epochs: {type: int, default: 50}
      batch_size: {type: int, default: 64}
    command: "python train.py --epochs {epochs} --batch-size {batch_size}"

Running Projects

Shell — Running MLflow projects
# Run from local directory
mlflow run . -P n_estimators=200 -P max_depth=15

# Run a specific entry point
mlflow run . -e validate -P model_uri="runs:/abc123/model"

# Run from GitHub
mlflow run https://github.com/user/ml-project -P learning_rate=0.05

# Run with a specific Git branch or tag
mlflow run https://github.com/user/ml-project -v feature-branch

# Run with a specific experiment
mlflow run . --experiment-name "production-training"

Running Projects Programmatically

Python — Running projects from code
import mlflow

# Run a local project
run = mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={
        "n_estimators": 200,
        "max_depth": 15,
        "learning_rate": 0.05,
    },
    experiment_name="churn-prediction",
)

print(f"Run ID: {run.run_id}")

# Run from GitHub
run = mlflow.projects.run(
    uri="https://github.com/user/ml-project",
    version="v2.0",
    parameters={"epochs": 100},
)

Chaining Projects

Python — Multi-step workflow
import mlflow

with mlflow.start_run(run_name="full-pipeline") as parent_run:
    # Step 1: Preprocess data
    preprocess_run = mlflow.projects.run(
        uri=".",
        entry_point="preprocess",
        parameters={"raw_data": "s3://data/raw"},
    )

    # Step 2: Train model using preprocessed data
    train_run = mlflow.projects.run(
        uri=".",
        entry_point="main",
        parameters={
            "data_path": "data/processed/train.csv",
            "n_estimators": 200,
        },
    )

    # Step 3: Validate the trained model
    model_uri = f"runs:/{train_run.run_id}/model"
    validate_run = mlflow.projects.run(
        uri=".",
        entry_point="validate",
        parameters={"model_uri": model_uri},
    )
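The model_uri handed from the training step to the validation step follows MLflow's runs:/<run_id>/<artifact_path> scheme. A small helper (hypothetical — not part of the MLflow API) to build and sanity-check such URIs before passing them along:

```python
import re

# Hypothetical helper -- not part of the MLflow API.
RUNS_URI_PATTERN = re.compile(r"^runs:/(?P<run_id>[A-Za-z0-9]+)/(?P<path>.+)$")

def model_uri_for(run_id: str, artifact_path: str = "model") -> str:
    """Build a runs:/ URI pointing at an artifact logged in a run."""
    uri = f"runs:/{run_id}/{artifact_path}"
    if not RUNS_URI_PATTERN.match(uri):
        raise ValueError(f"Malformed model URI: {uri}")
    return uri

print(model_uri_for("abc123"))  # runs:/abc123/model
```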
💡
Reproducibility guarantee: MLflow Projects capture the code (the exact Git commit is recorded when you run from a Git URI), the environment (via conda.yaml or a Docker image), and the parameters (via typed entry points). Anyone can rerun the exact same experiment, even months later.