Intermediate
MLflow Projects
Package ML code for reproducible, reusable experiments with MLproject files, environment specs, and entry points.
What are MLflow Projects?
An MLflow Project is a format for packaging ML code in a reusable and reproducible way. It specifies the code, its dependencies, and entry points with parameters, so anyone can run the exact same experiment.
The MLproject File
YAML — MLproject
name: customer-churn-prediction
conda_env: conda.yaml
# OR: docker_env:
#   image: my-ml-image:latest

entry_points:
  main:
    parameters:
      data_path: {type: string, default: "data/train.csv"}
      n_estimators: {type: float, default: 100}
      max_depth: {type: float, default: 10}
      learning_rate: {type: float, default: 0.1}
    command: "python train.py --data-path {data_path} --n-estimators {n_estimators} --max-depth {max_depth} --lr {learning_rate}"
  validate:
    parameters:
      model_uri: {type: string}
      test_data: {type: string, default: "data/test.csv"}
    command: "python validate.py --model-uri {model_uri} --test-data {test_data}"
  preprocess:
    parameters:
      raw_data: {type: string}
      output_path: {type: string, default: "data/processed"}
    command: "python preprocess.py --raw-data {raw_data} --output {output_path}"
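When an entry point runs, MLflow builds the shell command by substituting each {param} placeholder in the command template with the supplied value, falling back to the declared default. A minimal stdlib sketch of that substitution (a simplification for illustration, not MLflow's actual implementation; expand_command is a hypothetical helper):

```python
def expand_command(template: str, defaults: dict, overrides: dict) -> str:
    """Fill {param} placeholders: overrides win, defaults fill the rest."""
    params = {**defaults, **overrides}
    return template.format(**params)

template = ("python train.py --data-path {data_path} "
            "--n-estimators {n_estimators}")
defaults = {"data_path": "data/train.csv", "n_estimators": 100}

# Override only n_estimators; data_path falls back to its default.
print(expand_command(template, defaults, {"n_estimators": 200}))
# → python train.py --data-path data/train.csv --n-estimators 200
```

This is why every parameter you pass with -P must appear in the entry point's parameters block: undeclared names have no placeholder to fill.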
Environment Specification
Conda Environment
YAML — conda.yaml
name: churn-prediction
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
      - mlflow>=2.10
      - scikit-learn>=1.4
      - pandas>=2.2
      - numpy>=1.26
      - xgboost>=2.0
Docker Environment
YAML — MLproject with Docker
name: deep-learning-project

docker_env:
  image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
  volumes: ["/data/datasets:/datasets"]
  environment: [["CUDA_VISIBLE_DEVICES", "0,1"]]

entry_points:
  main:
    parameters:
      epochs: {type: float, default: 50}
      batch_size: {type: float, default: 64}
    command: "python train.py --epochs {epochs} --batch-size {batch_size}"
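Each entry in environment can be either a two-element list (set that variable to that value inside the container) or a bare string (forward the variable from the host). The sketch below is a hypothetical simplification of how those entries, together with volumes, could translate into docker run flags; it is not MLflow's actual code:

```python
def docker_flags(volumes: list, environment: list) -> list:
    """Translate docker_env volumes/environment entries into docker run flags."""
    flags = []
    for vol in volumes:
        flags += ["-v", vol]                      # host:container mount
    for item in environment:
        if isinstance(item, list):                # ["NAME", "value"] pair
            flags += ["-e", f"{item[0]}={item[1]}"]
        else:                                     # bare name: copy from host
            flags += ["-e", item]
    return flags

print(docker_flags(["/data/datasets:/datasets"],
                   [["CUDA_VISIBLE_DEVICES", "0,1"]]))
# → ['-v', '/data/datasets:/datasets', '-e', 'CUDA_VISIBLE_DEVICES=0,1']
```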
Running Projects
Shell — Running MLflow projects
# Run from local directory
mlflow run . -P n_estimators=200 -P max_depth=15
# Run a specific entry point
mlflow run . -e validate -P model_uri="runs:/abc123/model"
# Run from GitHub
mlflow run https://github.com/user/ml-project -P learning_rate=0.05
# Run with a specific Git branch or tag
mlflow run https://github.com/user/ml-project -v feature-branch
# Run with a specific experiment
mlflow run . --experiment-name "production-training"
Running Projects Programmatically
Python — Running projects from code
import mlflow

# Run a local project
run = mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={
        "n_estimators": 200,
        "max_depth": 15,
        "learning_rate": 0.05,
    },
    experiment_name="churn-prediction",
)
print(f"Run ID: {run.run_id}")

# Run from GitHub
run = mlflow.projects.run(
    uri="https://github.com/user/ml-project",
    version="v2.0",
    parameters={"epochs": 100},
)
Chaining Projects
Python — Multi-step workflow
import mlflow

with mlflow.start_run(run_name="full-pipeline") as parent_run:
    # Step 1: Preprocess data
    preprocess_run = mlflow.projects.run(
        uri=".",
        entry_point="preprocess",
        parameters={"raw_data": "s3://data/raw"},
    )

    # Step 2: Train model using preprocessed data
    train_run = mlflow.projects.run(
        uri=".",
        entry_point="main",
        parameters={
            "data_path": "data/processed/train.csv",
            "n_estimators": 200,
        },
    )

    # Step 3: Validate the trained model
    model_uri = f"runs:/{train_run.run_id}/model"
    validate_run = mlflow.projects.run(
        uri=".",
        entry_point="validate",
        parameters={"model_uri": model_uri},
    )
Reproducibility guarantee: MLflow Projects capture the code (via Git commit), the environment (via conda.yaml or Docker), and the parameters (via entry points), so others can rerun the same experiment months later. For bit-for-bit reproducibility, pin exact dependency versions rather than using >= ranges.
Lilly Tech Systems