Advanced

Best Practices

Build reproducible, production-ready ML workflows in R with proper deployment, monitoring, and engineering practices.

ML Workflow in R

Define the Problem
Clearly state what you are predicting and why. Choose the right metric.
Explore the Data
Use ggplot2, skimr, and DataExplorer for thorough EDA.
Preprocess
Use recipes for reproducible feature engineering.
Train & Evaluate
Use cross-validation; never evaluate on the training data.
Tune
Use tune_grid or tune_bayes for hyperparameter optimization.
Deploy
Use plumber or vetiver to serve models as APIs.
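The preprocessing, evaluation, and tuning steps above can be sketched end to end as follows. This is a minimal illustration rather than a recommended configuration: the iris data, the ranger engine, the 5 folds, and the `mtry` grid are all assumptions chosen to keep the example small.

```r
library(tidymodels)

set.seed(42)

# Hold out a stratified test set; it is only used once, at the very end
split <- initial_split(iris, prop = 0.8, strata = Species)
train <- training(split)

# Preprocess: inside a workflow the recipe is re-estimated on each resample
rec <- recipe(Species ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# Random forest with one hyperparameter marked for tuning
spec <- rand_forest(mtry = tune(), trees = 200) |>
  set_engine("ranger") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Train & evaluate: 5-fold cross-validation over a small grid of mtry values
folds <- vfold_cv(train, v = 5, strata = Species)
tuned <- tune_grid(
  wf,
  resamples = folds,
  grid = tibble(mtry = 1:4),
  metrics = metric_set(accuracy, roc_auc)
)

# Refit the best configuration on the training set and score the test set once
best <- select_best(tuned, metric = "accuracy")
final_res <- last_fit(finalize_workflow(wf, best), split)
collect_metrics(final_res)

# The fitted workflow is the object you would save or deploy
final_fit <- extract_workflow(final_res)
```

Keeping the recipe and the model in one workflow means the normalization statistics are always computed from the analysis portion of each fold, never from the held-out rows.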

Reproducibility

# ALWAYS set seeds for reproducibility
set.seed(42)

# Use renv for package versioning
renv::init()
renv::snapshot()

# Log your R session info
sessionInfo()

# Save model artifacts (create the target directory first; saveRDS() will not)
dir.create("models", showWarnings = FALSE)
saveRDS(final_model, "models/rf_model_v1.rds")
model <- readRDS("models/rf_model_v1.rds")
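As a small check of these habits, the sketch below shows that resetting the seed reproduces the same random draw, writes the session info to a file, and round-trips a model through `saveRDS()`/`readRDS()`. The temporary directory, the file names, and the `lm()` model are placeholders for illustration only.

```r
# Same seed, same random draw
set.seed(42)
idx_a <- sample(nrow(iris), 100)
set.seed(42)
idx_b <- sample(nrow(iris), 100)
identical(idx_a, idx_b)

# Store the session info next to the model artifact (hypothetical paths)
out_dir <- file.path(tempdir(), "models")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
info_path <- file.path(out_dir, "session_info.txt")
writeLines(capture.output(sessionInfo()), info_path)

# Round-trip a fitted model through saveRDS()/readRDS()
fit <- lm(Sepal.Length ~ Petal.Length, data = iris[idx_a, ])
rds_path <- file.path(out_dir, "lm_model_v1.rds")
saveRDS(fit, rds_path)
restored <- readRDS(rds_path)
all.equal(coef(fit), coef(restored))
```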

Model Deployment with plumber

R (plumber_api.R)

# plumber_api.R
library(tidymodels)

model <- readRDS("model.rds")

#* Predict species from iris measurements
#* @param sepal_length Sepal length in cm
#* @param sepal_width Sepal width in cm
#* @param petal_length Petal length in cm
#* @param petal_width Petal width in cm
#* @post /predict
function(sepal_length, sepal_width, petal_length, petal_width) {
  new_data <- tibble(
    Sepal.Length = as.numeric(sepal_length),
    Sepal.Width = as.numeric(sepal_width),
    Petal.Length = as.numeric(petal_length),
    Petal.Width = as.numeric(petal_width)
  )
  # Return the predicted class as a plain string rather than a factor
  as.character(predict(model, new_data)$.pred_class)
}

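# Run the API from a separate script or the R console, not from inside
# plumber_api.R (plumb() sources that file, so it must not plumb itself)
library(plumber)
pr <- plumb("plumber_api.R")
pr$run(port = 8000)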
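The handler above passes request fields straight through `as.numeric()`, which returns `NA` (with a warning) for non-numeric input. One possible guard is a small validation helper; `parse_measurement()` is a hypothetical name introduced here, not part of plumber, and the "positive number" rule is an assumption about what counts as a valid measurement.

```r
# Hypothetical helper: parse one request field as a single positive, finite number
parse_measurement <- function(value, name) {
  number <- suppressWarnings(as.numeric(value))
  if (length(number) != 1 || is.na(number) || !is.finite(number) || number <= 0) {
    stop(sprintf("'%s' must be a single positive number", name), call. = FALSE)
  }
  number
}

# Valid input is converted to a number
ok <- parse_measurement("5.1", "sepal_length")

# Invalid input raises an error instead of silently becoming NA
msg <- tryCatch(
  parse_measurement("abc", "sepal_width"),
  error = function(e) conditionMessage(e)
)
msg
```

Inside the endpoint, each `as.numeric(...)` call could be swapped for `parse_measurement(...)`, so that malformed requests fail early instead of reaching `predict()`.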

vetiver for MLOps

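library(vetiver)
library(pins)
library(plumber)

# Create a vetiver model object from a fitted model or workflow
v <- vetiver_model(final_fit, model_name = "iris-classifier")

# Pin the model to a versioned board
board <- board_folder("models", versioned = TRUE)
vetiver_pin_write(board, v)

# Add a /predict endpoint for the model to a plumber router
pr <- plumber::pr() |> vetiver_api(v)

# Write plumber.R for the pinned model, then a Dockerfile that serves it
vetiver_write_plumber(board, "iris-classifier")
vetiver_write_docker(v)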
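Once a model is pinned, it can be listed and read back by name. The sketch below uses a temporary board and a small `lm()` model as stand-ins, so the board, the model, and the name "cars-lm" are illustrative assumptions rather than part of the workflow above.

```r
library(vetiver)
library(pins)

# Temporary versioned board, used here so the sketch leaves no files behind
board <- board_temp(versioned = TRUE)

# A small lm() stands in for the fitted model
fit <- lm(mpg ~ wt + hp, data = mtcars)
v <- vetiver_model(fit, model_name = "cars-lm")
vetiver_pin_write(board, v)

# List the stored versions and read the latest one back
versions <- pin_versions(board, "cars-lm")
restored <- vetiver_pin_read(board, "cars-lm")
restored$model_name
```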

Docker for R ML

Dockerfile

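FROM rocker/r-ver:4.3.0

# System libraries needed by plumber's dependencies (sodium, curl, openssl)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libsodium-dev libcurl4-openssl-dev libssl-dev \
    && rm -rf /var/lib/apt/lists/*

RUN install2.r --error \
    tidymodels ranger plumber vetiver

# plumber_api.R reads "model.rds" by relative path, so run from /app
WORKDIR /app
COPY model.rds /app/model.rds
COPY plumber_api.R /app/plumber_api.R

EXPOSE 8000

CMD ["R", "-e", "plumber::plumb('/app/plumber_api.R')$run(host='0.0.0.0', port=8000)"]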

R vs Python for Production ML

Aspect	R	Python
API framework	plumber	Flask, FastAPI (more mature)
MLOps tooling	vetiver, pins	MLflow, Kubeflow (larger ecosystem)
Docker support	rocker images	Native Python images (smaller)
Cloud support	Limited (growing)	Extensive (SageMaker, Vertex AI)
Model analysis	Excellent (statistical rigor)	Good (SHAP, LIME)

Common Mistakes

Data leakage: Never use test data during preprocessing. Use recipes in a workflow to prevent this.
Not setting seeds: Always use set.seed() before any random operation.
Evaluating on training data: Always use holdout or cross-validation.
Ignoring class imbalance: Use stratified sampling and appropriate metrics (F1, AUC).
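Over-tuning: Reporting the score of the best tuned configuration on the same resamples used to select it gives optimistic estimates. Use nested cross-validation or a separate held-out test set.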
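The class-imbalance advice above can be sketched as follows. The data are simulated purely for illustration (roughly 15% positives), and the logistic regression is an arbitrary choice of model; only the stratified split and the metric set are the point of the example.

```r
library(tidymodels)

set.seed(42)

# Simulated imbalanced data; "yes" is the minority class and the first level
n <- 1000
dat <- tibble(x = rnorm(n), noise = rnorm(n)) |>
  mutate(y = factor(ifelse(x + noise > 1.5, "yes", "no"), levels = c("yes", "no"))) |>
  select(x, y)

# Stratified split keeps the class ratio similar in both partitions
split <- initial_split(dat, prop = 0.75, strata = y)
prop_train <- mean(training(split)$y == "yes")
prop_test  <- mean(testing(split)$y == "yes")

# Fit a simple model and score it with imbalance-aware metrics
fit <- logistic_reg() |> fit(y ~ x, data = training(split))
preds <- augment(fit, new_data = testing(split))

imbalance_metrics <- metric_set(f_meas, roc_auc)
scores <- imbalance_metrics(preds, truth = y, estimate = .pred_class, .pred_yes)
scores
```

Accuracy alone would look high here even for a model that always predicts "no", which is why F1 and AUC are reported instead.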

Frequently Asked Questions

R can be used for production ML. Use plumber to create REST APIs, vetiver for MLOps workflows, and Docker for containerization. For large-scale deployments, some teams serialize R models and serve them via Python, or use Posit Connect.

Use R when statistical rigor, interpretability, and visualization are priorities (research, pharma, academia). Use Python for deep learning, production ML systems, and when integrating with larger engineering stacks. Many teams use both.

For datasets too large to handle comfortably in memory, use data.table or arrow for data loading, Parquet for storage, and DuckDB for SQL queries on large files. For distributed computing, consider sparklyr, which connects R to Apache Spark.
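As a rough sketch of the DuckDB approach, the example below writes a small table to Parquet and then aggregates directly on the file, so only the summary is returned to R. The mtcars data and the temporary file are stand-ins for a genuinely large dataset.

```r
library(DBI)

# In-memory DuckDB connection
con <- dbConnect(duckdb::duckdb())

# Write a small demo table to Parquet (stand-in for a large file)
parquet_path <- normalizePath(
  tempfile(fileext = ".parquet"), winslash = "/", mustWork = FALSE
)
dbWriteTable(con, "cars", mtcars)
dbExecute(con, sprintf("COPY cars TO '%s' (FORMAT PARQUET)", parquet_path))

# Aggregate on the Parquet file; only the grouped result comes back to R
summary_by_cyl <- dbGetQuery(con, sprintf(
  "SELECT cyl, COUNT(*) AS n, AVG(mpg) AS mean_mpg
   FROM read_parquet('%s')
   GROUP BY cyl
   ORDER BY cyl",
  parquet_path
))

dbDisconnect(con, shutdown = TRUE)
summary_by_cyl
```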

The torch package provides PyTorch-like functionality natively in R. The keras package interfaces with TensorFlow. However, the deep learning ecosystem is more mature in Python, so many R users switch to Python for deep learning tasks.
