Advanced

Best Practices

Build reproducible, production-ready ML workflows in R with proper deployment, monitoring, and engineering practices.

ML Workflow in R

  1. Define the Problem

    Clearly state what you are predicting and why. Choose the right metric.

  2. Explore the Data

    Use ggplot2, skimr, and DataExplorer for thorough EDA.

  3. Preprocess

    Use recipes for reproducible feature engineering.

  4. Train & Evaluate

    Use cross-validation; never evaluate on the data the model was trained on.

  5. Tune

    Use tune_grid or tune_bayes for hyperparameter optimization.

  6. Deploy

    Use plumber or vetiver to serve models as APIs.
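
Steps 3 to 5 can be sketched end to end with tidymodels. This is a minimal illustration rather than a recommended configuration: it assumes the built-in iris data, a random forest via the ranger engine (which must be installed), and an arbitrary grid size of 5. The object names (rec, spec, wf, final_fit) are chosen here for illustration.

```r
library(tidymodels)

set.seed(42)

# Hold out a test set, stratified by the outcome
split <- initial_split(iris, prop = 0.8, strata = Species)
train <- training(split)
test  <- testing(split)

# Step 3: preprocessing as a recipe, so it is re-estimated inside each resample
rec <- recipe(Species ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# Step 5: mark the hyperparameters to tune
spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 200) |>
  set_engine("ranger") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Step 4: 5-fold cross-validation on the training set only
folds <- vfold_cv(train, v = 5, strata = Species)

# Evaluate 5 candidate hyperparameter combinations on the folds
tuned <- tune_grid(wf, resamples = folds, grid = 5,
                   metrics = metric_set(accuracy))

# Refit the best candidate on the full training set
best      <- select_best(tuned, metric = "accuracy")
final_fit <- finalize_workflow(wf, best) |> fit(data = train)

# Score once on the untouched test set
test_acc <- predict(final_fit, test) |>
  bind_cols(test) |>
  accuracy(truth = Species, estimate = .pred_class)
test_acc
```

Swapping tune_grid() for tune_bayes() keeps the same workflow and resamples; only the search strategy changes.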

Reproducibility

R
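# ALWAYS set seeds for reproducibility
set.seed(42)

# Use renv for package versioning
renv::init()      # run once per project
renv::snapshot()  # record the current package versions in renv.lock

# Log your R session info
sessionInfo()

# Save model artifacts (create the output directory if it does not exist)
dir.create("models", showWarnings = FALSE)
saveRDS(final_model, "models/rf_model_v1.rds")
model <- readRDS("models/rf_model_v1.rds")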
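
To see what these practices buy you, the sketch below checks that a fixed seed reproduces the same random draws and that a model survives a saveRDS()/readRDS() round trip. The lm() fit on mtcars and the temporary file paths are stand-ins chosen for illustration, not part of the workflow above.

```r
# Same seed, same random draws
set.seed(42)
a <- sample(1:100, 5)
set.seed(42)
b <- sample(1:100, 5)
identical(a, b)  # TRUE

# Round-trip a fitted model through an .rds file
fit  <- lm(mpg ~ wt, data = mtcars)
path <- file.path(tempdir(), "model_v1.rds")
saveRDS(fit, path)
restored <- readRDS(path)
all.equal(coef(fit), coef(restored))  # TRUE

# Store the session info next to the model so the environment can be audited later
info_path <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_path)
```

On another machine, renv::restore() reinstalls the package versions recorded in renv.lock before the model is loaded.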

Model Deployment with plumber

R (plumber_api.R)
# plumber_api.R
library(tidymodels)

model <- readRDS("model.rds")

#* Predict species from iris measurements
#* @param sepal_length Sepal length in cm
#* @param sepal_width Sepal width in cm
#* @param petal_length Petal length in cm
#* @param petal_width Petal width in cm
#* @post /predict
function(sepal_length, sepal_width, petal_length, petal_width) {
  new_data <- tibble(
    Sepal.Length = as.numeric(sepal_length),
    Sepal.Width = as.numeric(sepal_width),
    Petal.Length = as.numeric(petal_length),
    Petal.Width = as.numeric(petal_width)
  )
  predict(model, new_data)$.pred_class
}

R
# Run the API
library(plumber)
pr <- plumb("plumber_api.R")
pr$run(port = 8000)
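
Once the API is running, it can be called from any HTTP client. The sketch below uses the httr2 package (an assumption; it is not used elsewhere in this guide) and assumes the server listens on 127.0.0.1:8000. The helper build_predict_request() is a hypothetical name introduced here.

```r
library(httr2)

# Build a POST request for /predict with the four measurements as a JSON body
build_predict_request <- function(base_url = "http://127.0.0.1:8000") {
  request(base_url) |>
    req_url_path_append("predict") |>
    req_body_json(list(
      sepal_length = 5.1,
      sepal_width  = 3.5,
      petal_length = 1.4,
      petal_width  = 0.2
    ))
}

req <- build_predict_request()

# With the API running in another R session, send the request and parse the reply:
# resp <- req_perform(req)
# resp_body_json(resp)
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.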

vetiver for MLOps

R
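library(vetiver)
library(pins)
library(plumber)

# Create a vetiver model object from the fitted workflow
v <- vetiver_model(final_fit, model_name = "iris-classifier")

# Pin the model to a versioned board (board_folder() is unversioned by default)
board <- board_folder("models", versioned = TRUE)
vetiver_pin_write(board, v)

# Add a POST /predict endpoint for the model to a plumber router
pr <- pr() |> vetiver_api(v)

# Generate plumber.R for the pinned model, then a Dockerfile that serves it
vetiver_write_plumber(board, "iris-classifier")
vetiver_write_docker(v)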
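
Pinning pays off when a model is retrained: each write to a versioned board creates a new version, and earlier ones stay retrievable. The sketch below is self-contained, so it uses a temporary board and two small lm() fits as stand-ins for real models; the pin name "sepal-lm" is made up for illustration.

```r
library(vetiver)
library(pins)

# Temporary, versioned board (a stand-in for board_folder() or a cloud board)
board <- board_temp(versioned = TRUE)

# First version of the model
fit_v1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
vetiver_pin_write(board, vetiver_model(fit_v1, "sepal-lm"))

# A retrained model written under the same name
fit_v2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
vetiver_pin_write(board, vetiver_model(fit_v2, "sepal-lm"))

# List the stored versions
versions <- pin_versions(board, "sepal-lm")
versions

# Read the latest version back and predict with it
latest <- vetiver_pin_read(board, "sepal-lm")
preds  <- predict(latest$model, head(iris))
```

Passing one of the listed identifiers as version = to vetiver_pin_read() retrieves an earlier model, which is useful for rollbacks.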

Docker for R ML

Dockerfile
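FROM rocker/r-ver:4.3.0

# System libraries that plumber's dependencies link against
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev libsodium-dev libssl-dev \
    && rm -rf /var/lib/apt/lists/*

RUN install2.r --error \
    tidymodels ranger plumber vetiver

# plumber_api.R loads "model.rds" by relative path, so run from /app
WORKDIR /app
COPY model.rds /app/model.rds
COPY plumber_api.R /app/plumber_api.R

EXPOSE 8000

CMD ["R", "-e", "plumber::plumb('/app/plumber_api.R')$run(host='0.0.0.0', port=8000)"]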

R vs Python for Production ML

Aspect         | R                              | Python
---------------+--------------------------------+------------------------------------
API framework  | plumber                        | Flask, FastAPI (more mature)
MLOps tooling  | vetiver, pins                  | MLflow, Kubeflow (larger ecosystem)
Docker support | rocker images                  | Native Python images (smaller)
Cloud support  | Limited (growing)              | Extensive (SageMaker, Vertex AI)
Model analysis | Excellent (statistical rigor)  | Good (SHAP, LIME)

Common Mistakes

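  • Data leakage: Never let test data influence preprocessing (for example, by computing normalization statistics on the full dataset). Bundling a recipe in a workflow re-estimates every step on the training portion of each resample, which prevents this.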
  • Not setting seeds: Always use set.seed() before any random operation.
  • Evaluating on training data: Always use holdout or cross-validation.
  • Ignoring class imbalance: Use stratified sampling and appropriate metrics (F1, AUC).
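  • Over-tuning: Selecting hyperparameters and reporting performance on the same resamples gives optimistic estimates. Use nested cross-validation, or keep a final test set that tuning never touches.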
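
The class-imbalance advice can be sketched with rsample and yardstick. The data below are simulated purely for illustration (a minority "yes" class driven by a single predictor x), and the 0.5 threshold is an arbitrary choice.

```r
library(rsample)
library(yardstick)

set.seed(42)

# Simulated imbalanced data: "yes" is the minority class
n   <- 1000
x   <- rnorm(n)
dat <- data.frame(x = x, is_yes = rbinom(n, 1, plogis(-2 + 1.5 * x)))
dat$y <- factor(ifelse(dat$is_yes == 1, "yes", "no"), levels = c("yes", "no"))

# Stratified split keeps the class ratio similar in both partitions
split <- initial_split(dat, prop = 0.8, strata = y)
train <- training(split)
test  <- testing(split)

# Simple logistic regression on the training set
fit   <- glm(is_yes ~ x, data = train, family = binomial)
p_yes <- predict(fit, newdata = test, type = "response")

res <- data.frame(
  truth     = test$y,
  .pred_yes = p_yes,
  estimate  = factor(ifelse(p_yes > 0.5, "yes", "no"), levels = c("yes", "no"))
)

# Accuracy can look good on imbalanced data; F1 and AUC focus on the minority class
accuracy(res, truth = truth, estimate = estimate)
f_meas(res, truth = truth, estimate = estimate)
auc <- roc_auc(res, truth, .pred_yes)
auc
```

yardstick treats the first factor level as the event of interest by default, which is why "yes" is listed first in levels.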

Frequently Asked Questions

Can R be used for production ML?

Yes. Use plumber to create REST APIs, vetiver for MLOps workflows, and Docker for containerization. For large-scale deployments, some teams serialize R models and serve them via Python or use Posit Connect.

When should I use R and when Python?

Use R when statistical rigor, interpretability, and visualization are priorities (research, pharma, academia). Use Python for deep learning, production ML systems, and when integrating with larger engineering stacks. Many teams use both.

How do I work with large datasets in R?

Use data.table or arrow for data loading, Parquet format for storage, and DuckDB for SQL-based queries on large files. For distributed computing, consider sparklyr to connect R to Apache Spark.
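
As a small illustration of the Parquet plus DuckDB combination, the sketch below writes a data frame to Parquet with arrow and aggregates it with SQL without loading the file into R first. It uses iris and a temporary path as stand-ins for a genuinely large file, and assumes the arrow, DBI, and duckdb packages are installed.

```r
library(arrow)
library(DBI)
library(duckdb)

# Write a Parquet file (stand-in for a large dataset on disk)
parquet_path <- file.path(tempdir(), "iris.parquet")
write_parquet(iris, parquet_path)

# In-memory DuckDB connection; the query scans the Parquet file directly
con <- dbConnect(duckdb())
query <- sprintf(
  "SELECT Species, AVG(\"Sepal.Length\") AS mean_sepal_length
   FROM read_parquet('%s')
   GROUP BY Species
   ORDER BY Species",
  parquet_path
)
species_means <- dbGetQuery(con, query)
dbDisconnect(con, shutdown = TRUE)

species_means
```

Only the aggregated result (one row per species) is returned to R, which is what keeps memory use low on files that do not fit in RAM.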

Can I do deep learning in R?

The torch package provides PyTorch-like functionality natively in R. The keras package interfaces with TensorFlow. However, the deep learning ecosystem is more mature in Python, so many R users switch to Python for deep learning tasks.