Best Practices
Build reproducible, production-ready ML workflows in R with proper deployment, monitoring, and engineering practices.
ML Workflow in R
Define the Problem
Clearly state what you are predicting and why. Choose the right metric.
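As a rough illustration of why the metric choice matters, the sketch below scores a small, made-up set of predictions for an imbalanced binary problem with yardstick (part of tidymodels). The data frame and its values are hypothetical; only the yardstick calls are real.

```r
library(yardstick)

# Hypothetical predictions: 2 positives out of 10, one of them missed
results <- data.frame(
  truth = factor(
    c("yes", "no", "no", "no", "no", "no", "no", "no", "no", "yes"),
    levels = c("yes", "no")
  ),
  estimate = factor(
    c("no", "no", "no", "no", "no", "no", "no", "no", "no", "yes"),
    levels = c("yes", "no")
  )
)

# Bundle the metrics that matter for the problem into one function
cls_metrics <- metric_set(accuracy, recall, f_meas)

# yardstick treats the first factor level ("yes") as the event by default
scores <- cls_metrics(results, truth = truth, estimate = estimate)
scores
```

Accuracy looks strong here even though half of the positive cases are missed, which is the kind of gap recall and F1 are meant to expose.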
Explore the Data
Use ggplot2, skimr, and DataExplorer for thorough EDA.
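A minimal EDA sketch on the built-in iris data, assuming skimr and ggplot2 are installed. The choice of variables to plot is only illustrative.

```r
library(ggplot2)
library(skimr)

# One summary row per column: type, missingness, and distribution statistics
summary_tbl <- skim(iris)
summary_tbl

# Scatter plot of two predictors, coloured by the outcome class
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  labs(title = "Petal dimensions by species")
p
```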
Preprocess
Use recipes for reproducible feature engineering.
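The sketch below shows one way a recipe keeps preprocessing reproducible: the steps are estimated on the training split only and then applied unchanged to new data. The split proportion and the particular steps are assumptions chosen for illustration.

```r
library(recipes)
library(rsample)

set.seed(42)
split <- initial_split(iris, prop = 0.8, strata = Species)
train <- training(split)
test <- testing(split)

# Declare the preprocessing steps; nothing is computed yet
rec <- recipe(Species ~ ., data = train) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Estimate means and standard deviations from the training data only
prepped <- prep(rec, training = train)

# Apply the trained steps to the training set and to unseen data
train_baked <- bake(prepped, new_data = NULL)
test_baked <- bake(prepped, new_data = test)
```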
Train & Evaluate
Use cross-validation; never evaluate on training data.
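A minimal cross-validation sketch with tidymodels, assuming a decision tree on iris as a stand-in model (the rpart engine ships with R). The fold count and metrics are illustrative choices.

```r
library(tidymodels)

set.seed(42)
# Hold out a test set first, then resample only the training portion
split <- initial_split(iris, prop = 0.8, strata = Species)
folds <- vfold_cv(training(split), v = 5, strata = Species)

wf <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(decision_tree(mode = "classification") |> set_engine("rpart"))

# Fit on 4 folds and score on the held-out fold, 5 times over
cv_results <- fit_resamples(
  wf,
  resamples = folds,
  metrics = metric_set(accuracy, roc_auc)
)

# Average performance across folds, with standard errors
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```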
Tune
Use tune_grid or tune_bayes for hyperparameter optimization.
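As a sketch of grid search, the example below tunes two decision tree hyperparameters over a small regular grid. The model, grid size, and metric are assumptions made to keep the example quick to run.

```r
library(tidymodels)

set.seed(42)
folds <- vfold_cv(iris, v = 5, strata = Species)

# Mark the hyperparameters to optimise with tune()
tree_spec <- decision_tree(cost_complexity = tune(), tree_depth = tune()) |>
  set_mode("classification") |>
  set_engine("rpart")

wf <- workflow() |>
  add_formula(Species ~ .) |>
  add_model(tree_spec)

# 3 levels per parameter gives 9 candidate combinations
grid <- grid_regular(cost_complexity(), tree_depth(), levels = 3)

tuned <- tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metric_set(accuracy)
)

# Pick the best candidate and lock it into the workflow
best <- select_best(tuned, metric = "accuracy")
final_wf <- finalize_workflow(wf, best)
```

Replacing tune_grid with tune_bayes would search the same space iteratively instead of exhaustively, which tends to pay off when each fit is expensive.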
Deploy
Use plumber or vetiver to serve models as APIs.
Reproducibility
```r
# ALWAYS set seeds for reproducibility
set.seed(42)

# Use renv for package versioning
renv::init()
renv::snapshot()

# Log your R session info
sessionInfo()

# Save model artifacts
saveRDS(final_model, "models/rf_model_v1.rds")
model <- readRDS("models/rf_model_v1.rds")
```
Model Deployment with plumber
```r
# plumber_api.R
library(tidymodels)

model <- readRDS("model.rds")

#* Predict species from iris measurements
#* @param sepal_length Sepal length in cm
#* @param sepal_width Sepal width in cm
#* @param petal_length Petal length in cm
#* @param petal_width Petal width in cm
#* @post /predict
function(sepal_length, sepal_width, petal_length, petal_width) {
  new_data <- tibble(
    Sepal.Length = as.numeric(sepal_length),
    Sepal.Width = as.numeric(sepal_width),
    Petal.Length = as.numeric(petal_length),
    Petal.Width = as.numeric(petal_width)
  )
  predict(model, new_data)$.pred_class
}
```
```r
# Run the API
library(plumber)
pr <- plumb("plumber_api.R")
pr$run(port = 8000)
```
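Once the API is running, a client can call it from R. The sketch below uses httr2, which is an assumption (any HTTP client would do), and assumes the server is listening on localhost port 8000. The helper name predict_species is hypothetical.

```r
library(httr2)

# Build a POST request with the four measurements as a JSON body.
# Nothing is sent yet; the request is only described here.
req <- request("http://localhost:8000/predict") |>
  req_body_json(list(
    sepal_length = 5.1,
    sepal_width = 3.5,
    petal_length = 1.4,
    petal_width = 0.2
  ))

# Hypothetical helper: send the request and parse the JSON response.
# Call it only while the plumber API is running.
predict_species <- function(req) {
  resp <- req_perform(req)
  resp_body_json(resp)
}
```

With the server up, predict_species(req) should return the predicted class as parsed JSON.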
vetiver for MLOps
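```r
library(vetiver)
library(pins)
library(plumber)

# Create a vetiver model object
v <- vetiver_model(final_fit, model_name = "iris-classifier")

# Pin model for versioning
board <- board_folder("models", versioned = TRUE)
vetiver_pin_write(board, v)

# Create a plumber API automatically
# (vetiver_api() adds a predict endpoint to an existing plumber router)
api <- pr() |> vetiver_api(v)

# Write a plumber file for the pinned model, then a Dockerfile that serves it
vetiver_write_plumber(board, "iris-classifier")
vetiver_write_docker(v)
```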
Docker for R ML
```dockerfile
FROM rocker/r-ver:4.3.0

RUN install2.r --error \
    tidymodels ranger plumber vetiver

COPY model.rds /app/model.rds
COPY plumber_api.R /app/plumber_api.R

EXPOSE 8000
CMD ["R", "-e", "plumber::plumb('/app/plumber_api.R')$run(host='0.0.0.0', port=8000)"]
```
R vs Python for Production ML
| Aspect | R | Python |
|---|---|---|
| API framework | plumber | Flask, FastAPI (more mature) |
| MLOps tooling | vetiver, pins | MLflow, Kubeflow (larger ecosystem) |
| Docker support | rocker images | Native Python images (smaller) |
| Cloud support | Limited (growing) | Extensive (SageMaker, Vertex AI) |
| Model analysis | Excellent (statistical rigor) | Good (SHAP, LIME) |
Common Mistakes
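- Data leakage: Never let test data influence preprocessing (for example, by computing normalization statistics on the full dataset). Put recipes in a workflow so each step is estimated on training data only.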
- Not setting seeds: Always use set.seed() before any random operation.
- Evaluating on training data: Always use holdout or cross-validation.
- Ignoring class imbalance: Use stratified sampling and appropriate metrics (F1, AUC).
- Over-tuning: Use nested cross-validation to avoid optimistic estimates.
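Several of the points above can be combined in one short sketch: a stratified split, preprocessing wrapped in a workflow, and imbalance-aware metrics on the held-out set. The imbalanced dataset is constructed from iris purely for illustration, and the decision tree is a stand-in model.

```r
library(tidymodels)

# Illustrative imbalanced data: 20 "yes" cases (virginica) versus 100 "no"
dat <- iris[1:120, ]
dat$is_virginica <- factor(
  ifelse(dat$Species == "virginica", "yes", "no"),
  levels = c("yes", "no")
)
dat$Species <- NULL

set.seed(42)
# Stratify so the minority class appears in both training and test sets
split <- initial_split(dat, prop = 0.8, strata = is_virginica)

# The recipe lives inside the workflow, so normalization statistics are
# estimated from the training portion only
wf <- workflow() |>
  add_recipe(
    recipe(is_virginica ~ ., data = training(split)) |>
      step_normalize(all_numeric_predictors())
  ) |>
  add_model(decision_tree(mode = "classification") |> set_engine("rpart"))

# Fit on the training set, evaluate once on the test set
final <- last_fit(wf, split, metrics = metric_set(f_meas, roc_auc))
test_metrics <- collect_metrics(final)
test_metrics
```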
Frequently Asked Questions
R models can be deployed to production. Use plumber to create REST APIs, vetiver for MLOps workflows, and Docker for containerization. For large-scale deployments, some teams serialize R models and serve them via Python or use Posit Connect.
Use R when statistical rigor, interpretability, and visualization are priorities (research, pharma, academia). Use Python for deep learning, production ML systems, and when integrating with larger engineering stacks. Many teams use both.
For datasets too large to handle comfortably in memory, use data.table or arrow for data loading, Parquet format for storage, and DuckDB for SQL-based queries on large files. For distributed computing, consider sparklyr to connect R to Apache Spark.
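A small sketch of the Parquet plus DuckDB combination, assuming the arrow, DBI, and duckdb packages are installed. It uses iris and a temporary file as stand-ins for a genuinely large dataset.

```r
library(arrow)
library(DBI)
library(duckdb)

# Write a data frame to Parquet (stand-in for a large file on disk)
path <- file.path(tempdir(), "iris.parquet")
write_parquet(iris, path)

# Query the file with SQL; DuckDB scans the Parquet file directly
# instead of loading it into an R data frame first
con <- dbConnect(duckdb())
query <- paste0(
  "SELECT Species, AVG(\"Sepal.Length\") AS mean_sepal_length ",
  "FROM read_parquet('", path, "') ",
  "GROUP BY Species ORDER BY Species"
)
res <- dbGetQuery(con, query)
dbDisconnect(con, shutdown = TRUE)

res
```

Only the aggregated result comes back into R, which is what keeps memory use low when the underlying file is large.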
The torch package provides PyTorch-like functionality natively in R. The keras package interfaces with TensorFlow. However, the deep learning ecosystem is more mature in Python, so many R users switch to Python for deep learning tasks.