Ray on Kubernetes Best Practices

Production-proven patterns for monitoring, fault tolerance, resource optimization, and operating multi-tenant Ray clusters on Kubernetes.

Monitoring

Integrate Ray's built-in metrics with Prometheus and Grafana:

  • Enable Prometheus metrics export in the Ray head node configuration
  • Monitor Ray-specific metrics: task queue depth, object store usage, worker utilization
  • Combine with DCGM GPU metrics for full-stack observability
  • Set up alerts for head node failures, OOM events, and task backlogs
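The first two bullets can be sketched as a KubeRay manifest plus a Prometheus Operator scrape target. This is a minimal illustration, not a complete spec: the cluster name, image tag, port number, and label selector are assumptions to adapt to your environment.

```yaml
# Sketch: expose Ray's Prometheus metrics from the head node (illustrative names).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: monitored-cluster
spec:
  headGroupSpec:
    rayStartParams:
      metrics-export-port: "8080"   # Ray serves Prometheus metrics on this port
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0   # pin the same version across head and workers
          ports:
          - name: metrics
            containerPort: 8080
---
# Sketch: a Prometheus Operator ServiceMonitor scraping the head's metrics port.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-head-metrics
spec:
  selector:
    matchLabels:
      ray.io/node-type: head   # assumes KubeRay's head-service labeling
  endpoints:
  - port: metrics
    interval: 15s
```

With scraping in place, Grafana dashboards and alert rules for head failures, OOM events, and task backlogs can be layered on top of the exported series.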

Fault Tolerance Patterns

  • Object reconstruction: Ray automatically reconstructs lost objects by re-executing tasks.
  • Actor checkpointing: Periodically save actor state to persistent storage for recovery.
  • Head node HA: Use GCS fault tolerance with external Redis for head node recovery without losing cluster state.
  • Spot instance support: Configure graceful shutdown handlers and checkpointing for preemptible nodes.
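As one concrete example of the head node HA pattern, a RayCluster can be pointed at an external Redis that persists GCS state. The sketch below follows KubeRay's annotation-based convention; the cluster name, Redis address, and image tag are assumptions, and newer KubeRay releases also offer a dedicated fault-tolerance field in the spec.

```yaml
# Sketch: GCS fault tolerance backed by external Redis (illustrative values).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ha-cluster
  annotations:
    ray.io/ft-enabled: "true"        # enable GCS fault tolerance in KubeRay
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          env:
          - name: RAY_REDIS_ADDRESS
            value: redis:6379        # external Redis holding cluster state
```

If the head pod is rescheduled, a configuration like this lets the new head recover cluster metadata from Redis instead of starting from an empty state.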

Resource Optimization

# Use RayJob for batch workloads (auto-teardown)
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: batch-training
spec:
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 300  # Cleanup after 5 min
  entrypoint: python train.py
  rayClusterSpec:
    # ... cluster spec ...

Production Checklist

  • Image versioning: Pin Ray image versions across head and workers to avoid version mismatches.
  • Resource limits: Always set CPU, memory, and GPU limits to prevent resource contention.
  • Head node protection: Set num-cpus: "0" in the head's rayStartParams so Ray schedules no tasks or actors on it, keeping it responsive for cluster management.
  • Persistent storage: Mount PVCs for checkpoints, logs, and shared data across workers.
  • Network policies: Restrict inter-pod communication to Ray cluster members only.
  • RBAC: Use dedicated service accounts with minimal permissions for KubeRay.
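The network-policies item in the checklist can be sketched as a Kubernetes NetworkPolicy scoped to one Ray cluster's pods. The label key follows KubeRay's convention of labeling pods with their cluster name; the cluster name itself is illustrative, and you should verify the labels actually present on your pods.

```yaml
# Sketch: restrict ingress to pods belonging to the same Ray cluster (illustrative name).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-cluster-isolation
spec:
  podSelector:
    matchLabels:
      ray.io/cluster: batch-training      # selects all pods of this Ray cluster
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          ray.io/cluster: batch-training  # allow traffic only from cluster peers
```

In a multi-tenant namespace you would typically add further ingress rules for Prometheus scraping and dashboard access rather than leaving those paths open by default.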

Congratulations! You've completed the Ray on Kubernetes course. You can now deploy, scale, and operate Ray clusters for distributed AI training and serving on Kubernetes with production-grade reliability.