Ray on Kubernetes Best Practices

Production-proven patterns for monitoring, fault tolerance, resource optimization, and operating multi-tenant Ray clusters on Kubernetes.

Monitoring

Integrate Ray's built-in metrics with Prometheus and Grafana:

  • Enable Prometheus metrics export in the Ray head node configuration
  • Monitor Ray-specific metrics: task queue depth, object store usage, worker utilization
  • Combine with DCGM GPU metrics for full-stack observability
  • Set up alerts for head node failures, OOM events, and task backlogs
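The first two bullets can be sketched as a KubeRay manifest plus a Prometheus Operator scrape target. This is a minimal illustration, not a complete spec: the cluster name, image tag, port number, and label selector are assumptions to adapt to your environment.

```yaml
# Sketch: expose Ray's Prometheus metrics from the head node (illustrative names).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: monitored-cluster
spec:
  headGroupSpec:
    rayStartParams:
      metrics-export-port: "8080"   # Ray serves Prometheus metrics on this port
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0   # pin the same version across head and workers
          ports:
          - name: metrics
            containerPort: 8080
---
# Sketch: a Prometheus Operator ServiceMonitor scraping the head's metrics port.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-head-metrics
spec:
  selector:
    matchLabels:
      ray.io/node-type: head   # assumes KubeRay's head-service labeling
  endpoints:
  - port: metrics
    interval: 15s
```

With scraping in place, Grafana dashboards and alert rules for head failures, OOM events, and task backlogs can be layered on top of the exported series.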

Fault Tolerance Patterns

  • Object reconstruction: Ray automatically reconstructs lost objects by re-executing tasks.
  • Actor checkpointing: Periodically save actor state to persistent storage for recovery.
  • Head node HA: Use GCS fault tolerance with external Redis for head node recovery without losing cluster state.
  • Spot instance support: Configure graceful shutdown handlers and checkpointing for preemptible nodes.
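As one concrete example of the head node HA pattern, a RayCluster can be pointed at an external Redis that persists GCS state. The sketch below follows KubeRay's annotation-based convention; the cluster name, Redis address, and image tag are assumptions, and newer KubeRay releases also offer a dedicated fault-tolerance field in the spec.

```yaml
# Sketch: GCS fault tolerance backed by external Redis (illustrative values).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ha-cluster
  annotations:
    ray.io/ft-enabled: "true"        # enable GCS fault tolerance in KubeRay
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          env:
          - name: RAY_REDIS_ADDRESS
            value: redis:6379        # external Redis holding cluster state
```

If the head pod is rescheduled, a configuration like this lets the new head recover cluster metadata from Redis instead of starting from an empty state.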

Resource Optimization

# Use RayJob for batch workloads (auto-teardown)
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: batch-training
spec:
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 300  # Cleanup after 5 min
  entrypoint: python train.py
  rayClusterSpec:
    # ... cluster spec ...

Production Checklist

  • Image versioning: Pin Ray image versions across head and workers to avoid version mismatches.
  • Resource limits: Always set CPU, memory, and GPU limits to prevent resource contention.
  • Head node protection: Set num-cpus: "0" in the head's rayStartParams so Ray schedules no tasks or actors on it, keeping it responsive for cluster management.
  • Persistent storage: Mount PVCs for checkpoints, logs, and shared data across workers.
  • Network policies: Restrict inter-pod communication to Ray cluster members only.
  • RBAC: Use dedicated service accounts with minimal permissions for KubeRay.
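The network-policies item in the checklist can be sketched as a Kubernetes NetworkPolicy scoped to one Ray cluster's pods. The label key follows KubeRay's convention of labeling pods with their cluster name; the cluster name itself is illustrative, and you should verify the labels actually present on your pods.

```yaml
# Sketch: restrict ingress to pods belonging to the same Ray cluster (illustrative name).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-cluster-isolation
spec:
  podSelector:
    matchLabels:
      ray.io/cluster: batch-training      # selects all pods of this Ray cluster
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          ray.io/cluster: batch-training  # allow traffic only from cluster peers
```

In a multi-tenant namespace you would typically add further ingress rules for Prometheus scraping and dashboard access rather than leaving those paths open by default.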

Congratulations! You've completed the Ray on Kubernetes course. You can now deploy, scale, and operate Ray clusters for distributed AI training and serving on Kubernetes with production-grade reliability.