Configuring Ray Clusters

Configure head and worker nodes, set up Ray autoscaling with KubeRay, manage GPU resources, and build heterogeneous clusters with multiple worker groups.

Autoscaling Configuration

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: autoscaling-cluster
spec:
  rayVersion: "2.9.0"
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 300
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"  # Don't schedule tasks on head
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 1
    minReplicas: 0
    maxReplicas: 10
    rayStartParams:
      num-gpus: "1"
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu
          resources:
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: 1
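
Assuming the manifest above is saved as raycluster.yaml and the KubeRay operator is already installed, the cluster can be created and inspected with kubectl (the pod label follows KubeRay's convention of tagging pods with the owning cluster's name):

```shell
# Create the RayCluster custom resource
kubectl apply -f raycluster.yaml

# Check the cluster's status
kubectl get raycluster autoscaling-cluster

# List its pods; KubeRay labels them with ray.io/cluster=<name>
kubectl get pods -l ray.io/cluster=autoscaling-cluster
```

Initially only the head pod and one GPU worker appear; further workers are created on demand by the autoscaler.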

Heterogeneous Clusters

Define multiple worker groups with different resource profiles for varied workloads:

workerGroupSpecs:
- groupName: cpu-workers     # For data processing
  rayStartParams: {}
  replicas: 4
  minReplicas: 2
  maxReplicas: 20
  template:
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:2.9.0
        resources:
          limits:
            cpu: "16"
            memory: "64Gi"
- groupName: gpu-workers     # For training/inference
  rayStartParams:
    num-gpus: "4"
  replicas: 2
  minReplicas: 0
  maxReplicas: 8
  template:
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:2.9.0-gpu
        resources:
          limits:
            cpu: "8"
            memory: "64Gi"
            nvidia.com/gpu: 4
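
With multiple groups defined, workloads select a group implicitly through their resource requests: a task asking for GPUs can only land on gpu-workers, while CPU-only work stays on cpu-workers. As a sketch (assuming the cluster is named autoscaling-cluster, so KubeRay exposes a <cluster-name>-head-svc service, and using hypothetical preprocess.py / train.py scripts), jobs can be steered with the Ray Jobs CLI:

```shell
# Expose the Ray dashboard from the head service (port 8265)
kubectl port-forward svc/autoscaling-cluster-head-svc 8265:8265 &

# CPU-heavy entrypoint: schedulable only on cpu-workers
ray job submit --address http://localhost:8265 \
  --entrypoint-num-cpus 8 -- python preprocess.py

# GPU entrypoint: triggers scale-up of gpu-workers (from zero if idle)
ray job submit --address http://localhost:8265 \
  --entrypoint-num-gpus 1 -- python train.py
```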
💡 Autoscaling behavior: Ray's autoscaler works in concert with Kubernetes. Ray requests more workers based on pending tasks, KubeRay creates the pods, and the Kubernetes cluster autoscaler provisions new nodes if needed. Scale-down removes idle workers after idleTimeoutSeconds (300 seconds in the example above).

💡 Cost optimization: Set minReplicas: 0 for GPU workers so they scale to zero when idle. Use num-cpus: "0" on the head node to prevent it from running compute tasks, keeping it free for cluster management.
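
To watch these scaling decisions in practice, two views are useful (a sketch assuming the cluster name from the first example):

```shell
# Locate the head pod via KubeRay's node-type label
HEAD_POD=$(kubectl get pods \
  -l ray.io/cluster=autoscaling-cluster,ray.io/node-type=head -o name | head -n 1)

# Ray's own view: current resources and pending resource demands
kubectl exec -it "$HEAD_POD" -- ray status

# KubeRay runs the autoscaler as a sidecar container in the head pod
kubectl logs -f "$HEAD_POD" -c autoscaler
```

Pending demands shown by ray status that exceed current capacity are exactly what drives the autoscaler to request more worker pods.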