
Core Concepts

The fundamental building blocks of Kubernetes — Pods, Deployments, Services, and resource management — applied to machine learning workloads.

Pods: The Smallest Unit

A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share the same network namespace and storage volumes. For ML workloads, a Pod typically runs a single training script or inference server.

# Example: Pod running a PyTorch training container
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
  labels:
    app: ml-training
    framework: pytorch
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "train.py"]
    resources:
      requests:
        memory: "8Gi"
        cpu: "4"
      limits:
        memory: "16Gi"
        cpu: "8"
💡
ML context: Always set resource requests and limits for ML Pods. Training jobs are resource-intensive — without limits, a single training job can consume all cluster resources and starve other workloads.

Deployments: Managing Replicas

A Deployment manages a set of identical Pod replicas. It handles rolling updates, rollbacks, and scaling. For ML workloads, Deployments are ideal for model serving — running multiple replicas of an inference server behind a load balancer.

# Example: Deployment for model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: server
        image: myregistry/bert-serving:v1.2
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

Key Deployment Features for ML

  • Rolling updates — Deploy new model versions without downtime. Gradually replace old Pods with new ones.
  • Rollbacks — If a new model version performs poorly, roll back to the previous version with one command.
  • Scaling — Scale up replicas during peak inference load, scale down during quiet periods.
  • Readiness probes — Essential for ML: models take time to load into memory. The probe ensures traffic is only sent to Pods that have finished loading.
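The rolling-update behavior described above is configurable on the Deployment itself. A minimal sketch, extending the model-serving Deployment from the example (the field values and the `v1.3` image tag are illustrative, not prescriptive):

```yaml
# Example: tuning the rolling update strategy for zero-downtime model rollouts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # at most 1 extra Pod above desired count during the update
      maxUnavailable: 0   # never drop below the desired replica count
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: server
        image: myregistry/bert-serving:v1.3   # hypothetical new model version
```

With maxUnavailable set to 0, Kubernetes starts a new Pod (and waits for its readiness probe) before terminating an old one, which is what keeps serving capacity intact during a model upgrade.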

Services: Exposing Workloads

A Service provides a stable network endpoint for a set of Pods. Pods are ephemeral (they can be created and destroyed), but Services provide a consistent IP and DNS name.

Service Types

  • ClusterIP (default) — Internal cluster access only. Use for internal model APIs that other services call.
  • NodePort — Exposes the Service on a static port on each node. Use for development and testing.
  • LoadBalancer — Provisions an external load balancer (cloud providers). Use for production model endpoints.

# Example: Service for model serving
apiVersion: v1
kind: Service
metadata:
  name: model-api
spec:
  selector:
    app: inference-server
  ports:
  - port: 80
    targetPort: 8080
  type: LoadBalancer

Namespaces: Isolation for Teams

Namespaces provide logical isolation within a cluster. In ML organizations, you typically have separate namespaces for different teams or environments.

  • ml-training — For training jobs with GPU access
  • ml-serving — For production inference endpoints
  • ml-experiments — For data scientists running experiments
  • data-pipeline — For ETL and data preprocessing
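Namespaces like those above are created with a short manifest. A sketch for one of them (the ownership label is a hypothetical convention, not required by Kubernetes):

```yaml
# Example: Namespace for GPU training workloads
apiVersion: v1
kind: Namespace
metadata:
  name: ml-training
  labels:
    team: ml-platform   # hypothetical label for tracking ownership
```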

Resource Management

Resource management is critical for ML workloads because training jobs consume significant CPU, memory, and GPU resources.

Requests vs Limits

  • Requests — The minimum resources the container needs. The scheduler uses this to find a suitable node.
  • Limits — The maximum resources the container can use. If exceeded, the container is throttled (CPU) or killed (memory).
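GPUs work differently from CPU and memory: they are exposed as extended resources and cannot be overcommitted, so the request always equals the limit (in practice you specify only the limit). A minimal sketch, assuming the NVIDIA device plugin is installed on the cluster so that `nvidia.com/gpu` is available:

```yaml
# Example: requesting one GPU for a training container
# Assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1   # GPUs are not overcommittable; request == limit
```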

ResourceQuotas

ResourceQuotas limit the total resources a namespace can consume. This prevents a single team from monopolizing the cluster.

# Example: ResourceQuota for ML training namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-training-quota
  namespace: ml-training
spec:
  hard:
    requests.cpu: "32"
    requests.memory: "128Gi"
    limits.cpu: "64"
    limits.memory: "256Gi"
    pods: "20"

Labels and Selectors

Labels are key-value pairs attached to objects. Selectors filter objects by labels. For ML, use labels to organize workloads by framework, team, experiment, or model version.

  • framework: pytorch or framework: tensorflow
  • workload-type: training or workload-type: inference
  • model-version: v1.2.3
  • team: nlp or team: computer-vision
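Selectors then filter on these labels, for example in a Service's selector field. A sketch combining two of the labels above (the Service name is illustrative); when multiple labels are listed, a Pod must match all of them:

```yaml
# Example: Service that selects only inference Pods owned by the NLP team
apiVersion: v1
kind: Service
metadata:
  name: nlp-inference
spec:
  selector:
    workload-type: inference
    team: nlp           # both labels must match (logical AND)
  ports:
  - port: 80
    targetPort: 8080
```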

Practice Questions

📝
Q1: A data science team deploys a model serving endpoint with 3 replicas. They want to ensure that no traffic is sent to a Pod until the model is fully loaded into memory (which takes about 60 seconds). Which Kubernetes feature should they use?

A) Liveness probe
B) Readiness probe
C) Startup probe
D) Init container

B) Readiness probe. A readiness probe determines when a Pod is ready to receive traffic. Until the probe succeeds, the Pod is removed from Service endpoints. This is essential for ML serving because models need time to load into memory. A liveness probe checks if the container is alive (restart if not), which is different.

📝
Q2: An ML platform team wants to prevent the NLP team from using more than 64 CPUs and 256Gi of memory in their namespace. Which Kubernetes resource should they create?

A) LimitRange
B) ResourceQuota
C) PriorityClass
D) NetworkPolicy

B) ResourceQuota. A ResourceQuota limits the total aggregate resources that can be consumed in a namespace. LimitRange sets default and max resources per Pod/container, not total namespace limits. PriorityClass controls Pod scheduling priority, and NetworkPolicy controls network traffic.

📝
Q3: You need to deploy a new version of a model serving application without any downtime. The current version is running with 5 replicas. Which Kubernetes object handles this automatically?

A) Job
B) DaemonSet
C) Deployment
D) StatefulSet

C) Deployment. Deployments support rolling updates by default, gradually replacing old Pods with new ones to maintain availability. Jobs are for batch tasks, DaemonSets run one Pod per node, and StatefulSets are for stateful applications that need stable identities.

📝
Q4: A training Pod is being killed repeatedly with OOMKilled status. What is the most likely cause?

A) The CPU limit is too low
B) The memory limit is too low
C) The readiness probe is failing
D) The node has no GPU

B) The memory limit is too low. OOMKilled (Out Of Memory Killed) means the container exceeded its memory limit and was terminated by the kernel's OOM killer. The fix is to increase the memory limit in the Pod spec. ML training jobs often need large amounts of memory for loading datasets and model parameters.

📝
Q5: Which Service type should you use to expose a production model inference API to external clients on a cloud provider?

A) ClusterIP
B) NodePort
C) LoadBalancer
D) ExternalName

C) LoadBalancer. On cloud providers, a LoadBalancer Service automatically provisions an external load balancer with a public IP, making it the standard choice for production APIs. ClusterIP is internal only, NodePort exposes on a static high port (not ideal for production), and ExternalName is for mapping to external DNS names.