Intermediate

Networking & Storage

Persistent storage for datasets and model checkpoints, Ingress for exposing model APIs, NetworkPolicies for securing ML pipelines, and StorageClasses for different workload needs.

Persistent Storage for ML

ML workloads need persistent storage for datasets, model checkpoints, training logs, and saved models. Kubernetes provides PersistentVolumes (PV) and PersistentVolumeClaims (PVC) to decouple storage from Pods.

PersistentVolumeClaim for a Dataset

# PVC for training dataset
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: imagenet-dataset
  namespace: ml-training
spec:
  accessModes:
    - ReadOnlyMany       # Multiple training Pods can read simultaneously
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi

Access Modes for ML

  • ReadWriteOnce (RWO) — Single node read-write. Use for model checkpoints where only one training Pod writes.
  • ReadOnlyMany (ROX) — Multiple nodes can read. Ideal for shared datasets that multiple training jobs access simultaneously.
  • ReadWriteMany (RWX) — Multiple nodes can read and write. Use for shared model registries or experiment logs. Requires NFS or a distributed filesystem.
💡
ML pattern: Use ReadOnlyMany for datasets (shared across Pods), ReadWriteOnce for model checkpoints (single writer), and ReadWriteMany for shared experiment tracking. This minimizes storage costs while meeting access requirements.
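Following this pattern, a checkpoint volume uses ReadWriteOnce, since only one training Pod writes checkpoints at a time. A minimal sketch (the claim name and size are illustrative; fast-ssd refers to the StorageClass defined in the next section):

# PVC for model checkpoints (single writer)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoint-pvc
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteOnce      # One node mounts read-write
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi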

StorageClasses

StorageClasses define different tiers of storage. For ML workloads, you typically need at least two classes.

# Fast SSD storage for training data
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # EBS CSI driver (the in-tree plugin does not support gp3)
parameters:
  type: gp3
  iopsPerGB: "50"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
# Standard storage for model archives
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: st1
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
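A workload selects a tier simply by naming the class in its claim. For example, archiving trained models on the cheaper tier might look like this (the claim name, namespace, and size are illustrative):

# PVC for archived model artifacts on the cheap tier
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-archive
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-hdd
  resources:
    requests:
      storage: 1Ti

Note that WaitForFirstConsumer delays volume creation until a Pod using the claim is scheduled, so the volume is provisioned in the same availability zone as the node that will mount it.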

Ingress for Model APIs

Ingress exposes HTTP/HTTPS routes from outside the cluster to Services inside. For ML, Ingress is used to expose model serving endpoints with TLS, path-based routing, and load balancing.

# Ingress for multiple model endpoints
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-api-ingress
  namespace: ml-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.ml.example.com
    secretName: ml-api-tls
  rules:
  - host: api.ml.example.com
    http:
      paths:
      - path: /v1/sentiment
        pathType: Prefix
        backend:
          service:
            name: sentiment-model-svc
            port:
              number: 80
      - path: /v1/translate
        pathType: Prefix
        backend:
          service:
            name: translation-model-svc
            port:
              number: 80
      - path: /v1/embeddings
        pathType: Prefix
        backend:
          service:
            name: embedding-model-svc
            port:
              number: 80
💡
ML-specific annotations: Increase proxy-body-size for endpoints that receive large inputs (images, documents). Increase proxy-read-timeout for models with long inference times (LLMs, complex vision models). Default timeouts will cause 504 errors for slow models.

NetworkPolicies for ML Security

NetworkPolicies control traffic flow between Pods. In ML environments, use them to isolate training from serving, restrict access to data stores, and secure inter-service communication.

# Allow only serving Pods to access the model store
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-store-access
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      role: model-store
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: inference-server
    ports:
    - port: 9000
      protocol: TCP

Common ML Network Policies

  • Isolate training namespace — Prevent training Pods from accessing the internet (data exfiltration prevention)
  • Restrict data access — Only authorized Pods can access dataset storage
  • Allow distributed training — Permit inter-Pod communication within a training Job for gradient synchronization
  • Expose serving only — Only model serving Pods can receive external traffic via Ingress
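The first pattern above can be sketched as an egress policy on the training namespace whose only allow rule targets in-cluster destinations, so connections to external IPs are dropped while inter-Pod traffic (including gradient synchronization and DNS) keeps working. The namespace name is an assumption:

# Block internet egress from training Pods; allow in-cluster traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-external-egress
  namespace: ml-training
spec:
  podSelector: {}             # Applies to all Pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}   # Allow traffic to any Pod in the cluster

This is a sketch, not a complete policy: if training Pods must reach services outside the cluster (for example, an external object store), those destinations need explicit additional egress rules.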

Volume Mounts in ML Pods

# Training Pod with multiple volume mounts
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  containers:
  - name: trainer
    image: myregistry/trainer:v1.0
    volumeMounts:
    - name: dataset
      mountPath: /data
      readOnly: true        # Dataset is read-only
    - name: checkpoints
      mountPath: /checkpoints  # Save model checkpoints
    - name: logs
      mountPath: /logs         # TensorBoard logs
    - name: config
      mountPath: /config       # Training configuration
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: imagenet-dataset
  - name: checkpoints
    persistentVolumeClaim:
      claimName: checkpoint-pvc
  - name: logs
    emptyDir: {}             # Ephemeral, lost when Pod dies
  - name: config
    configMap:
      name: training-config

Practice Questions

📝
Q1: A team has 10 training Pods that need to read the same 500GB dataset simultaneously. Each Pod only reads data (never writes). Which PVC access mode should they use?

A) ReadWriteOnce
B) ReadOnlyMany
C) ReadWriteMany
D) ReadWriteOncePod
Answer:

B) ReadOnlyMany (ROX). ReadOnlyMany allows multiple nodes to mount the volume in read-only mode simultaneously. Since the training Pods only read the dataset, this is the most appropriate and secure choice. ReadWriteMany would also work but grants unnecessary write permissions.

📝
Q2: A model serving endpoint returns 504 Gateway Timeout errors when processing large images. The model takes up to 90 seconds for complex images. The default Ingress timeout is 60 seconds. How do you fix this?

A) Increase the readiness probe timeout
B) Add an annotation to increase the Ingress proxy-read-timeout
C) Increase the Pod memory limit
D) Add more replicas to the Deployment
Answer:

B) Add an annotation to increase the Ingress proxy-read-timeout. The 504 error occurs because the Ingress controller times out before the model finishes processing. Add nginx.ingress.kubernetes.io/proxy-read-timeout: "120" to the Ingress annotations. Readiness probes and memory limits are unrelated to request timeout.

📝
Q3: You want training data to persist even after the PVC is deleted, so it can be manually recovered. Which reclaimPolicy should the StorageClass use?

A) Delete
B) Retain
C) Recycle
D) Archive
Answer:

B) Retain. The Retain reclaim policy keeps the underlying PersistentVolume and its data when the PVC is deleted. An administrator can then manually recover the data. Delete removes the volume when the PVC is deleted. Recycle is deprecated. Archive does not exist in Kubernetes.

📝
Q4: You need to expose three different model APIs (sentiment, translation, embeddings) through a single external hostname with different URL paths. Which Kubernetes resource should you use?

A) Three LoadBalancer Services
B) One Ingress with path-based routing
C) Three NodePort Services
D) One ClusterIP Service
Answer:

B) One Ingress with path-based routing. Ingress supports path-based routing, directing /v1/sentiment, /v1/translate, and /v1/embeddings to different backend Services through a single external IP and hostname. Three LoadBalancer Services would work but waste IP addresses and cost more on cloud providers.

📝
Q5: A security audit requires that only model serving Pods can access the model storage service on port 9000. All other Pods must be blocked. Which Kubernetes resource enforces this?

A) ResourceQuota
B) RBAC Role
C) NetworkPolicy
D) PodSecurityPolicy
Answer:

C) NetworkPolicy. A NetworkPolicy with an ingress rule that selects the model-store Pods and allows traffic only from Pods labeled app: inference-server on port 9000. RBAC controls API access (who can create/read resources), not network traffic between Pods. ResourceQuotas limit resource consumption, not network access.