Networking & Storage
Persistent storage for datasets and model checkpoints, Ingress for exposing model APIs, NetworkPolicies for securing ML pipelines, and StorageClasses for different workload needs.
Persistent Storage for ML
ML workloads need persistent storage for datasets, model checkpoints, training logs, and saved models. Kubernetes provides PersistentVolumes (PV) and PersistentVolumeClaims (PVC) to decouple storage from Pods.
PersistentVolumeClaim for a Dataset
# PVC for training dataset
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: imagenet-dataset
  namespace: ml-training
spec:
  accessModes:
    - ReadOnlyMany              # Multiple training Pods can read simultaneously
  storageClassName: fast-ssd    # The backing provisioner must support ReadOnlyMany
  resources:
    requests:
      storage: 500Gi
Access Modes for ML
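- ReadWriteOnce (RWO): The volume can be mounted read-write by a single node. Use for model checkpoints where only one training Pod writes.
- ReadOnlyMany (ROX): The volume can be mounted read-only by many nodes. Ideal for shared datasets that multiple training jobs read simultaneously.
- ReadWriteMany (RWX): The volume can be mounted read-write by many nodes. Use for shared model registries or experiment logs. Requires NFS or a distributed filesystem.
Which modes are available depends on the volume plugin. Block storage such as AWS EBS generally supports only ReadWriteOnce, so ROX and RWX claims need a backend such as NFS or a distributed filesystem.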
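As an illustration of the ReadWriteOnce case, here is a minimal sketch of a checkpoint claim. The name checkpoint-pvc matches the claim mounted by the training Pod later in this section; the 100Gi size is an assumption chosen for illustration, not a recommendation.

```yaml
# Sketch: PVC for model checkpoints written by a single training Pod
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoint-pvc
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteOnce             # One node mounts the volume read-write
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi            # Assumed size; adjust to checkpoint size and retention
```

Because the claim is ReadWriteOnce, a replacement training Pod scheduled to a different node can mount it only after the volume has been detached from the previous node.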
StorageClasses
StorageClasses define tiers of storage with different performance and cost. ML workloads typically need at least two: a fast tier for training data and a cheaper tier for model archives.
# Fast SSD storage for training data
# Uses the AWS EBS CSI driver; the in-tree kubernetes.io/aws-ebs provisioner is deprecated
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iopsPerGB: "50"
reclaimPolicy: Retain                      # Keep the volume and its data after the PVC is deleted
volumeBindingMode: WaitForFirstConsumer    # Provision in the zone where the Pod is scheduled
---
# Standard storage for model archives
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: st1
reclaimPolicy: Delete                      # Remove the volume when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
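A claim selects a tier by naming the StorageClass. The following is a minimal sketch of a claim against the standard-hdd class; the name model-archive, the namespace, and the size are hypothetical.

```yaml
# Sketch: PVC for archived models on the cheaper HDD tier
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-archive           # Hypothetical name
  namespace: ml-training        # Assumed namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-hdd
  resources:
    requests:
      storage: 500Gi            # Assumed size; AWS st1 volumes have a 125 GiB minimum
```

Since standard-hdd uses reclaimPolicy: Delete, deleting this claim also deletes the underlying volume, so it suits archives that can be regenerated or are backed up elsewhere.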
Ingress for Model APIs
Ingress exposes HTTP/HTTPS routes from outside the cluster to Services inside. For ML, Ingress is used to expose model serving endpoints with TLS, path-based routing, and load balancing.
# Ingress for multiple model endpoints
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-api-ingress
  namespace: ml-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.ml.example.com
      secretName: ml-api-tls
  rules:
    - host: api.ml.example.com
      http:
        paths:
          - path: /v1/sentiment
            pathType: Prefix
            backend:
              service:
                name: sentiment-model-svc
                port:
                  number: 80
          - path: /v1/translate
            pathType: Prefix
            backend:
              service:
                name: translation-model-svc
                port:
                  number: 80
          - path: /v1/embeddings
            pathType: Prefix
            backend:
              service:
                name: embedding-model-svc
                port:
                  number: 80
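Each backend in the Ingress refers to a Service in the same namespace. Here is a minimal sketch of one such Service, assuming the sentiment model Pods carry a hypothetical label app: sentiment-model and listen on a hypothetical container port 8080.

```yaml
# Sketch: Service behind the /v1/sentiment path
apiVersion: v1
kind: Service
metadata:
  name: sentiment-model-svc     # Must match the backend name in the Ingress
  namespace: ml-serving         # Must be the same namespace as the Ingress
spec:
  type: ClusterIP               # Internal only; the Ingress handles external traffic
  selector:
    app: sentiment-model        # Hypothetical Pod label
  ports:
    - port: 80                  # Port referenced by the Ingress backend
      targetPort: 8080          # Hypothetical container port of the model server
      protocol: TCP
```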
Increase proxy-body-size for endpoints that receive large inputs (images, documents). Increase proxy-read-timeout for models with long inference times (LLMs, complex vision models). Requests that exceed the controller's default read timeout fail with a 504 Gateway Timeout.
NetworkPolicies for ML Security
NetworkPolicies control traffic flow between Pods. In ML environments, use them to isolate training from serving, restrict access to data stores, and secure inter-service communication. They take effect only when the cluster's network plugin enforces NetworkPolicy.
# Allow only serving Pods to access the model store
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-store-access
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      role: model-store
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: inference-server
      ports:
        - port: 9000
          protocol: TCP
Common ML Network Policies
- Isolate training namespace — Prevent training Pods from accessing the internet (data exfiltration prevention)
- Restrict data access — Only authorized Pods can access dataset storage
- Allow distributed training — Permit inter-Pod communication within a training Job for gradient synchronization
- Expose serving only — Only model serving Pods can receive external traffic via Ingress
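The first and third patterns above can be sketched as follows. The policy names and the training-job label are hypothetical, and the DNS rule assumes cluster DNS runs in the kube-system namespace.

```yaml
# Sketch: block internet egress from the training namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-restrict-egress      # Hypothetical name
  namespace: ml-training
spec:
  podSelector: {}                     # Applies to every Pod in the namespace
  policyTypes:
    - Egress
  egress:
    # Allow DNS lookups (assumes cluster DNS runs in kube-system)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow traffic to other Pods in the same namespace
    - to:
        - podSelector: {}
---
# Sketch: allow workers of one distributed training Job to reach each other
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-distributed-training    # Hypothetical name
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      training-job: resnet-ddp        # Hypothetical label set on the Job's Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              training-job: resnet-ddp
```

The egress policy also blocks external object stores and package mirrors, so any endpoints that training legitimately needs would have to be allowed explicitly.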
Volume Mounts in ML Pods
# Training Pod with multiple volume mounts
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  namespace: ml-training          # Must match the namespace of the PVCs it mounts
spec:
  containers:
    - name: trainer
      image: myregistry/trainer:v1.0
      volumeMounts:
        - name: dataset
          mountPath: /data
          readOnly: true          # Dataset is read-only
        - name: checkpoints
          mountPath: /checkpoints # Save model checkpoints
        - name: logs
          mountPath: /logs        # TensorBoard logs
        - name: config
          mountPath: /config      # Training configuration
  volumes:
    - name: dataset
      persistentVolumeClaim:
        claimName: imagenet-dataset
    - name: checkpoints
      persistentVolumeClaim:
        claimName: checkpoint-pvc
    - name: logs
      emptyDir: {}                # Ephemeral: deleted when the Pod is removed from its node
    - name: config
      configMap:
        name: training-config
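The config volume in the Pod is backed by a ConfigMap named training-config. Here is a minimal sketch of what it might contain, assuming the Pod runs in the ml-training namespace; the file name and every hyperparameter value are hypothetical.

```yaml
# Sketch: ConfigMap mounted at /config by the training Pod
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-config
  namespace: ml-training          # Assumed; must match the Pod's namespace
data:
  # Each key becomes a file under the mount path, here /config/hyperparams.yaml
  hyperparams.yaml: |
    learning_rate: 0.001
    batch_size: 256
    epochs: 90
```

Keeping hyperparameters in a ConfigMap lets the same trainer image be reused across experiments by swapping only the mounted configuration.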
Practice Questions
A) ReadWriteOnce
B) ReadOnlyMany
C) ReadWriteMany
D) ReadWriteOncePod
Answer: B) ReadOnlyMany (ROX). ReadOnlyMany allows multiple nodes to mount the volume in read-only mode simultaneously. Since the training Pods only read the dataset, this is the most appropriate and secure choice. ReadWriteMany would also work but grants unnecessary write permissions.
A) Increase the readiness probe timeout
B) Add an annotation to increase the Ingress proxy-read-timeout
C) Increase the Pod memory limit
D) Add more replicas to the Deployment
Answer: B) Add an annotation to increase the Ingress proxy-read-timeout. The 504 error occurs because the Ingress controller times out before the model finishes processing. Add nginx.ingress.kubernetes.io/proxy-read-timeout: "120" to the Ingress annotations. Readiness probes and memory limits are unrelated to request timeout.
Which reclaimPolicy should the StorageClass use?
A) Delete
B) Retain
C) Recycle
D) Archive
Answer: B) Retain. The Retain reclaim policy keeps the underlying PersistentVolume and its data when the PVC is deleted. An administrator can then manually recover the data. Delete removes the volume when the PVC is deleted. Recycle is deprecated. Archive does not exist in Kubernetes.
A) Three LoadBalancer Services
B) One Ingress with path-based routing
C) Three NodePort Services
D) One ClusterIP Service
Answer: B) One Ingress with path-based routing. Ingress supports path-based routing, directing /v1/sentiment, /v1/translate, and /v1/embeddings to different backend Services through a single external IP and hostname. Three LoadBalancer Services would work but waste IP addresses and cost more on cloud providers.
A) ResourceQuota
B) RBAC Role
C) NetworkPolicy
D) PodSecurityPolicy
Answer: C) NetworkPolicy. Use a NetworkPolicy whose podSelector selects the model-store Pods and whose ingress rule allows traffic only from Pods labeled app: inference-server on port 9000. RBAC controls API access (who can create or read resources), not network traffic between Pods. ResourceQuotas limit resource consumption, not network access. PodSecurityPolicy governed Pod security settings, not Pod-to-Pod traffic, and was removed in Kubernetes 1.25.
Lilly Tech Systems