# Kubernetes Security for ML Workloads
Kubernetes is the dominant orchestration platform for ML at scale. Securing ML workloads requires proper RBAC, pod security standards, network policies, and GPU-aware scheduling controls.
## Pod Security Standards for ML
Kubernetes Pod Security Standards (PSS) define three cumulative security profiles. ML workloads should target the Restricted profile wherever possible:
| Profile | ML Use Case | Restrictions |
|---|---|---|
| Privileged | GPU driver installation (DaemonSet only) | None — unrestricted access |
| Baseline | Training jobs requiring host networking or IPC | Prevents known privilege escalations |
| Restricted | Model inference, data preprocessing, API serving | Enforces hardening best practices |
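The Restricted profile is typically enforced with namespace labels via the built-in Pod Security admission controller. A minimal sketch (the namespace name `ml-inference` is a hypothetical example):

```yaml
# Namespace with Pod Security admission labels.
# "enforce" rejects non-compliant pods; "warn" surfaces violations
# without blocking, useful while migrating existing workloads.
apiVersion: v1
kind: Namespace
metadata:
  name: ml-inference        # hypothetical namespace name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
```

Pods admitted into this namespace must then run as non-root, drop all capabilities, and disable privilege escalation, which matches the hardening expectations for inference and preprocessing workloads.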
## RBAC for ML Namespaces
Implement role-based access control that separates concerns across ML teams:
- **Namespace Isolation** — Create dedicated namespaces for training, inference, and data processing. Apply resource quotas per namespace to prevent a single training job from consuming all cluster GPU resources.
- **Service Account Scoping** — Each ML workload type should have its own service account with minimal permissions. Training jobs need access to data volumes and model storage; inference pods need only read access to model artifacts.
- **Secret Access Control** — Use RBAC to restrict which service accounts can read which secrets. Data pipeline credentials should not be accessible from inference pods.
- **Audit Logging** — Enable Kubernetes audit logging for all API server requests in ML namespaces. Track who created, modified, or accessed GPU workloads and their associated secrets.
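The service account scoping and secret restrictions above can be sketched as a namespaced Role and RoleBinding. All names (`inference-sa`, `model-registry-creds`, `ml-inference`) are hypothetical placeholders:

```yaml
# Dedicated service account for inference pods.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-sa          # hypothetical service account
  namespace: ml-inference
---
# Role granting read-only access to model-serving config and
# exactly one named secret; pipeline credentials stay unreadable.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: model-artifact-reader
  namespace: ml-inference
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["model-registry-creds"]  # hypothetical secret
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: inference-sa-model-reader
  namespace: ml-inference
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: model-artifact-reader
subjects:
  - kind: ServiceAccount
    name: inference-sa
    namespace: ml-inference
```

Using `resourceNames` in the secrets rule is the key detail: without it, `get` on secrets grants access to every secret in the namespace.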
## Network Policies for ML
ML clusters have specific network security requirements:
- Training isolation: Multi-GPU training pods need to communicate with each other (NCCL, Gloo) but should not reach the internet or unrelated services
- Inference lockdown: Model serving pods should only accept traffic from the API gateway and reach the model storage backend
- Data pipeline controls: Restrict data preprocessing pods to only access approved data sources and the training namespace
- Egress filtering: Block outbound internet access from ML pods except for explicitly whitelisted endpoints (model registries, package mirrors)
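The inference-lockdown pattern above can be expressed as a NetworkPolicy that is deny-by-default for the selected pods. Labels, namespaces, and ports here are hypothetical assumptions:

```yaml
# Allow ingress only from the API gateway namespace and egress
# only to the model storage backend. Because this policy selects
# the pods and declares both policyTypes, all other traffic to
# and from them is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-lockdown
  namespace: ml-inference          # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: model-server            # hypothetical pod label
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: api-gateway    # hypothetical namespace label
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              role: model-storage  # hypothetical namespace label
      ports:
        - protocol: TCP
          port: 443
```

Training isolation follows the same shape: allow pod-to-pod traffic within the training job's label selector (for NCCL/Gloo rendezvous) and deny everything else. Note that enforcement requires a CNI plugin that implements NetworkPolicy.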
## GPU Scheduling Security
### Node Affinity
Use node labels and affinity rules to ensure sensitive ML workloads (e.g., training on proprietary data) run on dedicated GPU nodes separate from shared workloads.
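A minimal sketch of pinning a sensitive training pod to a dedicated GPU pool; the label key `gpu-pool` and its value are hypothetical:

```yaml
# Hard node-affinity constraint: the scheduler will only place this
# pod on nodes labeled gpu-pool=dedicated-sensitive.
apiVersion: v1
kind: Pod
metadata:
  name: proprietary-training       # hypothetical pod name
  namespace: ml-training
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-pool      # hypothetical node label
                operator: In
                values: ["dedicated-sensitive"]
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 4
```

`requiredDuringScheduling` makes the constraint hard; `preferredDuringScheduling` would allow fallback to shared nodes, which defeats the isolation goal here.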
### Taints and Tolerations
Apply taints to GPU nodes so that only authorized ML workloads with matching tolerations can be scheduled there. This prevents non-ML pods from landing on expensive GPU hardware.
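The taint is applied to the node once by an administrator, and authorized workloads carry the matching toleration. Taint key and value below are hypothetical:

```yaml
# One-time admin step on each GPU node:
#   kubectl taint nodes <node-name> dedicated=ml-gpu:NoSchedule
#
# Pod spec fragment: only pods with this toleration can be
# scheduled onto the tainted GPU nodes.
spec:
  tolerations:
    - key: dedicated       # hypothetical taint key
      operator: Equal
      value: ml-gpu
      effect: NoSchedule
```

Note that tolerations only permit scheduling onto tainted nodes; combine them with node affinity (above) to also prevent the GPU workload from landing elsewhere.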
### Resource Quotas
Set per-namespace GPU quotas to prevent any single team from monopolizing cluster GPU resources. Use priority classes for critical inference workloads.
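A sketch of a per-namespace GPU quota plus a priority class for inference; the quota value and names are hypothetical:

```yaml
# Cap total GPU requests in the ml-training namespace at 8.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training           # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # extended-resource quota syntax
---
# Cluster-scoped priority class so inference preempts lower-priority
# batch training when the cluster is contended.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-inference         # hypothetical name
value: 1000000
globalDefault: false
description: "Priority for latency-sensitive model serving."
```

Inference deployments then reference the class with `priorityClassName: critical-inference` in their pod spec.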
### MIG Support
On supported GPUs (A100, H100), use NVIDIA Multi-Instance GPU (MIG) to partition a physical GPU into isolated instances. Each partition has its own memory and compute, providing hardware-level isolation between tenants sharing the same card.
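With the NVIDIA GPU Operator exposing MIG partitions as named extended resources (the "mixed" MIG strategy), a pod requests a specific slice rather than a whole GPU. The profile name below assumes an A100 partitioned into 1g.5gb instances:

```yaml
# Pod spec fragment: request one 1g.5gb MIG slice instead of a
# full GPU. The resource name is advertised by the device plugin
# when MIG is enabled with the mixed strategy.
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```

Because each MIG instance has dedicated memory and SM partitions, a noisy or compromised neighbor on the same physical GPU cannot read the slice's memory or starve its compute.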
Lilly Tech Systems