Intermediate
KServe: Serverless Model Serving
Deploy models with KServe for serverless autoscaling, scale-to-zero capabilities, canary rollouts, and pre/post-processing transformers on Kubernetes.
Deploying an InferenceService
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kserve-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
```
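Once the InferenceService is ready, it accepts requests in the KServe v1 inference protocol, which wraps feature rows in an `instances` array. A minimal sketch of building such a request; the hostname shown in the comment is an assumption and depends on your cluster's ingress configuration:

```python
import json

def build_v1_request(instances):
    """Wrap feature rows in the KServe v1 inference request format."""
    return json.dumps({"instances": instances})

# Two iris feature rows (sepal/petal measurements).
payload = build_v1_request([[6.8, 2.8, 4.8, 1.4],
                            [6.0, 3.4, 4.5, 1.6]])

# POST this payload to the predict endpoint, e.g. (hypothetical host):
#   http://sklearn-iris.default.example.com/v1/models/sklearn-iris:predict
```

The response mirrors the request shape: a `predictions` array with one entry per instance.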
Canary Deployments
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://models/v2  # New version gets 20% traffic
      resources:
        limits:
          nvidia.com/gpu: 1
```
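Promotion is done by editing the same InferenceService: raising `canaryTrafficPercent` shifts more traffic to the new revision, and removing the field entirely routes 100% of traffic to it. A sketch of the promoted spec:

```yaml
# Promoted spec: canaryTrafficPercent removed, so the latest
# revision (gs://models/v2) now receives all traffic.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://models/v2
      resources:
        limits:
          nvidia.com/gpu: 1
```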
Model Transformers
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bert-service
spec:
  transformer:
    containers:
      - name: tokenizer
        image: my-registry/bert-tokenizer:v1
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://models/bert-v1
      resources:
        limits:
          nvidia.com/gpu: 1
```
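The transformer sits in front of the predictor: KServe routes each request through the transformer's preprocess step, forwards the result to the predictor, then routes the predictor's response back through the postprocess step. Real transformers subclass the KServe Python SDK's model class; the dependency-free sketch below only illustrates the data flow, and the tiny vocabulary and label names are made-up illustration values:

```python
# Hypothetical toy vocabulary standing in for a real BERT tokenizer.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "good": 2, "movie": 3}

def preprocess(inputs):
    """Turn raw text instances into token-id lists for the predictor."""
    return {
        "instances": [
            [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]
            for text in inputs["instances"]
        ]
    }

def postprocess(response):
    """Map predictor class indices back to human-readable labels."""
    labels = ["negative", "positive"]  # illustration values
    return {"predictions": [labels[i] for i in response["predictions"]]}

processed = preprocess({"instances": ["good movie"]})
# processed == {"instances": [[2, 3]]}
```

This separation lets the tokenizer run on cheap CPU pods while the model itself occupies the GPU.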
Scale to zero: KServe can scale model pods to zero when there are no requests, saving GPU costs. When a new request arrives, Knative activates the pod (cold start ~10-30 seconds). Disable this for latency-sensitive models.
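Scale-to-zero is controlled per component with `minReplicas`; a short sketch of the relevant predictor fields:

```yaml
spec:
  predictor:
    minReplicas: 0   # allow scale-to-zero (accept cold starts)
    # minReplicas: 1 # keep one pod warm for latency-sensitive models
    maxReplicas: 3
```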
When to use KServe: KServe is ideal when you need standardized model serving with built-in autoscaling, canary deployments, and multi-framework support. It integrates well with Kubeflow for end-to-end MLOps pipelines.
Lilly Tech Systems