Intermediate

KServe: Serverless Model Serving

Deploy models with KServe for serverless autoscaling, scale-to-zero capabilities, canary rollouts, and pre/post-processing transformers on Kubernetes.

Deploying an InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kserve-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"

Canary Deployments

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://models/v2  # New version gets 20% traffic
      resources:
        limits:
          nvidia.com/gpu: 1
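
When this updated spec is applied to an existing InferenceService, the previously rolled-out revision keeps serving the remaining 80% of requests. A rough sketch of checking the split and then promoting the canary (isvc is the short name for InferenceService; the merge patch below assumes a recent KServe release):

# The PREV and LATEST columns show the traffic percentage per revision
kubectl get isvc my-model

# Promote the canary by routing all traffic to the latest revision
kubectl patch isvc my-model --type merge \
  -p '{"spec": {"predictor": {"canaryTrafficPercent": 100}}}'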

Model Transformers

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bert-service
spec:
  transformer:
    containers:
    - name: tokenizer
      image: my-registry/bert-tokenizer:v1
      resources:
        limits:
          cpu: "2"
          memory: "4Gi"
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: gs://models/bert-v1
      resources:
        limits:
          nvidia.com/gpu: 1
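
With the transformer in front of the predictor, clients keep calling the same :predict endpoint but can send raw text; the tokenizer container pre-processes each request into tensors for the PyTorch model and post-processes the response. A sketch of such a request, assuming the tokenizer image accepts raw strings under the v1 instances key (the exact payload shape depends on how that image is written):

# Send raw text; the transformer handles tokenization before the predictor is called
SERVICE_HOSTNAME=$(kubectl get inferenceservice bert-service -o jsonpath='{.status.url}' | cut -d '/' -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bert-service:predict" \
  -d '{"instances": ["KServe makes model serving on Kubernetes easier."]}'
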
💡 Scale to zero: KServe can scale model pods to zero when there are no requests, saving GPU costs. When a new request arrives, Knative activates a pod again, with a cold start of roughly 10-30 seconds. Disable this for latency-sensitive models, as sketched below.
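
One way to keep a pod warm for latency-sensitive models is to set minReplicas to at least 1 on the predictor, shown here as a merge patch against the sklearn-iris service above (the same field can be set directly in the manifest):

# minReplicas >= 1 keeps a pod running, trading idle cost for zero cold starts
kubectl patch isvc sklearn-iris --type merge \
  -p '{"spec": {"predictor": {"minReplicas": 1}}}'
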
When to use KServe: KServe is ideal when you need standardized model serving with built-in autoscaling, canary deployments, and multi-framework support. It integrates well with Kubeflow for end-to-end MLOps pipelines.