
Model Serving as Microservices

Package and deploy ML models as production-ready microservices with optimized inference, proper versioning, and intelligent GPU resource management.

Model Serving Frameworks

Several frameworks simplify packaging models as microservices with production-grade features:

Framework               | Strengths                        | Best For
TensorFlow Serving      | High performance, gRPC native    | TensorFlow models in production
Triton Inference Server | Multi-framework, GPU optimized   | Multi-model serving on NVIDIA GPUs
TorchServe              | PyTorch native, easy packaging   | PyTorch models with custom handlers
Seldon Core             | Kubernetes native, ML pipelines  | Complex inference graphs on K8s
BentoML                 | Framework agnostic, easy packaging | Quick model-to-service packaging

Service Architecture Patterns

  1. Single Model per Service

    Each microservice wraps exactly one model. Provides the cleanest separation of concerns and simplest deployment lifecycle, but increases infrastructure overhead.

  2. Multi-Model Service

    A single service hosts multiple related models that share preprocessing logic or hardware. Reduces infrastructure cost but couples model lifecycles.

  3. Model Ensemble Service

    Combines predictions from multiple models into a single response. Implements voting, averaging, or cascading strategies for improved accuracy.

  4. Inference Pipeline Service

    Chains preprocessing, feature extraction, prediction, and postprocessing as a directed acyclic graph within a single service boundary.
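The ensemble pattern above is straightforward to sketch in code. The following is a minimal, framework-free illustration (the lambda "models" stand in for real inference calls, which would normally be remote requests or loaded model objects):

```python
import statistics

class EnsembleService:
    """Combines predictions from several models into one response."""

    def __init__(self, models):
        self.models = models  # callables mapping features -> score

    def predict_average(self, features):
        # Averaging strategy: mean of all model scores.
        return statistics.mean(m(features) for m in self.models)

    def predict_vote(self, features, threshold=0.5):
        # Voting strategy: majority of binary decisions.
        votes = [m(features) >= threshold for m in self.models]
        return sum(votes) > len(votes) / 2

ensemble = EnsembleService([
    lambda x: 0.9,   # stand-ins for real model inference calls
    lambda x: 0.6,
    lambda x: 0.2,
])
print(ensemble.predict_average(None))  # ~0.567
print(ensemble.predict_vote(None))     # True (two of three vote >= 0.5)
```

A cascading strategy would instead call models in order of cost and return early once a cheap model is sufficiently confident.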

Performance Tip: Use asynchronous request processing with request queues to maximize GPU utilization. Dynamic batching can increase throughput by 3-5x while keeping latency within acceptable bounds.
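To make the dynamic-batching idea concrete, here is a simplified sketch using only the standard library: callers submit individual requests, and a background loop groups them into batches (up to a size cap or a short deadline) before invoking a single batched inference function. A production server like Triton implements this far more efficiently, and `batch_fn`, `max_batch`, and `max_wait_s` are names chosen for this illustration:

```python
import queue
import threading
import time

class DynamicBatcher:
    """Collects requests into batches so one forward pass serves many callers."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.01):
        self.batch_fn = batch_fn          # processes a list of inputs at once
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        # Each caller gets an event and a slot for its result.
        done, result = threading.Event(), {}
        self.requests.put((item, done, result))
        done.wait()
        return result["value"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            # One batched call replaces len(batch) individual ones.
            outputs = self.batch_fn([item for item, _, _ in batch])
            for (_, done, result), out in zip(batch, outputs):
                result["value"] = out
                done.set()

batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
print(batcher.submit(21))  # 42
```

The throughput gain comes from amortizing per-call overhead (kernel launches, memory transfers) across the batch; the `max_wait_s` deadline bounds the latency cost of waiting for batch-mates.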

GPU Resource Management

GPU Sharing

Run multiple models on a single GPU using time-slicing or MPS (Multi-Process Service) to improve utilization for smaller models.

GPU Pooling

Maintain a shared pool of GPU instances that services can request from dynamically, avoiding dedicated GPU allocation per service.
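In miniature, a GPU pool is a blocking queue of device ids with guaranteed return. This sketch assumes devices are identified by integer ids; a real pool would live behind a scheduler or cluster API rather than in-process:

```python
import queue
from contextlib import contextmanager

class GpuPool:
    """A shared pool of GPU ids that services borrow and return dynamically."""

    def __init__(self, gpu_ids):
        self._free = queue.Queue()
        for gpu_id in gpu_ids:
            self._free.put(gpu_id)

    @contextmanager
    def acquire(self, timeout=None):
        # Blocks until a GPU is free, then guarantees it is returned.
        gpu_id = self._free.get(timeout=timeout)
        try:
            yield gpu_id
        finally:
            self._free.put(gpu_id)

pool = GpuPool([0, 1])
with pool.acquire() as gpu:
    print(f"running inference on GPU {gpu}")
# The GPU id is back in the pool here, even if inference raised.
```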

Fractional GPUs

Allocate fractions of GPU memory and compute to services using MIG (Multi-Instance GPU) or virtual GPU technologies.

CPU Fallback

Design services to fall back gracefully to CPU inference during GPU shortages, using CPU-optimized runtimes such as ONNX Runtime or OpenVINO.
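The fallback logic itself can be a thin wrapper around two model handles. In this sketch, `gpu_available` is a hypothetical probe and the lambda models are placeholders for real GPU and CPU inference sessions:

```python
def gpu_available():
    # Hypothetical probe; a real service might check torch.cuda.is_available(),
    # query nvidia-smi, or inspect ONNX Runtime's available providers.
    return False

def predict(features, gpu_model=None, cpu_model=None):
    """Prefer the GPU model, but degrade gracefully to a CPU-optimized one."""
    if gpu_model is not None and gpu_available():
        try:
            return gpu_model(features)
        except RuntimeError:
            pass  # e.g. out-of-memory during a GPU shortage
    # CPU path, e.g. an ONNX Runtime session compiled for CPU.
    return cpu_model(features)

result = predict([1.0], gpu_model=lambda f: "gpu", cpu_model=lambda f: "cpu")
print(result)  # "cpu", since gpu_available() returns False here
```

A useful refinement is to emit a metric whenever the CPU path is taken, so sustained fallback shows up in monitoring rather than only as a latency regression.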

Model Versioning Strategies

Managing multiple model versions is essential for safe deployments and experimentation:

  • URL-based Versioning: Route to model versions via URL paths like /v1/predict and /v2/predict for explicit version selection
  • Header-based Versioning: Use custom headers to specify model version, keeping URLs clean while supporting version pinning
  • Traffic-based Routing: Route traffic percentages to different versions for canary testing without client-side changes
  • Shadow Mode: Run new versions alongside production, comparing outputs without affecting users, to validate before promotion
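Two of these strategies — URL-based versioning and traffic-based canary routing — compose naturally in one router. The sketch below is illustrative only; the class and parameter names are invented, and in production this logic usually lives in a gateway or service mesh rather than application code:

```python
import random

class VersionRouter:
    """Routes /v1/predict-style paths to model versions, with a canary split."""

    def __init__(self, models, traffic_weights):
        self.models = models                    # e.g. {"v1": model_a, "v2": model_b}
        self.traffic_weights = traffic_weights  # e.g. {"v1": 0.9, "v2": 0.1}

    def route(self, path):
        # URL-based versioning: /v2/predict pins the caller to v2 explicitly.
        version = path.strip("/").split("/")[0]
        if version in self.models:
            return version
        # Traffic-based routing: unpinned callers are split by weight (canary).
        versions = list(self.traffic_weights)
        weights = [self.traffic_weights[v] for v in versions]
        return random.choices(versions, weights=weights)[0]

router = VersionRouter(
    models={"v1": object(), "v2": object()},
    traffic_weights={"v1": 0.9, "v2": 0.1},
)
print(router.route("/v2/predict"))  # "v2" — explicit pin wins
print(router.route("/predict"))    # "v1" about 90% of the time
```

Shadow mode would extend `route` to also invoke the new version asynchronously and log its output, while still returning only the production version's response.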
💡 Looking Ahead: In the next lesson, we will explore orchestration patterns for coordinating multiple AI microservices, including service mesh, saga patterns, and complex inference pipelines.