
Model Serving as Microservices

Package and deploy ML models as production-ready microservices with optimized inference, proper versioning, and intelligent GPU resource management.

Model Serving Frameworks

Several frameworks simplify packaging models as microservices with production-grade features:

Framework               | Strengths                        | Best For
TensorFlow Serving      | High performance, gRPC native    | TensorFlow models in production
Triton Inference Server | Multi-framework, GPU optimized   | Multi-model serving on NVIDIA GPUs
TorchServe              | PyTorch native, easy packaging   | PyTorch models with custom handlers
Seldon Core             | Kubernetes native, ML pipelines  | Complex inference graphs on K8s
BentoML                 | Framework agnostic, easy packaging | Quick model-to-service packaging

Service Architecture Patterns

  1. Single Model per Service

    Each microservice wraps exactly one model. Provides the cleanest separation of concerns and simplest deployment lifecycle, but increases infrastructure overhead.

  2. Multi-Model Service

    A single service hosts multiple related models that share preprocessing logic or hardware. Reduces infrastructure cost but couples model lifecycles.

  3. Model Ensemble Service

    Combines predictions from multiple models into a single response. Implements voting, averaging, or cascading strategies for improved accuracy.

  4. Inference Pipeline Service

    Chains preprocessing, feature extraction, prediction, and postprocessing as a directed acyclic graph within a single service boundary.
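The ensemble pattern above is straightforward to sketch in code. The following is a minimal, framework-free illustration (the lambda "models" stand in for real inference calls, which would normally be remote requests or loaded model objects):

```python
import statistics

class EnsembleService:
    """Combines predictions from several models into one response."""

    def __init__(self, models):
        self.models = models  # callables mapping features -> score

    def predict_average(self, features):
        # Averaging strategy: mean of all model scores.
        return statistics.mean(m(features) for m in self.models)

    def predict_vote(self, features, threshold=0.5):
        # Voting strategy: majority of binary decisions.
        votes = [m(features) >= threshold for m in self.models]
        return sum(votes) > len(votes) / 2

ensemble = EnsembleService([
    lambda x: 0.9,   # stand-ins for real model inference calls
    lambda x: 0.6,
    lambda x: 0.2,
])
print(ensemble.predict_average(None))  # ~0.567
print(ensemble.predict_vote(None))     # True (two of three vote >= 0.5)
```

A cascading strategy would instead call models in order of cost and return early once a cheap model is sufficiently confident.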

Performance Tip: Use asynchronous request processing with request queues to maximize GPU utilization. Dynamic batching can increase throughput by 3-5x while keeping latency within acceptable bounds.
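To make the dynamic-batching idea concrete, here is a simplified sketch using only the standard library: callers submit individual requests, and a background loop groups them into batches (up to a size cap or a short deadline) before invoking a single batched inference function. A production server like Triton implements this far more efficiently, and `batch_fn`, `max_batch`, and `max_wait_s` are names chosen for this illustration:

```python
import queue
import threading
import time

class DynamicBatcher:
    """Collects requests into batches so one forward pass serves many callers."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.01):
        self.batch_fn = batch_fn          # processes a list of inputs at once
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        # Each caller gets an event and a slot for its result.
        done, result = threading.Event(), {}
        self.requests.put((item, done, result))
        done.wait()
        return result["value"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            # One batched call replaces len(batch) individual ones.
            outputs = self.batch_fn([item for item, _, _ in batch])
            for (_, done, result), out in zip(batch, outputs):
                result["value"] = out
                done.set()

batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
print(batcher.submit(21))  # 42
```

The throughput gain comes from amortizing per-call overhead (kernel launches, memory transfers) across the batch; the `max_wait_s` deadline bounds the latency cost of waiting for batch-mates.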

GPU Resource Management

GPU Sharing

Run multiple models on a single GPU using time-slicing or MPS (Multi-Process Service) to improve utilization for smaller models.

GPU Pooling

Maintain a shared pool of GPU instances that services can request from dynamically, avoiding dedicated GPU allocation per service.
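In miniature, a GPU pool is a blocking queue of device ids with guaranteed return. This sketch assumes devices are identified by integer ids; a real pool would live behind a scheduler or cluster API rather than in-process:

```python
import queue
from contextlib import contextmanager

class GpuPool:
    """A shared pool of GPU ids that services borrow and return dynamically."""

    def __init__(self, gpu_ids):
        self._free = queue.Queue()
        for gpu_id in gpu_ids:
            self._free.put(gpu_id)

    @contextmanager
    def acquire(self, timeout=None):
        # Blocks until a GPU is free, then guarantees it is returned.
        gpu_id = self._free.get(timeout=timeout)
        try:
            yield gpu_id
        finally:
            self._free.put(gpu_id)

pool = GpuPool([0, 1])
with pool.acquire() as gpu:
    print(f"running inference on GPU {gpu}")
# The GPU id is back in the pool here, even if inference raised.
```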

Fractional GPUs

Allocate fractions of GPU memory and compute to services using MIG (Multi-Instance GPU) or virtual GPU technologies.

CPU Fallback

Design services to fall back gracefully to CPU inference during GPU shortages, using CPU-optimized runtimes such as ONNX Runtime or OpenVINO.
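The fallback logic itself can be a thin wrapper around two model handles. In this sketch, `gpu_available` is a hypothetical probe and the lambda models are placeholders for real GPU and CPU inference sessions:

```python
def gpu_available():
    # Hypothetical probe; a real service might check torch.cuda.is_available(),
    # query nvidia-smi, or inspect ONNX Runtime's available providers.
    return False

def predict(features, gpu_model=None, cpu_model=None):
    """Prefer the GPU model, but degrade gracefully to a CPU-optimized one."""
    if gpu_model is not None and gpu_available():
        try:
            return gpu_model(features)
        except RuntimeError:
            pass  # e.g. out-of-memory during a GPU shortage
    # CPU path, e.g. an ONNX Runtime session compiled for CPU.
    return cpu_model(features)

result = predict([1.0], gpu_model=lambda f: "gpu", cpu_model=lambda f: "cpu")
print(result)  # "cpu", since gpu_available() returns False here
```

A useful refinement is to emit a metric whenever the CPU path is taken, so sustained fallback shows up in monitoring rather than only as a latency regression.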

Model Versioning Strategies

Managing multiple model versions is essential for safe deployments and experimentation:

  • URL-based Versioning: Route to model versions via URL paths like /v1/predict and /v2/predict for explicit version selection
  • Header-based Versioning: Use custom headers to specify model version, keeping URLs clean while supporting version pinning
  • Traffic-based Routing: Route traffic percentages to different versions for canary testing without client-side changes
  • Shadow Mode: Run new versions alongside production, comparing outputs without affecting users, to validate before promotion
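Two of these strategies — URL-based versioning and traffic-based canary routing — compose naturally in one router. The sketch below is illustrative only; the class and parameter names are invented, and in production this logic usually lives in a gateway or service mesh rather than application code:

```python
import random

class VersionRouter:
    """Routes /v1/predict-style paths to model versions, with a canary split."""

    def __init__(self, models, traffic_weights):
        self.models = models                    # e.g. {"v1": model_a, "v2": model_b}
        self.traffic_weights = traffic_weights  # e.g. {"v1": 0.9, "v2": 0.1}

    def route(self, path):
        # URL-based versioning: /v2/predict pins the caller to v2 explicitly.
        version = path.strip("/").split("/")[0]
        if version in self.models:
            return version
        # Traffic-based routing: unpinned callers are split by weight (canary).
        versions = list(self.traffic_weights)
        weights = [self.traffic_weights[v] for v in versions]
        return random.choices(versions, weights=weights)[0]

router = VersionRouter(
    models={"v1": object(), "v2": object()},
    traffic_weights={"v1": 0.9, "v2": 0.1},
)
print(router.route("/v2/predict"))  # "v2" — explicit pin wins
print(router.route("/predict"))    # "v1" about 90% of the time
```

Shadow mode would extend `route` to also invoke the new version asynchronously and log its output, while still returning only the production version's response.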
💡 Looking Ahead: In the next lesson, we will explore orchestration patterns for coordinating multiple AI microservices, including service mesh, saga patterns, and complex inference pipelines.