# Model Serving as Microservices
Package and deploy ML models as production-ready microservices with optimized inference, proper versioning, and intelligent GPU resource management.
## Model Serving Frameworks
Several frameworks simplify packaging models as microservices with production-grade features:
| Framework | Strengths | Best For |
|---|---|---|
| TensorFlow Serving | High performance, gRPC native | TensorFlow models in production |
| Triton Inference Server | Multi-framework, GPU optimized | Multi-model serving on NVIDIA GPUs |
| TorchServe | PyTorch native, easy packaging | PyTorch models with custom handlers |
| Seldon Core | Kubernetes native, ML pipelines | Complex inference graphs on K8s |
| BentoML | Framework agnostic, easy packaging | Quick model-to-service packaging |
## Service Architecture Patterns

### Single Model per Service
Each microservice wraps exactly one model. This provides the cleanest separation of concerns and the simplest deployment lifecycle, but increases infrastructure overhead as the number of models grows.
### Multi-Model Service
A single service hosts multiple related models that share preprocessing logic or hardware. Reduces infrastructure cost but couples model lifecycles.
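As a minimal sketch of this pattern (the service class, model names, and `shared_preprocess` function are all hypothetical), several models can be registered behind one endpoint so preprocessing runs once per request:

```python
from typing import Callable

def shared_preprocess(raw: str) -> list[float]:
    # Stand-in for shared preprocessing (tokenization, normalization, etc.):
    # here we just map each whitespace token to its length.
    return [float(len(tok)) for tok in raw.split()]

class MultiModelService:
    """Hosts several related models behind one service, sharing preprocessing."""

    def __init__(self) -> None:
        self._models: dict[str, Callable[[list[float]], float]] = {}

    def register(self, name: str, model: Callable[[list[float]], float]) -> None:
        self._models[name] = model

    def predict(self, model_name: str, raw: str) -> float:
        features = shared_preprocess(raw)  # computed once, reusable by any model
        return self._models[model_name](features)

# Toy "models" standing in for real inference backends.
service = MultiModelService()
service.register("sum_model", lambda feats: sum(feats))
service.register("max_model", lambda feats: max(feats))
```

The coupling mentioned above is visible here: redeploying this service to update one model necessarily redeploys them all.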
### Model Ensemble Service
Combines predictions from multiple models into a single response. Implements voting, averaging, or cascading strategies for improved accuracy.
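The three combination strategies can be sketched in a few lines of plain Python (the model callables and the 0.8 confidence threshold are illustrative assumptions, not prescribed values):

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence

Features = list[float]

def average_ensemble(models: Sequence[Callable[[Features], float]],
                     features: Features) -> float:
    # Averaging: combine regression scores into one response.
    return mean(m(features) for m in models)

def voting_ensemble(models: Sequence[Callable[[Features], str]],
                    features: Features) -> str:
    # Voting: return the majority class label.
    votes = Counter(m(features) for m in models)
    return votes.most_common(1)[0][0]

def cascade_ensemble(fast_model, slow_model, features: Features,
                     threshold: float = 0.8):
    # Cascading: try a cheap model first; escalate only low-confidence inputs.
    label, confidence = fast_model(features)
    if confidence >= threshold:
        return label
    return slow_model(features)[0]
```

Cascading is often the most cost-effective of the three, since the expensive model only runs on the hard cases.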
### Inference Pipeline Service
Chains preprocessing, feature extraction, prediction, and postprocessing as a directed acyclic graph within a single service boundary.
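For illustration, here is the simplest case of that graph, a linear chain (the stage functions and the toy "model" are hypothetical stand-ins; a real DAG would allow branching and merging between stages):

```python
from functools import reduce
from typing import Callable

def preprocess(raw: str) -> list[str]:
    return raw.lower().split()

def extract_features(tokens: list[str]) -> list[float]:
    return [float(len(t)) for t in tokens]

def predict(features: list[float]) -> float:
    # Stand-in model: average token length.
    return sum(features) / len(features)

def postprocess(score: float) -> dict:
    return {"score": score, "label": "long" if score > 3 else "short"}

# The service composes the stages in order inside one service boundary.
PIPELINE: list[Callable] = [preprocess, extract_features, predict, postprocess]

def run_pipeline(raw: str) -> dict:
    return reduce(lambda value, stage: stage(value), PIPELINE, raw)
```

Keeping each stage a pure function makes individual stages easy to test and swap without touching the rest of the pipeline.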
## GPU Resource Management

### GPU Sharing
Run multiple models on a single GPU using time-slicing or MPS (Multi-Process Service) to improve utilization for smaller models.
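Time-slicing in miniature can be sketched with a lock standing in for the single GPU (the worker function and "kernel launch" are hypothetical; real time-slicing and MPS are handled by the NVIDIA driver, not application code):

```python
import threading

gpu_lock = threading.Lock()  # stand-in for one shared GPU
results: list[tuple[str, float]] = []

def infer_on_shared_gpu(model_name: str, batch: list[float]) -> None:
    # Only one model occupies the "GPU" at a time; others wait their slice.
    with gpu_lock:
        results.append((model_name, sum(batch)))  # pretend kernel launch

threads = [
    threading.Thread(target=infer_on_shared_gpu, args=(f"model-{i}", [1.0, 2.0]))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The trade-off mirrored here is real: sharing raises utilization but serializes access, so per-request latency grows under contention.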
### GPU Pooling
Maintain a shared pool of GPU instances that services can request from dynamically, avoiding dedicated GPU allocation per service.
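A pool like this can be approximated with a blocking queue of device handles (the `GPUPool` class and integer device IDs are illustrative assumptions; production pooling is usually done by a cluster scheduler):

```python
import queue
from contextlib import contextmanager

class GPUPool:
    """Shared pool of GPU handles that services borrow and return."""

    def __init__(self, gpu_ids: list[int]) -> None:
        self._free: queue.Queue[int] = queue.Queue()
        for gid in gpu_ids:
            self._free.put(gid)

    @contextmanager
    def acquire(self, timeout: float = 5.0):
        gid = self._free.get(timeout=timeout)  # block until a GPU frees up
        try:
            yield gid
        finally:
            self._free.put(gid)  # always return the handle to the pool

pool = GPUPool([0, 1])
with pool.acquire() as gpu_id:
    pass  # run inference on device gpu_id here
```

The context manager guarantees handles return to the pool even when inference raises, which is the main failure mode of hand-rolled pooling.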
### Fractional GPUs
Allocate fractions of GPU memory and compute to services using MIG (Multi-Instance GPU) or virtual GPU technologies.
### CPU Fallback

Design services to gracefully fall back to CPU inference during GPU shortages, for example by serving an optimized ONNX Runtime or quantized model variant on CPU.
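The fallback logic itself is simple; here is a sketch where both predict functions are hypothetical stand-ins (a real service would wrap, say, a GPU session and a CPU-optimized ONNX Runtime session):

```python
def gpu_predict(features: list[float]) -> float:
    # Stand-in GPU path; raises when no GPU is available.
    raise RuntimeError("no GPU available")

def cpu_predict(features: list[float]) -> float:
    # Stand-in CPU path (e.g. an ONNX Runtime session in practice).
    return sum(features)

def predict_with_fallback(features: list[float]) -> tuple[float, str]:
    try:
        return gpu_predict(features), "gpu"
    except RuntimeError:
        # Graceful degradation: serve from CPU rather than failing the request.
        return cpu_predict(features), "cpu"
```

Returning which backend served the request (the `"gpu"`/`"cpu"` tag) is useful for monitoring how often the fallback path is exercised.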
## Model Versioning Strategies
Managing multiple model versions is essential for safe deployments and experimentation:
- URL-based Versioning: Route to model versions via URL paths like `/v1/predict` and `/v2/predict` for explicit version selection
- Header-based Versioning: Use custom headers to specify the model version, keeping URLs clean while supporting version pinning
- Traffic-based Routing: Route traffic percentages to different versions for canary testing without client-side changes
- Shadow Mode: Run new versions alongside production, comparing outputs without affecting users, to validate before promotion
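The first three strategies can be combined in one resolution function. This sketch assumes hypothetical version names, an `X-Model-Version` header, and a 90/10 canary split; none of these are prescribed by any particular framework:

```python
import random

MODELS = {"v1": lambda x: x * 2, "v2": lambda x: x * 3}  # toy model versions
TRAFFIC_SPLIT = {"v1": 0.9, "v2": 0.1}  # canary: 10% of traffic to v2

def resolve_version(path: str, headers: dict[str, str]) -> str:
    # 1. URL-based versioning takes priority: /v1/predict, /v2/predict.
    for version in MODELS:
        if path.startswith(f"/{version}/"):
            return version
    # 2. Header-based pinning keeps URLs clean.
    pinned = headers.get("X-Model-Version")
    if pinned in MODELS:
        return pinned
    # 3. Otherwise, weighted random choice implements traffic-based routing.
    versions, weights = zip(*TRAFFIC_SPLIT.items())
    return random.choices(versions, weights=weights)[0]

def predict(path: str, headers: dict[str, str], x: float) -> float:
    return MODELS[resolve_version(path, headers)](x)
```

Shadow mode is deliberately absent from the router: it is implemented by invoking the new version asynchronously after resolution and logging its output, never returning it to the caller.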
Lilly Tech Systems