Intermediate
Google Cloud Run for AI Inference
Serve machine learning models on Cloud Run with GPU acceleration, concurrent request handling, minimum instances, and seamless Vertex AI integration.
Why Cloud Run for ML?
Cloud Run is uniquely positioned for AI inference because it supports GPU acceleration, handles multiple concurrent requests per instance, and allows up to 32 GB of memory. Unlike pure function-as-a-service platforms, Cloud Run runs full containers, giving you complete control over your inference runtime.
Key advantage: Cloud Run supports NVIDIA L4 GPUs in serverless mode. This means you can serve GPU-accelerated models that scale to zero when idle, paying only for actual inference time. This is a game-changer for bursty GPU inference workloads.
Deploying an ML Model on Cloud Run
Python - FastAPI Inference Server
from fastapi import FastAPI
import torch
from transformers import pipeline

app = FastAPI()

# Load model at startup (shared across requests)
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1,
)

@app.post("/predict")
async def predict(request: dict):
    result = classifier(request["text"])
    return {"prediction": result}
gcloud - Deploy with GPU
# Deploy to Cloud Run with GPU
gcloud run deploy ml-inference \
--image gcr.io/my-project/ml-inference:latest \
--gpu 1 \
--gpu-type nvidia-l4 \
--memory 16Gi \
--cpu 8 \
--concurrency 10 \
--min-instances 0 \
--max-instances 20 \
--port 8080 \
--region us-central1
Concurrency Optimization
Unlike Lambda, which processes one request per execution environment, Cloud Run can handle many concurrent requests per instance (up to 1,000). Cloud Run doesn't batch requests for you, but because overlapping requests share one process, your server can group them into batches for more efficient GPU utilization.
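One way to exploit this is server-side micro-batching. The sketch below (a hypothetical helper, not a Cloud Run or FastAPI API) shows the idea with plain asyncio: concurrent requests enqueue their inputs, and a background worker drains the queue so the model sees one batched call.

```python
import asyncio

class MicroBatcher:
    def __init__(self, model_fn, max_batch=10, max_wait_s=0.01):
        self.model_fn = model_fn      # callable that takes a list of inputs
        self.max_batch = max_batch    # align with --concurrency
        self.max_wait_s = max_wait_s  # max time to spend filling a batch
        self.queue = asyncio.Queue()

    async def predict(self, item):
        # Each request parks on a future until the worker resolves it
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def worker(self):
        while True:
            item, fut = await self.queue.get()
            batch, futures = [item], [fut]
            # Keep filling until the batch is full or the wait budget expires
            try:
                while len(batch) < self.max_batch:
                    item, fut = await asyncio.wait_for(
                        self.queue.get(), timeout=self.max_wait_s)
                    batch.append(item)
                    futures.append(fut)
            except asyncio.TimeoutError:
                pass
            # One model call for the whole batch; fan results back out
            for f, result in zip(futures, self.model_fn(batch)):
                f.set_result(result)

async def demo():
    # Stand-in "model": uppercases a batch of strings
    batcher = MicroBatcher(lambda xs: [x.upper() for x in xs])
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.predict(t) for t in ["a", "b", "c"]))
    task.cancel()
    return results

print(asyncio.run(demo()))  # ['A', 'B', 'C']
```

In a real service you would replace the lambda with your `classifier` pipeline and call `batcher.predict(...)` from the `/predict` handler; the trade-off is a small added latency (`max_wait_s`) in exchange for better GPU throughput.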
Cloud Run vs Cloud Functions for ML
| Feature | Cloud Run | Cloud Functions |
|---|---|---|
| GPU Support | Yes (L4) | No |
| Max Memory | 32 GB | 8 GB (1st gen), 32 GB (2nd gen) |
| Concurrency | Up to 1,000 per instance | 1 per instance (1st gen); configurable in 2nd gen |
| Container Support | Full Docker | Source-based or Docker |
| Startup Time | Slower (full container) | Faster (lightweight) |
Best practice: Set --concurrency based on your model's memory usage per request. For GPU models, start with concurrency equal to your batch size. For CPU models, set it to match your vCPU count. Monitor and tune based on latency percentiles.
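As a starting point before measuring, a back-of-envelope estimate can translate memory numbers into a concurrency value. The figures and the helper below are illustrative assumptions, not measured values or a Cloud Run formula:

```python
def estimate_concurrency(instance_mem_gib, model_mem_gib, per_request_gib):
    """Rough first guess for --concurrency from memory budgets.

    Reserve 2 GiB for the runtime and OS, subtract the resident model
    weights, then divide the remainder by peak per-request memory.
    """
    usable = instance_mem_gib - model_mem_gib - 2
    return max(1, int(usable // per_request_gib))

# 16 GiB instance, 4 GiB of model weights, ~1 GiB peak per request
print(estimate_concurrency(16, 4, 1.0))  # 10
```

Treat the result only as an initial setting; load-test and adjust it against p95/p99 latency, since actual per-request memory varies with input size.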
Lilly Tech Systems