# Reliability & Fault Tolerance
AI models fail in ways that traditional software does not: they time out under load, produce confidently wrong answers, degrade silently as data drifts, and crash when they encounter inputs outside their training distribution. This lesson covers the patterns that keep AI services running when things go wrong.
## Graceful Degradation When Models Fail
The goal is not to prevent all failures — that is impossible. The goal is to ensure that when a component fails, the system continues to provide value, even if at reduced quality.
### Tier 1: Full AI (Normal Operation)

Primary model serving real-time personalized predictions. Best quality, highest cost. Example: personalized product recommendations using collaborative filtering + deep ranking model.

### Tier 2: Simpler Model (Primary Model Down)

Fallback to a lighter model that is faster and more reliable but less accurate. Example: switch from the deep ranking model to a gradient-boosted tree that uses fewer features. 85% of the quality at 10% of the compute.

### Tier 3: Rule-Based (All Models Down)

Hand-coded business rules that require no ML inference. Example: show the most popular items in the user's country + category. No personalization, but still useful.

### Tier 4: Cached Responses (Everything Down)

Serve stale but previously valid predictions from cache. Example: show the recommendations that were computed for this user yesterday. Better than nothing.
```python
# Production fallback chain implementation
import json

class ResilientPredictor:
    def __init__(self):
        self.primary_model = GPUModelClient("triton:8001", timeout_ms=100)
        self.fallback_model = CPUModelClient("lightgbm-server:8080", timeout_ms=30)
        self.rule_engine = RuleBasedRecommender()
        self.cache = Redis(host="redis:6379")

    async def predict(self, user_id: str, context: dict) -> PredictionResult:
        # Tier 1: Try primary model
        try:
            result = await self.primary_model.predict(user_id, context)
            # Cache successful result for Tier 4 fallback (24h TTL)
            self.cache.setex(f"pred:{user_id}", 86400, json.dumps(result))
            return PredictionResult(items=result, tier="primary", quality="full")
        except (TimeoutError, ModelUnavailableError) as e:
            metrics.increment("fallback.tier1_failed", tags={"error": type(e).__name__})

        # Tier 2: Try simpler model
        try:
            result = await self.fallback_model.predict(user_id, context)
            return PredictionResult(items=result, tier="fallback", quality="reduced")
        except Exception:
            metrics.increment("fallback.tier2_failed")

        # Tier 3: Rule-based
        try:
            result = self.rule_engine.recommend(context.get("category"), context.get("country"))
            return PredictionResult(items=result, tier="rules", quality="generic")
        except Exception:
            metrics.increment("fallback.tier3_failed")

        # Tier 4: Cached response
        cached = self.cache.get(f"pred:{user_id}")
        if cached:
            return PredictionResult(items=json.loads(cached), tier="cache", quality="stale")

        # Tier 5: Static default (absolute last resort)
        return PredictionResult(items=GLOBAL_TOP_20, tier="static", quality="minimal")
```
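The chain above can also be factored into a small generic helper — a sketch, not the production code: try a list of (tier, callable) pairs in order and serve the first result that succeeds. The tier names below follow the class above; the stub functions are illustrative.

```python
import asyncio

async def first_success(attempts):
    """Try each (tier, coroutine-factory) pair in order; return (tier, result)
    from the first one that succeeds, or raise if every tier fails."""
    errors = []
    for tier, attempt in attempts:
        try:
            return tier, await attempt()
        except Exception as e:  # production code should catch narrower exceptions per tier
            errors.append((tier, e))
    raise RuntimeError(f"all tiers failed: {errors}")

# Usage with stub tiers: the primary times out, so the chain falls through to rules.
async def demo():
    async def primary():
        raise TimeoutError("model server timed out")
    async def rules():
        return ["popular-1", "popular-2"]
    return await first_success([("primary", primary), ("rules", rules)])

tier, items = asyncio.run(demo())
print(tier, items)  # rules ['popular-1', 'popular-2']
```

Keeping the ordering in data rather than nested try/except blocks makes it easy to add or reorder tiers without touching the control flow.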
## Circuit Breaker Pattern for AI Services
The circuit breaker prevents cascading failures. When the model server starts failing, the circuit breaker stops sending requests to it (giving it time to recover) and routes traffic to fallbacks instead.
```python
# Circuit breaker for model inference
import asyncio
import time

class CircuitBreaker:
    """
    States:
    - CLOSED: Normal operation, requests go to the model
    - OPEN: Model is failing, requests go to fallback immediately
    - HALF_OPEN: Testing if model has recovered
    """
    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max=3,
                 call_timeout=0.1):
        self.state = "CLOSED"
        self.failure_count = 0
        self.failure_threshold = failure_threshold  # Open after N failures
        self.recovery_timeout = recovery_timeout    # Seconds before trying again
        self.half_open_max = half_open_max          # Test requests in half-open
        self.call_timeout = call_timeout            # Per-call timeout in seconds
        self.last_failure_time = None
        self.half_open_successes = 0

    async def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._should_attempt_reset():
                self.state = "HALF_OPEN"
                self.half_open_successes = 0
            else:
                raise CircuitOpenError(
                    f"Circuit open, retry in {self._time_until_retry()}s"
                )
        try:
            result = await asyncio.wait_for(func(*args, **kwargs),
                                            timeout=self.call_timeout)
            self._on_success()
            return result
        except (asyncio.TimeoutError, ConnectionError, ModelError):
            self._on_failure()
            raise

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            metrics.increment("circuit_breaker.opened")
            alert("Circuit breaker OPENED for model service")

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max:
                self.state = "CLOSED"
                self.failure_count = 0
                metrics.increment("circuit_breaker.closed")
        else:
            self.failure_count = 0

    def _should_attempt_reset(self):
        return time.time() - self.last_failure_time > self.recovery_timeout

    def _time_until_retry(self):
        return round(self.recovery_timeout - (time.time() - self.last_failure_time), 1)

# Usage:
breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

async def get_prediction(user_id):
    try:
        return await breaker.call(primary_model.predict, user_id)
    except CircuitOpenError:
        return await fallback_model.predict(user_id)  # Skip primary entirely
```
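The state transitions are the part most worth unit-testing, and they don't need async machinery to verify. Below is a minimal synchronous sketch of the same CLOSED → OPEN → HALF_OPEN machine (class name and thresholds are illustrative), driven directly with call outcomes:

```python
import time

class MiniBreaker:
    """Synchronous sketch of the circuit-breaker state machine, for unit tests."""
    def __init__(self, failure_threshold=2, recovery_timeout=0.0, half_open_max=1):
        self.state = "CLOSED"
        self.failures = 0
        self.half_open_successes = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max = half_open_max
        self.last_failure = None

    def record(self, ok: bool):
        """Feed one call outcome and step the state machine."""
        # OPEN -> HALF_OPEN once the recovery timeout has elapsed
        if self.state == "OPEN" and time.time() - self.last_failure >= self.recovery_timeout:
            self.state, self.half_open_successes = "HALF_OPEN", 0
        if ok:
            if self.state == "HALF_OPEN":
                self.half_open_successes += 1
                if self.half_open_successes >= self.half_open_max:
                    self.state, self.failures = "CLOSED", 0
            else:
                self.failures = 0
        else:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "OPEN"

b = MiniBreaker(failure_threshold=2, recovery_timeout=0.0)
b.record(False); b.record(False)   # two consecutive failures -> OPEN
opened = b.state
b.record(True)                     # timeout elapsed -> HALF_OPEN, success -> CLOSED
closed = b.state
print(opened, closed)  # OPEN CLOSED
```

Setting `recovery_timeout=0.0` in tests lets the OPEN → HALF_OPEN transition fire immediately instead of sleeping in the test suite.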
## Health Checks for Model Endpoints
AI health checks go beyond "is the server responding?" You need to verify that the model is loaded, warm, and producing reasonable outputs.
```python
# Three-level health check for AI services
import time

import torch
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/live")
async def liveness():
    """Is the process alive? (Kubernetes liveness probe)"""
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    """Is the model loaded and ready to serve? (Kubernetes readiness probe)"""
    checks = {
        "model_loaded": model_server.is_model_loaded(),
        "gpu_available": torch.cuda.is_available(),
        "gpu_memory_ok": torch.cuda.memory_allocated()
            < 0.95 * torch.cuda.get_device_properties(0).total_memory,
        "feature_store_connected": feature_store.ping(),
    }
    all_ok = all(checks.values())
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={"status": "ready" if all_ok else "not_ready", "checks": checks}
    )

@app.get("/health/inference")
async def inference_health():
    """Can the model produce valid predictions? (Deep health check, run every 60s)"""
    try:
        # Run inference on a known test input
        test_input = load_canary_input()  # Pre-defined input with known expected output
        start = time.time()
        result = await model.predict(test_input)
        latency_ms = (time.time() - start) * 1000
        checks = {
            "inference_works": result is not None,
            "output_valid": validate_output_schema(result),
            "latency_ok": latency_ms < 200,  # Under 200ms
            "output_reasonable": is_output_in_expected_range(result, test_input),
            "latency_ms": round(latency_ms, 1),
        }
        all_ok = all(v for k, v in checks.items() if k != "latency_ms")
        return JSONResponse(
            status_code=200 if all_ok else 503,
            content={"status": "healthy" if all_ok else "degraded", "checks": checks}
        )
    except Exception as e:
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "error": str(e)}
        )
```
## Disaster Recovery for ML Systems
ML disaster recovery requires recovering not just code and data, but also model artifacts, feature pipelines, and serving infrastructure.
| Component | Backup Strategy | Recovery Time | Recovery Point |
|---|---|---|---|
| Model artifacts | Stored in S3/GCS with versioning, replicated to second region | 5–15 minutes (download + load) | Latest version (RPO: 0) |
| Feature store (online) | Redis cluster with replication, or multi-AZ DynamoDB | Automatic failover (< 1 min) | Last sync (RPO: seconds) |
| Feature store (offline) | BigQuery/S3 cross-region replication | 30 min – 2 hours | Last replication (RPO: 1–24 hours) |
| Training pipeline | Code in Git, data in versioned storage | 4–8 hours (rebuild from scratch) | Last commit + data version |
| Inference config | GitOps (Kubernetes manifests in Git) | 10–30 minutes (apply manifests) | Last merge to main |
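Whatever the replication setup, the restore step should verify artifact integrity before loading — a corrupt model that loads "successfully" is worse than a failed restore. A sketch, assuming the registry records a SHA-256 checksum alongside each versioned artifact (the manifest format here is hypothetical):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so multi-GB artifacts don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(path: str, manifest: dict) -> bool:
    """Compare a restored artifact against the checksum recorded at training time."""
    return sha256_of(path) == manifest["sha256"]

# Usage: simulate a restored artifact and a manifest written at training time.
with tempfile.NamedTemporaryFile(delete=False, suffix=".onnx") as f:
    f.write(b"fake model weights")
    path = f.name
manifest = {"version": "2024-06-01", "sha256": sha256_of(path)}
ok = verify_artifact(path, manifest)
os.unlink(path)
print(ok)  # True
```

Running the same check against the replica region on a schedule also catches silent replication corruption before a disaster, not during one.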
## SLA Design for AI Services
AI SLAs are harder to define than traditional SLAs because the system can be "up" but producing bad results. You need SLAs that cover both availability and quality.
### Availability SLA

99.9% uptime = 8.76 hours downtime/year. For AI services, "up" means the service returns a response (from any tier). Even if the primary model is down, serving cached or rule-based results counts as "up." 99.5% is more realistic for GPU-dependent services (43.8 hours/year).
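The downtime budgets quoted above fall out of simple arithmetic — hours per year times the allowed failure fraction:

```python
def downtime_budget_hours(sla: float, hours_per_year: float = 8760.0) -> float:
    """Annual downtime allowed by an availability SLA, in hours."""
    return (1.0 - sla) * hours_per_year

print(round(downtime_budget_hours(0.999), 2))  # 8.76
print(round(downtime_budget_hours(0.995), 2))  # 43.8
```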
### Quality SLA
95% of responses from Tier 1 or Tier 2. Allows 5% of responses to come from rules or cache during incidents. Track tier_distribution as a metric. Alert when Tier 1 percentage drops below 90%.
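A sketch of the tier-distribution check, assuming each served response is tagged with the tier that produced it (the function name and tier labels follow the fallback chain above; the thresholds are the ones from this SLA):

```python
from collections import Counter

def quality_sla_report(tiers: list[str]) -> dict:
    """Summarize which tiers served traffic and whether the quality SLA holds."""
    counts = Counter(tiers)
    total = len(tiers)
    high_quality = (counts["primary"] + counts["fallback"]) / total
    return {
        "tier_distribution": {t: counts[t] / total for t in counts},
        "quality_sla_met": high_quality >= 0.95,              # 95% from Tier 1 or 2
        "alert_tier1_low": counts["primary"] / total < 0.90,  # page when Tier 1 < 90%
    }

# 93 primary, 4 fallback, 3 cache: SLA holds (97% high quality), no Tier-1 alert.
report = quality_sla_report(["primary"] * 93 + ["fallback"] * 4 + ["cache"] * 3)
print(report["quality_sla_met"], report["alert_tier1_low"])  # True False
```

In production this would run over a sliding window of responses rather than a static list, but the thresholds are the same.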
### Latency SLA
p99 < 150ms for the primary model path. This SLA applies only when serving from Tier 1. Fallback tiers may have different (usually better) latency characteristics. Include latency SLA in your API documentation.
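On a raw latency sample, p99 can be computed with a nearest-rank percentile — a sketch; production systems usually derive it from histogram metrics (e.g. Prometheus) rather than raw samples:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 synthetic latencies: 99 fast requests and one slow tail request.
latencies_ms = [50.0] * 99 + [400.0]
p99 = percentile(latencies_ms, 99)
print(p99, p99 < 150)  # 50.0 True
```

Note how the single 400 ms outlier sits just beyond p99 of 100 samples — one reason p99 needs a reasonably large window to be meaningful.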
### Freshness SLA
Model retrained within 24 hours of new data. Features updated within 1 hour. If real-time features are stale by more than 5 minutes, switch to batch features. Track feature_staleness as a metric.
Lilly Tech Systems