# Reliability & Fault Tolerance
AI models fail in ways that traditional software does not: they time out under load, produce confidently wrong answers, degrade silently as data drifts, and crash when they encounter inputs outside their training distribution. This lesson covers the patterns that keep AI services running when things go wrong.
## Graceful Degradation When Models Fail
The goal is not to prevent all failures — that is impossible. The goal is to ensure that when a component fails, the system continues to provide value, even if at reduced quality.
### Tier 1: Full AI (Normal Operation)

Primary model serving real-time personalized predictions. Best quality, highest cost. Example: personalized product recommendations using collaborative filtering + deep ranking model.

### Tier 2: Simpler Model (Primary Model Down)

Fallback to a lighter model that is faster and more reliable but less accurate. Example: switch from the deep ranking model to a gradient-boosted tree that uses fewer features. 85% of the quality at 10% of the compute.

### Tier 3: Rule-Based (All Models Down)

Hand-coded business rules that require no ML inference. Example: show the most popular items in the user's country + category. No personalization, but still useful.

### Tier 4: Cached Responses (Everything Down)

Serve stale but previously valid predictions from cache. Example: show the recommendations that were computed for this user yesterday. Better than nothing.
```python
# Production fallback chain implementation
import json

class ResilientPredictor:
    def __init__(self):
        self.primary_model = GPUModelClient("triton:8001", timeout_ms=100)
        self.fallback_model = CPUModelClient("lightgbm-server:8080", timeout_ms=30)
        self.rule_engine = RuleBasedRecommender()
        self.cache = Redis(host="redis:6379")

    async def predict(self, user_id: str, context: dict) -> PredictionResult:
        # Tier 1: Try primary model
        try:
            result = await self.primary_model.predict(user_id, context)
            # Cache successful result for Tier 4 fallback (24h TTL)
            self.cache.setex(f"pred:{user_id}", 86400, json.dumps(result))
            return PredictionResult(items=result, tier="primary", quality="full")
        except (TimeoutError, ModelUnavailableError) as e:
            metrics.increment("fallback.tier1_failed", tags={"error": type(e).__name__})

        # Tier 2: Try simpler model
        try:
            result = await self.fallback_model.predict(user_id, context)
            return PredictionResult(items=result, tier="fallback", quality="reduced")
        except Exception:
            metrics.increment("fallback.tier2_failed")

        # Tier 3: Rule-based
        try:
            result = self.rule_engine.recommend(context.get("category"), context.get("country"))
            return PredictionResult(items=result, tier="rules", quality="generic")
        except Exception:
            metrics.increment("fallback.tier3_failed")

        # Tier 4: Cached response
        cached = self.cache.get(f"pred:{user_id}")
        if cached:
            return PredictionResult(items=json.loads(cached), tier="cache", quality="stale")

        # Tier 5: Static default (absolute last resort)
        return PredictionResult(items=GLOBAL_TOP_20, tier="static", quality="minimal")
```
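The chain above can also be factored into a small generic helper — a sketch, not the production code: try a list of (tier, callable) pairs in order and serve the first result that succeeds. The tier names below follow the class above; the stub functions are illustrative.

```python
import asyncio

async def first_success(attempts):
    """Try each (tier, coroutine-factory) pair in order; return (tier, result)
    from the first one that succeeds, or raise if every tier fails."""
    errors = []
    for tier, attempt in attempts:
        try:
            return tier, await attempt()
        except Exception as e:  # production code should catch narrower exceptions per tier
            errors.append((tier, e))
    raise RuntimeError(f"all tiers failed: {errors}")

# Usage with stub tiers: the primary times out, so the chain falls through to rules.
async def demo():
    async def primary():
        raise TimeoutError("model server timed out")
    async def rules():
        return ["popular-1", "popular-2"]
    return await first_success([("primary", primary), ("rules", rules)])

tier, items = asyncio.run(demo())
print(tier, items)  # rules ['popular-1', 'popular-2']
```

Keeping the ordering in data rather than nested try/except blocks makes it easy to add or reorder tiers without touching the control flow.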
## Circuit Breaker Pattern for AI Services
The circuit breaker prevents cascading failures. When the model server starts failing, the circuit breaker stops sending requests to it (giving it time to recover) and routes traffic to fallbacks instead.
```python
# Circuit breaker for model inference
import asyncio
import time

class CircuitBreaker:
    """
    States:
    - CLOSED: Normal operation, requests go to the model
    - OPEN: Model is failing, requests go to fallback immediately
    - HALF_OPEN: Testing if model has recovered
    """
    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max=3,
                 call_timeout=0.1):
        self.state = "CLOSED"
        self.failure_count = 0
        self.failure_threshold = failure_threshold  # Open after N failures
        self.recovery_timeout = recovery_timeout    # Seconds before trying again
        self.half_open_max = half_open_max          # Test requests in half-open
        self.call_timeout = call_timeout            # Per-call timeout in seconds
        self.last_failure_time = None
        self.half_open_successes = 0

    async def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._should_attempt_reset():
                self.state = "HALF_OPEN"
                self.half_open_successes = 0
            else:
                raise CircuitOpenError(
                    f"Circuit open, retry in {self._time_until_retry()}s"
                )
        try:
            result = await asyncio.wait_for(func(*args, **kwargs),
                                            timeout=self.call_timeout)
            self._on_success()
            return result
        except (asyncio.TimeoutError, ConnectionError, ModelError):
            self._on_failure()
            raise

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            metrics.increment("circuit_breaker.opened")
            alert("Circuit breaker OPENED for model service")

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max:
                self.state = "CLOSED"
                self.failure_count = 0
                metrics.increment("circuit_breaker.closed")
        else:
            self.failure_count = 0

    def _should_attempt_reset(self):
        return time.time() - self.last_failure_time > self.recovery_timeout

    def _time_until_retry(self):
        return round(self.recovery_timeout - (time.time() - self.last_failure_time), 1)

# Usage:
breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

async def get_prediction(user_id):
    try:
        return await breaker.call(primary_model.predict, user_id)
    except CircuitOpenError:
        return await fallback_model.predict(user_id)  # Skip primary entirely
```
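The state transitions are the part most worth unit-testing, and they don't need async machinery to verify. Below is a minimal synchronous sketch of the same CLOSED → OPEN → HALF_OPEN machine (class name and thresholds are illustrative), driven directly with call outcomes:

```python
import time

class MiniBreaker:
    """Synchronous sketch of the circuit-breaker state machine, for unit tests."""
    def __init__(self, failure_threshold=2, recovery_timeout=0.0, half_open_max=1):
        self.state = "CLOSED"
        self.failures = 0
        self.half_open_successes = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max = half_open_max
        self.last_failure = None

    def record(self, ok: bool):
        """Feed one call outcome and step the state machine."""
        # OPEN -> HALF_OPEN once the recovery timeout has elapsed
        if self.state == "OPEN" and time.time() - self.last_failure >= self.recovery_timeout:
            self.state, self.half_open_successes = "HALF_OPEN", 0
        if ok:
            if self.state == "HALF_OPEN":
                self.half_open_successes += 1
                if self.half_open_successes >= self.half_open_max:
                    self.state, self.failures = "CLOSED", 0
            else:
                self.failures = 0
        else:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "OPEN"

b = MiniBreaker(failure_threshold=2, recovery_timeout=0.0)
b.record(False); b.record(False)   # two consecutive failures -> OPEN
opened = b.state
b.record(True)                     # timeout elapsed -> HALF_OPEN, success -> CLOSED
closed = b.state
print(opened, closed)  # OPEN CLOSED
```

Setting `recovery_timeout=0.0` in tests lets the OPEN → HALF_OPEN transition fire immediately instead of sleeping in the test suite.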
## Health Checks for Model Endpoints
AI health checks go beyond "is the server responding?" You need to verify that the model is loaded, warm, and producing reasonable outputs.
```python
# Three-level health check for AI services
import time

import torch
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/live")
async def liveness():
    """Is the process alive? (Kubernetes liveness probe)"""
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    """Is the model loaded and ready to serve? (Kubernetes readiness probe)"""
    checks = {
        "model_loaded": model_server.is_model_loaded(),
        "gpu_available": torch.cuda.is_available(),
        "gpu_memory_ok": torch.cuda.memory_allocated()
            < 0.95 * torch.cuda.get_device_properties(0).total_memory,
        "feature_store_connected": feature_store.ping(),
    }
    all_ok = all(checks.values())
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={"status": "ready" if all_ok else "not_ready", "checks": checks}
    )

@app.get("/health/inference")
async def inference_health():
    """Can the model produce valid predictions? (Deep health check, run every 60s)"""
    try:
        # Run inference on a known test input
        test_input = load_canary_input()  # Pre-defined input with known expected output
        start = time.time()
        result = await model.predict(test_input)
        latency_ms = (time.time() - start) * 1000
        checks = {
            "inference_works": result is not None,
            "output_valid": validate_output_schema(result),
            "latency_ok": latency_ms < 200,  # Under 200ms
            "output_reasonable": is_output_in_expected_range(result, test_input),
            "latency_ms": round(latency_ms, 1),
        }
        all_ok = all(v for k, v in checks.items() if k != "latency_ms")
        return JSONResponse(
            status_code=200 if all_ok else 503,
            content={"status": "healthy" if all_ok else "degraded", "checks": checks}
        )
    except Exception as e:
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "error": str(e)}
        )
```
## Disaster Recovery for ML Systems
ML disaster recovery requires recovering not just code and data, but also model artifacts, feature pipelines, and serving infrastructure.
| Component | Backup Strategy | Recovery Time | Recovery Point |
|---|---|---|---|
| Model artifacts | Stored in S3/GCS with versioning, replicated to second region | 5–15 minutes (download + load) | Latest version (RPO: 0) |
| Feature store (online) | Redis cluster with replication, or multi-AZ DynamoDB | Automatic failover (< 1 min) | Last sync (RPO: seconds) |
| Feature store (offline) | BigQuery/S3 cross-region replication | 30 min – 2 hours | Last replication (RPO: 1–24 hours) |
| Training pipeline | Code in Git, data in versioned storage | 4–8 hours (rebuild from scratch) | Last commit + data version |
| Inference config | GitOps (Kubernetes manifests in Git) | 10–30 minutes (apply manifests) | Last merge to main |
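Whatever the replication setup, the restore step should verify artifact integrity before loading — a corrupt model that loads "successfully" is worse than a failed restore. A sketch, assuming the registry records a SHA-256 checksum alongside each versioned artifact (the manifest format here is hypothetical):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so multi-GB artifacts don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(path: str, manifest: dict) -> bool:
    """Compare a restored artifact against the checksum recorded at training time."""
    return sha256_of(path) == manifest["sha256"]

# Usage: simulate a restored artifact and a manifest written at training time.
with tempfile.NamedTemporaryFile(delete=False, suffix=".onnx") as f:
    f.write(b"fake model weights")
    path = f.name
manifest = {"version": "2024-06-01", "sha256": sha256_of(path)}
ok = verify_artifact(path, manifest)
os.unlink(path)
print(ok)  # True
```

Running the same check against the replica region on a schedule also catches silent replication corruption before a disaster, not during one.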
## SLA Design for AI Services
AI SLAs are harder to define than traditional SLAs because the system can be "up" but producing bad results. You need SLAs that cover both availability and quality.
### Availability SLA

99.9% uptime = 8.76 hours downtime/year. For AI services, "up" means the service returns a response (from any tier). Even if the primary model is down, serving cached or rule-based results counts as "up." 99.5% is more realistic for GPU-dependent services (43.8 hours/year).
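The downtime budgets quoted above fall out of simple arithmetic — hours per year times the allowed failure fraction:

```python
def downtime_budget_hours(sla: float, hours_per_year: float = 8760.0) -> float:
    """Annual downtime allowed by an availability SLA, in hours."""
    return (1.0 - sla) * hours_per_year

print(round(downtime_budget_hours(0.999), 2))  # 8.76
print(round(downtime_budget_hours(0.995), 2))  # 43.8
```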
### Quality SLA
95% of responses from Tier 1 or Tier 2. Allows 5% of responses to come from rules or cache during incidents. Track tier_distribution as a metric. Alert when Tier 1 percentage drops below 90%.
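A sketch of the tier-distribution check, assuming each served response is tagged with the tier that produced it (the function name and tier labels follow the fallback chain above; the thresholds are the ones from this SLA):

```python
from collections import Counter

def quality_sla_report(tiers: list[str]) -> dict:
    """Summarize which tiers served traffic and whether the quality SLA holds."""
    counts = Counter(tiers)
    total = len(tiers)
    high_quality = (counts["primary"] + counts["fallback"]) / total
    return {
        "tier_distribution": {t: counts[t] / total for t in counts},
        "quality_sla_met": high_quality >= 0.95,              # 95% from Tier 1 or 2
        "alert_tier1_low": counts["primary"] / total < 0.90,  # page when Tier 1 < 90%
    }

# 93 primary, 4 fallback, 3 cache: SLA holds (97% high quality), no Tier-1 alert.
report = quality_sla_report(["primary"] * 93 + ["fallback"] * 4 + ["cache"] * 3)
print(report["quality_sla_met"], report["alert_tier1_low"])  # True False
```

In production this would run over a sliding window of responses rather than a static list, but the thresholds are the same.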
### Latency SLA
p99 < 150ms for the primary model path. This SLA applies only when serving from Tier 1. Fallback tiers may have different (usually better) latency characteristics. Include latency SLA in your API documentation.
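On a raw latency sample, p99 can be computed with a nearest-rank percentile — a sketch; production systems usually derive it from histogram metrics (e.g. Prometheus) rather than raw samples:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 synthetic latencies: 99 fast requests and one slow tail request.
latencies_ms = [50.0] * 99 + [400.0]
p99 = percentile(latencies_ms, 99)
print(p99, p99 < 150)  # 50.0 True
```

Note how the single 400 ms outlier sits just beyond p99 of 100 samples — one reason p99 needs a reasonably large window to be meaningful.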
### Freshness SLA
Model retrained within 24 hours of new data. Features updated within 1 hour. If real-time features are stale by more than 5 minutes, switch to batch features. Track feature_staleness as a metric.
Lilly Tech Systems