Intermediate

Cascade & Fallback Pattern

The cascade pattern starts with the cheapest option and escalates only when needed. The fallback pattern switches to a backup when the primary fails. Together, they let you build AI systems that are both cost-efficient and highly reliable.

What Is the Cascade Pattern?

The cascade pattern processes a request by trying the cheapest or simplest approach first and escalating to more expensive approaches only when the cheaper one isn't confident enough. It is fundamentally an optimization pattern: you pay for the expensive model only when you need it.

Imagine a customer support system. Most questions are simple: "What are your business hours?" or "How do I reset my password?" These can be answered by a rule engine or a tiny model for fractions of a cent. Only the complex, nuanced questions — "I was charged twice and my account is locked but I need access for a presentation tomorrow" — need the full power of a large language model.

The cascade pattern formalizes this intuition into a systematic architecture. Each tier in the cascade has a confidence threshold. If the tier can handle the request with sufficient confidence, it returns the result. If not, the request cascades to the next tier.
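The tier-and-threshold structure can be sketched generically before looking at a concrete implementation. This is a minimal illustration under assumed names — `Tier`, `handle`, and `run_cascade` are hypothetical, and each tier's `handle` is any callable that returns a response with a confidence score, or None if it cannot respond:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tier:
    name: str
    # Returns (response, confidence) or None if the tier can't respond at all
    handle: Callable[[str], Optional[tuple[str, float]]]
    threshold: float  # minimum confidence required to stop the cascade here

def run_cascade(tiers: list[Tier], query: str) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, response) from the
    first tier whose confidence meets its threshold."""
    for tier in tiers[:-1]:
        result = tier.handle(query)
        if result is not None and result[1] >= tier.threshold:
            return tier.name, result[0]
    # The last tier always answers; there is nothing left to escalate to
    final = tiers[-1].handle(query)
    return tiers[-1].name, final[0]
```

The final tier gets a threshold-free pass because escalation has to terminate somewhere; everything else is just an ordered scan.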

Confidence-Based Escalation

The critical design decision in any cascade is how you measure confidence. Common approaches include:

  • Model self-reported confidence: Ask the model to rate its confidence on a 0-1 scale. Simple but unreliable — models are often poorly calibrated and may be overconfident.
  • Pattern matching: If the response contains hedging language ("I'm not sure", "it depends", "I think"), treat it as low confidence.
  • Structural validation: Check if the response matches the expected format. A well-structured, complete response suggests higher confidence than a partial or malformed one.
  • Verifier model: Use a separate model to evaluate the quality of the response. More reliable but adds latency and cost.
  • Consistency check: Run the same request twice; if responses agree, confidence is higher. Expensive but effective for critical applications.
💡 Confidence thresholds need tuning. Set thresholds too high and everything cascades to the expensive model, negating the cost savings. Set them too low and poor-quality responses slip through. Start with conservative (high) thresholds and gradually lower them as you collect data on each tier's actual performance.
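Of the signals listed above, the consistency check is the easiest to show in isolation. A rough sketch, assuming a `generate` callable that wraps your model call; `difflib` similarity is a cheap stand-in here, and a real system might use embedding similarity instead:

```python
import difflib
from typing import Callable

def consistency_confidence(generate: Callable[[str], str], query: str) -> float:
    """Call the model twice and use response similarity as a confidence
    proxy: stable answers suggest the model actually knows the answer."""
    first = generate(query)
    second = generate(query)
    # SequenceMatcher ratio in [0, 1]; higher means more consistent output
    return difflib.SequenceMatcher(None, first.lower(), second.lower()).ratio()
```

This doubles the cost of the tier it guards, which is why the article reserves it for critical applications.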

Cost Optimization Through Cascading

The financial impact of cascading can be dramatic. Consider a system processing 100,000 requests per day. Without cascading, every request goes to Claude Sonnet. With a well-tuned cascade, 60% of requests are handled by regex or a small model, 30% by Sonnet, and only 10% escalate to Opus.

The math is compelling:

  • Without cascade: 100,000 requests at ~$0.01 each = $1,000/day
  • With cascade: 60,000 at $0.0001 + 30,000 at $0.01 + 10,000 at $0.05 = $6 + $300 + $500 = $806/day
  • As a bonus, the 60,000 regex-handled requests get instant responses, improving user experience too

That is roughly a 19% saving even though the hardest 10% of requests were upgraded to a more capable, more expensive model. In practice, well-tuned cascades typically save 30-70% on model costs while maintaining or improving response quality, because each tier is optimized for its specific difficulty level; the exact figure depends on how much traffic the cheap tiers absorb and whether the top tier is an upgrade over your single-model baseline.
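The arithmetic above generalizes to a small helper. A sketch using the illustrative per-request prices from this section (not actual provider pricing):

```python
def cascade_daily_cost(daily_requests: int,
                       tier_mix: list[tuple[float, float]]) -> float:
    """Daily cost given (fraction_of_traffic, cost_per_request) per tier."""
    assert abs(sum(f for f, _ in tier_mix) - 1.0) < 1e-9, "fractions must sum to 1"
    return daily_requests * sum(frac * cost for frac, cost in tier_mix)

# The worked example: 60% regex, 30% mid-tier, 10% large model
mix = [(0.60, 0.0001), (0.30, 0.01), (0.10, 0.05)]
```

`cascade_daily_cost(100_000, mix)` reproduces the ≈$806/day figure, and sweeping the fractions shows how sensitive total cost is to the first tier's absorption rate.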

Cascade vs Router: When to Use Which

The cascade and router patterns may seem similar — both direct requests to different models — but they solve different problems:

  • Router: Decides upfront which model to use based on input characteristics. The decision is made once, before any model is called. Best when you can reliably predict the right model from the input alone.
  • Cascade: Tries models in sequence, escalating based on output quality. Multiple models may be called for a single request. Best when you can't predict difficulty upfront but can evaluate output quality.

In practice, many systems combine both: a router handles obvious cases (clearly simple or clearly complex), and the ambiguous middle gets cascaded.
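That hybrid can be sketched with a trivial upfront classifier. All the callables here — `route_obvious`, `cheap_path`, `expensive_path`, `cascade_path` — are hypothetical stand-ins for your router and cascade:

```python
from typing import Callable, Optional

def hybrid(query: str,
           route_obvious: Callable[[str], Optional[str]],
           cheap_path: Callable[[str], str],
           expensive_path: Callable[[str], str],
           cascade_path: Callable[[str], str]) -> str:
    """Router handles clear-cut cases upfront; ambiguous ones cascade."""
    decision = route_obvious(query)  # "simple", "complex", or None
    if decision == "simple":
        return cheap_path(query)
    if decision == "complex":
        return expensive_path(query)
    return cascade_path(query)  # ambiguous middle: escalate on output quality
```

The router's job is only to skim off the cases it can classify with near certainty; returning None for everything else keeps misroutes cheap.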

Code Example: 3-Tier Cascade

This cascade tries regex rules first (instant, free), then a small model (fast, cheap), and finally a large model (slow, expensive). Each tier evaluates whether it can handle the request confidently:

Python
import anthropic
import re
import json
import time

client = anthropic.Anthropic()

# Tier 1: Regex / rule-based responses (instant, free)
FAQ_PATTERNS = {
    r"(?i)(business|office)\s*hours?": "Our business hours are Monday-Friday, 9 AM - 6 PM EST.",
    r"(?i)reset\s*(my\s*)?password": "To reset your password, visit account.example.com/reset and follow the instructions.",
    r"(?i)(refund|return)\s*policy": "We offer full refunds within 30 days of purchase. Visit example.com/returns to start a return.",
    r"(?i)contact\s*(support|us|help)": "You can reach support at support@example.com or call 1-800-EXAMPLE.",
    r"(?i)(cancel|cancellation)": "To cancel your subscription, go to Settings > Subscription > Cancel. You'll retain access until the billing period ends.",
    r"(?i)^(hi|hello|hey|good\s*(morning|afternoon|evening))": "Hello! How can I help you today?",
}

def tier1_regex(query: str) -> dict | None:
    """Tier 1: Pattern matching for common queries."""
    for pattern, response in FAQ_PATTERNS.items():
        if re.search(pattern, query):
            return {
                "tier": 1,
                "method": "regex",
                "response": response,
                "confidence": 0.95,
                "cost": 0.0,
                "latency_ms": 0
            }
    return None  # Can't handle -> cascade to tier 2

def tier2_small_model(query: str) -> dict | None:
    """Tier 2: Small model for straightforward questions."""
    start = time.time()
    response = client.messages.create(
        model="claude-haiku-4-20250514",
        max_tokens=1000,
        system=(
            "Answer the user's question concisely. "
            "If you're not confident in your answer, start with 'UNSURE:' "
            "and explain why."
        ),
        messages=[{"role": "user", "content": query}]
    )
    latency = (time.time() - start) * 1000
    text = response.content[0].text

    # Check confidence signals
    if text.startswith("UNSURE:"):
        return None  # Not confident -> cascade to tier 3

    # Check for hedging language
    hedging_phrases = ["i'm not sure", "i think", "it depends", "possibly",
                       "i don't have enough", "i cannot determine"]
    hedging_count = sum(1 for phrase in hedging_phrases if phrase in text.lower())

    if hedging_count >= 2:
        return None  # Too much uncertainty -> cascade

    return {
        "tier": 2,
        "method": "haiku",
        "response": text,
        "confidence": max(0.5, 0.9 - hedging_count * 0.15),
        "cost": (response.usage.input_tokens + response.usage.output_tokens) * 0.00000025,  # illustrative blended rate, not actual pricing
        "latency_ms": round(latency)
    }

def tier3_large_model(query: str) -> dict:
    """Tier 3: Large model for complex questions (always returns)."""
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        system="Provide a thorough, well-structured answer. Be helpful and precise.",
        messages=[{"role": "user", "content": query}]
    )
    latency = (time.time() - start) * 1000

    return {
        "tier": 3,
        "method": "sonnet",
        "response": response.content[0].text,
        "confidence": 0.95,
        "cost": (response.usage.input_tokens + response.usage.output_tokens) * 0.000003,  # illustrative blended rate, not actual pricing
        "latency_ms": round(latency)
    }

def cascade(query: str) -> dict:
    """Execute the 3-tier cascade."""
    # Try each tier in order
    result = tier1_regex(query)
    if result:
        print(f"[Cascade] Handled at Tier 1 (regex) - {result['latency_ms']}ms")
        return result

    result = tier2_small_model(query)
    if result:
        print(f"[Cascade] Handled at Tier 2 (Haiku) - {result['latency_ms']}ms")
        return result

    result = tier3_large_model(query)
    print(f"[Cascade] Handled at Tier 3 (Sonnet) - {result['latency_ms']}ms")
    return result

# Usage examples
print(cascade("What are your business hours?"))      # Tier 1
print(cascade("How do I sort a list in Python?"))     # Tier 2
print(cascade("Compare CQRS vs event sourcing..."))   # Tier 3

Code Example: Quality-Aware Cascade with Self-Evaluation

This more sophisticated cascade uses the model itself to evaluate whether its response is good enough, and escalates if the self-evaluation fails:

Python
import anthropic
import json

client = anthropic.Anthropic()

def self_evaluate(query: str, response: str, model: str) -> dict:
    """Ask a model to evaluate the quality of a response."""
    eval_result = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Evaluate this response to the given question.\n\n"
                f"Question: {query}\n\n"
                f"Response: {response}\n\n"
                "Rate on these criteria (1-5 each):\n"
                "- accuracy: Is the information correct?\n"
                "- completeness: Does it fully answer the question?\n"
                "- clarity: Is it well-written and easy to understand?\n\n"
                'Return JSON: {"accuracy": N, "completeness": N, '
                '"clarity": N, "overall": N, "should_escalate": bool}'
            )
        }]
    )
    text = eval_result.content[0].text.strip()
    # Models sometimes wrap JSON in markdown fences; strip them before parsing
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)

def quality_cascade(query: str, min_quality: float = 3.5) -> dict:
    """Cascade with quality self-evaluation at each tier."""
    tiers = [
        ("claude-haiku-4-20250514", "fast"),
        ("claude-sonnet-4-20250514", "balanced"),
        ("claude-opus-4-20250514", "powerful")
    ]

    for i, (model, tier_name) in enumerate(tiers):
        # Generate response
        response = client.messages.create(
            model=model,
            max_tokens=4000,
            messages=[{"role": "user", "content": query}]
        )
        text = response.content[0].text

        # Last tier: return without evaluation (no escalation possible)
        if i == len(tiers) - 1:
            return {
                "tier": tier_name, "model": model,
                "response": text, "escalation_reason": None
            }

        # Evaluate quality (use same model for speed)
        evaluation = self_evaluate(query, text, model)

        if evaluation["overall"] >= min_quality and not evaluation["should_escalate"]:
            return {
                "tier": tier_name, "model": model,
                "response": text, "quality": evaluation,
                "escalation_reason": None
            }
        else:
            print(f"[Cascade] {tier_name} scored {evaluation['overall']}/5, "
                  f"escalating to next tier...")

    # Should never reach here, but safety fallback
    return {"error": "All tiers exhausted"}

The Fallback Pattern

While the cascade is about cost optimization (cheap first, expensive if needed), the fallback pattern is about reliability (primary fails, backup takes over). The trigger is different: cascades escalate based on quality; fallbacks activate based on failures.

Common failure modes that trigger fallbacks:

  • API errors: 500 errors, timeouts, rate limits from the primary provider
  • Content filtering: Primary model refuses the request due to safety filters, but the request is legitimate
  • Capacity issues: Primary model is overloaded and response times are unacceptable
  • Regional outages: Provider has an outage in a specific region
  • Model deprecation: Primary model version is retired or degraded
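Before layering on circuit breakers, the basic shape of a fallback is just an ordered list of providers. A minimal sketch; the provider callables are placeholders for your actual model-calling functions:

```python
from typing import Callable

def call_with_fallbacks(query: str,
                        providers: list[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, response)
    from the first one that doesn't raise."""
    errors = []
    for name, call in providers:
        try:
            return name, call(query)
        except Exception as e:  # in practice, catch provider-specific errors
            errors.append(f"{name}: {e}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

Collecting the per-provider errors matters: when every tier fails, you want the full chain of failures in your logs, not just the last one.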

Circuit Breaker Integration

The circuit breaker pattern prevents your system from repeatedly calling a failing service. When errors exceed a threshold, the circuit "opens" and all requests are immediately routed to the fallback without even trying the primary. After a cooling period, the circuit moves to "half-open" — a few test requests are sent to the primary. If they succeed, the circuit closes and normal routing resumes.

Python
import time
from dataclasses import dataclass, field

@dataclass
class CircuitBreaker:
    """Circuit breaker for AI model calls."""
    failure_threshold: int = 5       # Failures before opening
    recovery_timeout: int = 60       # Seconds before trying again
    half_open_max_calls: int = 3     # Test calls in half-open state

    _failures: int = field(default=0, init=False)
    _state: str = field(default="closed", init=False)    # closed, open, half-open
    _last_failure_time: float = field(default=0, init=False)
    _half_open_successes: int = field(default=0, init=False)

    def can_execute(self) -> bool:
        """Check if the circuit allows a request."""
        if self._state == "closed":
            return True

        if self._state == "open":
            # Check if recovery timeout has elapsed
            if time.time() - self._last_failure_time > self.recovery_timeout:
                self._state = "half-open"
                self._half_open_successes = 0
                print("[Circuit] Moving to half-open state")
                return True
            return False

        # half-open: allow test calls through; the circuit closes again
        # only after half_open_max_calls successes (see record_success)
        return True

    def record_success(self):
        """Record a successful call."""
        if self._state == "half-open":
            self._half_open_successes += 1
            if self._half_open_successes >= self.half_open_max_calls:
                self._state = "closed"
                self._failures = 0
                print("[Circuit] Closed - service recovered")
        else:
            self._failures = max(0, self._failures - 1)

    def record_failure(self):
        """Record a failed call."""
        self._failures += 1
        self._last_failure_time = time.time()
        if self._failures >= self.failure_threshold:
            self._state = "open"
            print(f"[Circuit] OPEN - {self._failures} failures, "
                  f"routing to fallback for {self.recovery_timeout}s")

# Usage with fallback (call_primary_model / call_fallback_model are
# placeholders for your own model-calling functions)
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

def resilient_call(query: str) -> str:
    if breaker.can_execute():
        try:
            result = call_primary_model(query)
            breaker.record_success()
            return result
        except Exception as e:
            breaker.record_failure()
            print(f"[Fallback] Primary failed: {e}")

    # Fallback to secondary model
    return call_fallback_model(query)

Retry Strategies with Exponential Backoff

Before falling back to a different model, you should retry the primary model with increasingly longer waits. This handles transient failures like network blips or momentary rate limits without unnecessary fallbacks.

  • Exponential backoff: Wait 1s, 2s, 4s, 8s between retries. Prevents overwhelming a struggling service.
  • Jitter: Add random variation to the wait time so that many clients don't all retry at the exact same moment.
  • Max retries: Always set a maximum (typically 3-5). Infinite retries create unbounded latency.
  • Idempotency: Ensure retries are safe. Retrying an LLM call itself is harmless (same input, fresh output), but if your chain has side effects (writing to a database, sending emails), guard against duplicate execution.
Python
import time
import random
import anthropic

def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    """Retry a function with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except anthropic.RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"[Retry] Rate limited, waiting {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries:
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                print(f"[Retry] Server error {e.status_code}, waiting {delay:.1f}s")
                time.sleep(delay)
            else:
                raise
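The idempotency point deserves its own sketch. One common guard keys each side-effectful step by a caller-supplied request ID, shown here with an in-memory store; a production system would back this with a database or cache, and `request_id` / `run_once` are hypothetical names:

```python
from typing import Callable

_completed: dict[str, str] = {}  # request_id -> result (in-memory for illustration)

def run_once(request_id: str, side_effect: Callable[[], str]) -> str:
    """Execute a side-effectful step at most once per request_id, so a
    retried pipeline doesn't, e.g., send the same email twice."""
    if request_id in _completed:
        return _completed[request_id]
    result = side_effect()
    _completed[request_id] = result
    return result
```

With this guard in place, the retry loop above can safely re-run a whole chain: the LLM call repeats, but the guarded side effects do not.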

Cost Savings Analysis

The following table shows realistic cost savings from cascade and fallback patterns at different traffic volumes:

Scenario             | Daily Requests | Without Cascade | With Cascade | Savings
---------------------|----------------|-----------------|--------------|--------
Customer Support Bot |         10,000 |        $100/day |      $35/day |     65%
Document Classifier  |         50,000 |        $500/day |     $125/day |     75%
Code Assistant       |         25,000 |        $375/day |     $180/day |     52%
Content Moderation   |        200,000 |      $2,000/day |     $320/day |     84%
Research Assistant   |          5,000 |        $250/day |     $150/day |     40%

Content moderation sees the highest savings because the vast majority of content is clearly safe or clearly violating, easily handled by pattern matching or a tiny model. Research assistance sees lower savings because most queries genuinely require a capable model. Your savings will depend on the distribution of complexity in your specific workload.

Anti-Patterns to Avoid

Cascading Too Aggressively

If your confidence thresholds are too strict, nearly everything escalates to the expensive model. You've added the latency of trying cheap models without any of the cost savings. Monitor your escalation rate: if more than 50% of requests escalate, your first tier isn't handling enough.

No Quality Gates

Blindly accepting the first model's response without evaluating quality defeats the purpose of cascading. The cheap model might return a confidently wrong answer. Always validate the response against format, completeness, and consistency expectations before accepting it.
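A structural quality gate can be as simple as checking parseability and required fields. A sketch for a tier that is expected to return JSON; the field names in the usage example are hypothetical:

```python
import json

def passes_quality_gate(raw: str, required_keys: set[str]) -> bool:
    """Accept a response only if it parses as JSON and contains every
    expected field; otherwise the cascade should escalate."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()
```

Cheap checks like this catch malformed and truncated output; they cannot catch a confidently wrong answer, which is why the verifier-model and consistency signals discussed earlier still matter.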

Ignoring Latency Impact

A cascade adds latency when requests escalate because they've been processed (and rejected) by lower tiers first. For the worst case (full escalation through all tiers), the total latency is the sum of all tiers. Ensure this worst-case latency is still acceptable for your application.

No Monitoring

Without monitoring, you can't know if your cascade is actually saving money or if your escalation thresholds need adjustment. Track: escalation rates per tier, quality scores at each tier, cost per request by tier, and latency distributions.
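A minimal in-process tracker covers the first of those metrics. Names here are illustrative, and a real deployment would export these counters to your metrics system rather than hold them in memory:

```python
from collections import Counter

class CascadeMetrics:
    """Track which tier handled each request and the cost incurred."""
    def __init__(self):
        self.tier_counts = Counter()
        self.total_cost = 0.0

    def record(self, tier: int, cost: float):
        self.tier_counts[tier] += 1
        self.total_cost += cost

    def escalation_rate(self, past_tier: int) -> float:
        """Fraction of requests that escalated beyond the given tier."""
        total = sum(self.tier_counts.values())
        escalated = sum(c for t, c in self.tier_counts.items() if t > past_tier)
        return escalated / total if total else 0.0
```

Calling `metrics.record(result["tier"], result["cost"])` after each cascade run is enough to watch the tier-1 escalation rate against the 50% warning threshold mentioned above.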

Test your fallbacks regularly. A fallback that hasn't been tested is a fallback that won't work when you need it. Run "chaos engineering" exercises where you deliberately disable the primary model and verify that the fallback handles traffic correctly. Do this in production, not just in staging.