Cascade & Fallback Pattern
The cascade pattern starts with the cheapest option and escalates only when needed. The fallback pattern switches to a backup when the primary fails. Together, they let you build AI systems that are both cost-efficient and highly reliable.
What Is the Cascade Pattern?
The cascade pattern processes a request by trying the cheapest or simplest approach first and escalating to more expensive approaches only when the cheaper one isn't confident enough. It is fundamentally an optimization pattern: you pay for the expensive model only when you need it.
Imagine a customer support system. Most questions are simple: "What are your business hours?" or "How do I reset my password?" These can be answered by a rule engine or a tiny model for fractions of a cent. Only the complex, nuanced questions — "I was charged twice and my account is locked but I need access for a presentation tomorrow" — need the full power of a large language model.
The cascade pattern formalizes this intuition into a systematic architecture. Each tier in the cascade has a confidence threshold. If the tier can handle the request with sufficient confidence, it returns the result. If not, the request cascades to the next tier.
Confidence-Based Escalation
The critical design decision in any cascade is how you measure confidence. Common approaches include:
- Model self-reported confidence: Ask the model to rate its confidence on a 0-1 scale. Simple but unreliable — models are often poorly calibrated and may be overconfident.
- Pattern matching: If the response contains hedging language ("I'm not sure", "it depends", "I think"), treat it as low confidence.
- Structural validation: Check if the response matches the expected format. A well-structured, complete response suggests higher confidence than a partial or malformed one.
- Verifier model: Use a separate model to evaluate the quality of the response. More reliable but adds latency and cost.
- Consistency check: Run the same request twice; if the responses agree, confidence is higher. Expensive but effective for critical applications; a minimal sketch follows this list.
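For example, a consistency check can be implemented with two calls and a cheap text comparison. This is a minimal sketch: the difflib similarity ratio and the reuse of the Haiku model ID from the later examples are illustrative choices, and a production system might use embeddings or a judge model instead.
import anthropic
from difflib import SequenceMatcher
client = anthropic.Anthropic()
def consistency_confidence(query: str, model: str = "claude-haiku-4-20250514") -> float:
    """Estimate confidence by running the same request twice and measuring agreement."""
    responses = []
    for _ in range(2):
        result = client.messages.create(
            model=model,
            max_tokens=500,
            messages=[{"role": "user", "content": query}],
        )
        responses.append(result.content[0].text)
    # Surface-level text similarity as a crude agreement proxy (0.0 - 1.0)
    return SequenceMatcher(None, responses[0], responses[1]).ratio()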
Cost Optimization Through Cascading
The financial impact of cascading can be dramatic. Consider a system processing 100,000 requests per day. Without cascading, every request goes to Claude Sonnet. With a well-tuned cascade, 60% of requests are handled by regex or a small model, 30% by Sonnet, and only 10% escalate to Opus.
The math is compelling:
- Without cascade: 100,000 requests at ~$0.01 each = $1,000/day
- With cascade: 60,000 at $0.0001 + 30,000 at $0.01 + 10,000 at $0.05 = $6 + $300 + $500 = $806/day
- But wait: Many of those 60,000 cheap responses are instant (regex), improving user experience too
This particular example saves roughly 20%, and the savings grow as more traffic is absorbed by the cheaper tiers. In practice, well-tuned cascades typically save 30-70% on model costs while maintaining or improving response quality, because each tier is optimized for its specific difficulty level.
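To rerun this arithmetic with your own traffic mix, a small helper is enough. The per-request rates below are the illustrative figures from the example above, not real pricing:
def daily_cost(total_requests: int, tiers: list[tuple[float, float]]) -> float:
    """Sum cost over (traffic_fraction, cost_per_request) pairs; fractions should total 1.0."""
    return sum(total_requests * fraction * cost for fraction, cost in tiers)
# 60% regex/small model, 30% Sonnet, 10% Opus vs. everything on Sonnet
print(daily_cost(100_000, [(0.60, 0.0001), (0.30, 0.01), (0.10, 0.05)]))  # 806.0
print(daily_cost(100_000, [(1.00, 0.01)]))                                # 1000.0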
Cascade vs Router: When to Use Which
The cascade and router patterns may seem similar — both direct requests to different models — but they solve different problems:
- Router: Decides upfront which model to use based on input characteristics. The decision is made once, before any model is called. Best when you can reliably predict the right model from the input alone.
- Cascade: Tries models in sequence, escalating based on output quality. Multiple models may be called for a single request. Best when you can't predict difficulty upfront but can evaluate output quality.
In practice, many systems combine both: a router handles obvious cases (clearly simple or clearly complex), and the ambiguous middle gets cascaded.
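A minimal sketch of that combination, reusing the tier functions and cascade defined in the next example; classify_complexity is a hypothetical router that could be a keyword heuristic or a tiny classification model:
def route_or_cascade(query: str) -> dict:
    """Route the obvious cases directly; cascade the ambiguous middle."""
    complexity = classify_complexity(query)  # hypothetical: returns "simple", "complex", or "unknown"
    if complexity == "simple":
        # Clearly simple: try the cheap tiers first (still falls through if they can't answer)
        return tier1_regex(query) or tier2_small_model(query) or tier3_large_model(query)
    if complexity == "complex":
        return tier3_large_model(query)  # clearly complex: go straight to the strong model
    return cascade(query)  # ambiguous: let output quality decide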
Code Example: 3-Tier Cascade
This cascade tries regex rules first (instant, free), then a small model (fast, cheap), and finally a large model (slow, expensive). Each tier evaluates whether it can handle the request confidently:
import anthropic
import re
import time
client = anthropic.Anthropic()
# Tier 1: Regex / rule-based responses (instant, free)
FAQ_PATTERNS = {
r"(?i)(business|office)\s*hours?": "Our business hours are Monday-Friday, 9 AM - 6 PM EST.",
r"(?i)reset\s*(my\s*)?password": "To reset your password, visit account.example.com/reset and follow the instructions.",
r"(?i)(refund|return)\s*policy": "We offer full refunds within 30 days of purchase. Visit example.com/returns to start a return.",
r"(?i)contact\s*(support|us|help)": "You can reach support at support@example.com or call 1-800-EXAMPLE.",
r"(?i)(cancel|cancellation)": "To cancel your subscription, go to Settings > Subscription > Cancel. You'll retain access until the billing period ends.",
r"(?i)^(hi|hello|hey|good\s*(morning|afternoon|evening))": "Hello! How can I help you today?",
}
def tier1_regex(query: str) -> dict | None:
"""Tier 1: Pattern matching for common queries."""
for pattern, response in FAQ_PATTERNS.items():
if re.search(pattern, query):
return {
"tier": 1,
"method": "regex",
"response": response,
"confidence": 0.95,
"cost": 0.0,
"latency_ms": 0
}
return None # Can't handle -> cascade to tier 2
def tier2_small_model(query: str) -> dict | None:
"""Tier 2: Small model for straightforward questions."""
start = time.time()
response = client.messages.create(
model="claude-haiku-4-20250514",
max_tokens=1000,
system=(
"Answer the user's question concisely. "
"If you're not confident in your answer, start with 'UNSURE:' "
"and explain why."
),
messages=[{"role": "user", "content": query}]
)
latency = (time.time() - start) * 1000
text = response.content[0].text
# Check confidence signals
if text.startswith("UNSURE:"):
return None # Not confident -> cascade to tier 3
# Check for hedging language
hedging_phrases = ["i'm not sure", "i think", "it depends", "possibly",
"i don't have enough", "i cannot determine"]
hedging_count = sum(1 for phrase in hedging_phrases if phrase in text.lower())
if hedging_count >= 2:
return None # Too much uncertainty -> cascade
return {
"tier": 2,
"method": "haiku",
"response": text,
"confidence": max(0.5, 0.9 - hedging_count * 0.15),
"cost": (response.usage.input_tokens + response.usage.output_tokens) * 0.00000025,
"latency_ms": round(latency)
}
def tier3_large_model(query: str) -> dict:
"""Tier 3: Large model for complex questions (always returns)."""
start = time.time()
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4000,
system="Provide a thorough, well-structured answer. Be helpful and precise.",
messages=[{"role": "user", "content": query}]
)
latency = (time.time() - start) * 1000
return {
"tier": 3,
"method": "sonnet",
"response": response.content[0].text,
"confidence": 0.95,
"cost": (response.usage.input_tokens + response.usage.output_tokens) * 0.000003,
"latency_ms": round(latency)
}
def cascade(query: str) -> dict:
"""Execute the 3-tier cascade."""
# Try each tier in order
result = tier1_regex(query)
if result:
print(f"[Cascade] Handled at Tier 1 (regex) - {result['latency_ms']}ms")
return result
result = tier2_small_model(query)
if result:
print(f"[Cascade] Handled at Tier 2 (Haiku) - {result['latency_ms']}ms")
return result
result = tier3_large_model(query)
print(f"[Cascade] Handled at Tier 3 (Sonnet) - {result['latency_ms']}ms")
return result
# Usage examples
print(cascade("What are your business hours?")) # Tier 1
print(cascade("How do I sort a list in Python?")) # Tier 2
print(cascade("Compare CQRS vs event sourcing...")) # Tier 3
Code Example: Quality-Aware Cascade with Self-Evaluation
This more sophisticated cascade uses the model itself to evaluate whether its response is good enough, and escalates if the self-evaluation fails:
import anthropic
import json
client = anthropic.Anthropic()
def self_evaluate(query: str, response: str, model: str) -> dict:
"""Ask a model to evaluate the quality of a response."""
eval_result = client.messages.create(
model=model,
max_tokens=300,
messages=[{
"role": "user",
"content": (
"Evaluate this response to the given question.\n\n"
f"Question: {query}\n\n"
f"Response: {response}\n\n"
"Rate on these criteria (1-5 each):\n"
"- accuracy: Is the information correct?\n"
"- completeness: Does it fully answer the question?\n"
"- clarity: Is it well-written and easy to understand?\n\n"
'Return JSON: {"accuracy": N, "completeness": N, '
'"clarity": N, "overall": N, "should_escalate": bool}'
)
}]
)
return json.loads(eval_result.content[0].text)
def quality_cascade(query: str, min_quality: float = 3.5) -> dict:
"""Cascade with quality self-evaluation at each tier."""
tiers = [
("claude-haiku-4-20250514", "fast"),
("claude-sonnet-4-20250514", "balanced"),
("claude-opus-4-20250514", "powerful")
]
for i, (model, tier_name) in enumerate(tiers):
# Generate response
response = client.messages.create(
model=model,
max_tokens=4000,
messages=[{"role": "user", "content": query}]
)
text = response.content[0].text
# Last tier: return without evaluation (no escalation possible)
if i == len(tiers) - 1:
return {
"tier": tier_name, "model": model,
"response": text, "escalation_reason": None
}
# Evaluate quality (use same model for speed)
evaluation = self_evaluate(query, text, model)
if evaluation["overall"] >= min_quality and not evaluation["should_escalate"]:
return {
"tier": tier_name, "model": model,
"response": text, "quality": evaluation,
"escalation_reason": None
}
else:
print(f"[Cascade] {tier_name} scored {evaluation['overall']}/5, "
f"escalating to next tier...")
# Should never reach here, but safety fallback
return {"error": "All tiers exhausted"}
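A usage sketch. Note that self_evaluate assumes the evaluator returns bare JSON; in practice you would want to strip markdown fences or catch json.JSONDecodeError and treat a parse failure as a signal to escalate.
result = quality_cascade("Compare optimistic vs pessimistic locking for a booking system.")
print(result.get("tier"), "->", result.get("response", "")[:200])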
The Fallback Pattern
While the cascade is about cost optimization (cheap first, expensive if needed), the fallback pattern is about reliability (primary fails, backup takes over). The trigger is different: cascades escalate based on quality; fallbacks activate based on failures.
Common failure modes that trigger fallbacks (a minimal fallback sketch follows this list):
- API errors: 500 errors, timeouts, rate limits from the primary provider
- Content filtering: Primary model refuses the request due to safety filters, but the request is legitimate
- Capacity issues: Primary model is overloaded and response times are unacceptable
- Regional outages: Provider has an outage in a specific region
- Model deprecation: Primary model version is retired or degraded
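A minimal sketch of a provider-level fallback triggered by the failure modes above. The model choices and 30-second timeout are illustrative, and the content-filtering case would additionally require inspecting the response text for a refusal:
import anthropic
client = anthropic.Anthropic()
def call_with_fallback(query: str,
                       primary: str = "claude-sonnet-4-20250514",
                       backup: str = "claude-haiku-4-20250514") -> str:
    """Try the primary model; on API errors, rate limits, or timeouts, use the backup."""
    try:
        response = client.messages.create(
            model=primary,
            max_tokens=1000,
            messages=[{"role": "user", "content": query}],
            timeout=30.0,  # treat a very slow primary as a failure
        )
        return response.content[0].text
    except (anthropic.APIStatusError, anthropic.APIConnectionError) as e:
        # Covers 5xx errors, rate limits, and timeouts from the primary provider
        print(f"[Fallback] Primary failed ({type(e).__name__}), switching to backup")
        response = client.messages.create(
            model=backup,
            max_tokens=1000,
            messages=[{"role": "user", "content": query}],
        )
        return response.content[0].text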
Circuit Breaker Integration
The circuit breaker pattern prevents your system from repeatedly calling a failing service. When errors exceed a threshold, the circuit "opens" and all requests are immediately routed to the fallback without even trying the primary. After a cooling period, the circuit moves to "half-open" — a few test requests are sent to the primary. If they succeed, the circuit closes and normal routing resumes.
import time
from dataclasses import dataclass, field
@dataclass
class CircuitBreaker:
"""Circuit breaker for AI model calls."""
failure_threshold: int = 5 # Failures before opening
recovery_timeout: int = 60 # Seconds before trying again
half_open_max_calls: int = 3 # Test calls in half-open state
_failures: int = field(default=0, init=False)
_state: str = field(default="closed", init=False) # closed, open, half-open
_last_failure_time: float = field(default=0, init=False)
_half_open_successes: int = field(default=0, init=False)
def can_execute(self) -> bool:
"""Check if the circuit allows a request."""
if self._state == "closed":
return True
if self._state == "open":
# Check if recovery timeout has elapsed
if time.time() - self._last_failure_time > self.recovery_timeout:
self._state = "half-open"
self._half_open_successes = 0
print("[Circuit] Moving to half-open state")
return True
return False
# half-open: allow limited test calls
return True
def record_success(self):
"""Record a successful call."""
if self._state == "half-open":
self._half_open_successes += 1
if self._half_open_successes >= self.half_open_max_calls:
self._state = "closed"
self._failures = 0
print("[Circuit] Closed - service recovered")
else:
self._failures = max(0, self._failures - 1)
def record_failure(self):
"""Record a failed call."""
self._failures += 1
self._last_failure_time = time.time()
if self._failures >= self.failure_threshold:
self._state = "open"
print(f"[Circuit] OPEN - {self._failures} failures, "
f"routing to fallback for {self.recovery_timeout}s")
# Usage with fallback
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)
def resilient_call(query: str) -> str:
if breaker.can_execute():
try:
result = call_primary_model(query)
breaker.record_success()
return result
except Exception as e:
breaker.record_failure()
print(f"[Fallback] Primary failed: {e}")
# Fallback to secondary model
return call_fallback_model(query)
Retry Strategies with Exponential Backoff
Before falling back to a different model, you should retry the primary model with increasingly longer waits. This handles transient failures like network blips or momentary rate limits without unnecessary fallbacks.
- Exponential backoff: Wait 1s, 2s, 4s, 8s between retries. Prevents overwhelming a struggling service.
- Jitter: Add random variation to the wait time so that many clients don't all retry at the exact same moment.
- Max retries: Always set a maximum (typically 3-5). Infinite retries create unbounded latency.
- Idempotency: Ensure retries are safe. A bare LLM call has no side effects, so retrying it is harmless (same input, fresh output), but if your chain has side effects (writing to a database, sending emails), guard against duplicate execution.
import time
import random
import anthropic
def retry_with_backoff(func, max_retries=3, base_delay=1.0):
"""Retry a function with exponential backoff and jitter."""
for attempt in range(max_retries + 1):
try:
return func()
except anthropic.RateLimitError:
if attempt == max_retries:
raise
delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
print(f"[Retry] Rate limited, waiting {delay:.1f}s (attempt {attempt + 1})")
time.sleep(delay)
except anthropic.APIStatusError as e:
if e.status_code >= 500 and attempt < max_retries:
delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
print(f"[Retry] Server error {e.status_code}, waiting {delay:.1f}s")
time.sleep(delay)
else:
raise
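Retries and fallbacks compose: exhaust the backoff schedule against the primary first, then switch. A sketch using the same call_primary_model / call_fallback_model placeholders as the circuit breaker example:
def resilient_call_with_retry(query: str) -> str:
    """Retry the primary with backoff; only fall back once retries are exhausted."""
    try:
        return retry_with_backoff(lambda: call_primary_model(query), max_retries=3)
    except Exception as e:
        print(f"[Fallback] Primary still failing after retries: {e}")
        return call_fallback_model(query)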
Cost Savings Analysis
The following table shows realistic cost savings from cascade and fallback patterns at different traffic volumes:
| Scenario | Daily Requests | Without Cascade | With Cascade | Savings |
|---|---|---|---|---|
| Customer Support Bot | 10,000 | $100/day | $35/day | 65% |
| Document Classifier | 50,000 | $500/day | $125/day | 75% |
| Code Assistant | 25,000 | $375/day | $180/day | 52% |
| Content Moderation | 200,000 | $2,000/day | $320/day | 84% |
| Research Assistant | 5,000 | $250/day | $150/day | 40% |
Content moderation sees the highest savings because the vast majority of content is clearly safe or clearly violating, easily handled by pattern matching or a tiny model. Research assistance sees lower savings because most queries genuinely require a capable model. Your savings will depend on the distribution of complexity in your specific workload.
Anti-Patterns to Avoid
Cascading Too Aggressively
If your confidence thresholds are too strict, nearly everything escalates to the expensive model. You've added the latency of trying cheap models without any of the cost savings. Monitor your escalation rate: if more than 50% of requests escalate, your first tier isn't handling enough.
No Quality Gates
Blindly accepting the first model's response without evaluating quality defeats the purpose of cascading. The cheap model might return a confidently wrong answer. Always validate the response against format, completeness, and consistency expectations before accepting it.
Ignoring Latency Impact
A cascade adds latency when requests escalate because they've been processed (and rejected) by lower tiers first. For the worst case (full escalation through all tiers), the total latency is the sum of all tiers: for example, if the tiers take roughly 0 ms, 1 s, and 3 s, a fully escalated request takes about 4 s. Ensure this worst-case latency is still acceptable for your application.
No Monitoring
Without monitoring, you can't know if your cascade is actually saving money or if your escalation thresholds need adjustment. Track: escalation rates per tier, quality scores at each tier, cost per request by tier, and latency distributions.
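A minimal in-memory sketch of those metrics, keyed by the tier / cost / latency_ms fields returned by the 3-tier cascade above; a real deployment would export these to your monitoring system instead:
from collections import defaultdict
class CascadeMetrics:
    """Track escalation share, cost, and latency per tier."""
    def __init__(self):
        self.requests = defaultdict(int)
        self.cost = defaultdict(float)
        self.latencies = defaultdict(list)
    def record(self, result: dict):
        tier = result["tier"]
        self.requests[tier] += 1
        self.cost[tier] += result.get("cost", 0.0)
        self.latencies[tier].append(result.get("latency_ms", 0))
    def report(self) -> dict:
        total = sum(self.requests.values()) or 1
        return {
            tier: {
                "share_of_traffic": round(count / total, 3),
                "total_cost": round(self.cost[tier], 4),
                "avg_latency_ms": round(sum(self.latencies[tier]) / count),
            }
            for tier, count in self.requests.items()
        }
# metrics = CascadeMetrics()
# metrics.record(cascade("What are your business hours?"))
# print(metrics.report())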