Pattern: Fallback and Recovery
Category: Tool Use | Source: FOR-0012 | Status: Documented
When to Use
Use this pattern when operations can fail and the system needs to degrade gracefully rather than crash. It is essential for production systems where reliability matters: tool calls may fail, APIs may be down, and LLM responses may be malformed. The agent needs backup plans.
How It Works
- Attempt the primary operation (tool call, API request, generation)
- If it fails, analyze the failure type (timeout, bad input, service down, malformed response)
- Apply the appropriate fallback strategy:
  - Retry: Same operation, possibly with adjusted parameters
  - Alternative tool: Use a different tool that achieves the same goal
  - Simpler method: Fall back to a less sophisticated but more reliable approach
  - Cached/default response: Use saved data or default answers
  - Human escalation: Alert a human if no automated fallback works
- Log the failure and recovery for later analysis
- Continue processing with the fallback result
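The steps above can be sketched as a small helper. This is a minimal illustration, not a reference implementation: the name `run_with_fallbacks`, its parameters, and the choice to treat only `TimeoutError` as retryable are all assumptions made for the example.

```python
import logging
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
log = logging.getLogger("fallback")


def run_with_fallbacks(
    primary: Callable[[], T],
    fallbacks: Sequence[Callable[[], T]] = (),
    retries: int = 1,
    default: Optional[T] = None,
    escalate: Optional[Callable[[Exception], None]] = None,
) -> Optional[T]:
    """Hypothetical helper: primary, then fallbacks, then default, then a human."""
    last_error: Optional[Exception] = None

    # Attempt the primary operation: one initial try plus `retries` retries.
    for attempt in range(retries + 1):
        try:
            return primary()
        except TimeoutError as exc:
            # Assumed to be transient, so the loop retries.
            last_error = exc
            log.warning("primary timed out (attempt %d)", attempt + 1)
        except Exception as exc:
            # Assumed to be non-transient, so retrying is skipped.
            last_error = exc
            log.warning("primary failed: %s", exc)
            break

    # Alternative tools or simpler methods, tried in order.
    for fallback in fallbacks:
        name = getattr(fallback, "__name__", "fallback")
        try:
            result = fallback()
            log.info("recovered via %s", name)
            return result
        except Exception as exc:
            last_error = exc
            log.warning("%s failed: %s", name, exc)

    # Cached or default response, if one was supplied.
    if default is not None:
        log.info("recovered via default response")
        return default

    # Nothing automated worked: hand the last error to a human.
    if escalate is not None and last_error is not None:
        escalate(last_error)
    return None
```

Passing the fallbacks as an ordered sequence keeps each strategy independently testable, and every failure and recovery goes through the logger so it can be analyzed later.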
Example
Consider a digital talent that pulls real-time pricing data from an API. If the API times out, it retries once. If the retry also fails, it falls back to cached pricing from the last successful fetch. If no cache exists, it flags the report as "pricing data unavailable: manual update required" and escalates to the human operator.
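The pricing scenario might look roughly like the sketch below. The function `get_pricing`, the `fetch` and `notify_operator` callables, the dict-based cache, and the shape of the returned report are all hypothetical stand-ins; the real pricing API and escalation channel are not specified here.

```python
from typing import Any, Callable, Dict

# Flag text paraphrased from the example above.
UNAVAILABLE_FLAG = "pricing data unavailable: manual update required"


def get_pricing(
    fetch: Callable[[], Dict[str, float]],
    cache: Dict[str, Any],
    notify_operator: Callable[[str], None],
    retries: int = 1,
) -> Dict[str, Any]:
    """Hypothetical sketch: live fetch, one retry, cache, then escalation."""
    # One initial attempt plus `retries` retries on timeout.
    # Other exception types are not handled in this sketch and propagate.
    for _ in range(retries + 1):
        try:
            prices = fetch()
            cache["prices"] = prices  # remember the last successful fetch
            return {"prices": prices, "source": "live", "flag": None}
        except TimeoutError:
            continue

    # Fall back to the last successful fetch, if there was one.
    if "prices" in cache:
        return {"prices": cache["prices"], "source": "cache", "flag": None}

    # No cache: flag the report and escalate to the human operator.
    notify_operator(UNAVAILABLE_FLAG)
    return {"prices": None, "source": "none", "flag": UNAVAILABLE_FLAG}
```

Returning the `source` alongside the prices lets downstream steps tell live data from cached data, which matters because a fallback result may be lower quality than the primary one.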
Tradeoffs
| Pro | Con |
|---|---|
| System stays operational despite failures | Each fallback layer adds complexity |
| Builds user trust through reliability | Fallback responses may be lower quality |
| Failures are logged for systematic improvement | Over-engineering fallbacks for rare failures wastes effort |
| Graceful degradation over hard crashes | Must test each fallback path independently |
Factory Usage
- Role Factory auto-improve: If a modification does not improve the score, the change is reverted (fallback to previous version).
- Agent trigger system: Non-trigger phrases are redirected to the correct agent rather than failing silently, which is a form of routing fallback.
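The revert-on-no-improvement behavior in the first item can be illustrated with a few lines. This is only a sketch of the idea: `improve_or_revert`, `modify`, and `score` are hypothetical names, and the Role Factory's actual implementation is not described here.

```python
from typing import Callable, TypeVar

V = TypeVar("V")


def improve_or_revert(
    current: V,
    modify: Callable[[V], V],
    score: Callable[[V], float],
) -> V:
    """Keep a modification only if it scores strictly higher than the original."""
    candidate = modify(current)
    if score(candidate) > score(current):
        return candidate
    # No improvement: fall back to the previous version.
    return current
```

The previous version acts as the always-available fallback, so a failed modification never leaves the system worse off than before.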