Pattern: Guardrails and Safety

Category: Tool Use
Source: FOR-0012
Status: Documented

When to Use

Use this pattern when an agent operates in an environment where it could produce harmful, non-compliant, or off-brand outputs, or where adversarial inputs (prompt injection, data exfiltration attempts) are a concern. It is essential for any customer-facing digital talent and for any system handling sensitive data.

How It Works

Input guardrails: Validate and sanitize incoming requests before processing
- Check for injection attempts (prompt injection, SQL injection, etc.)
- Classify input risk level (low, medium, high)
- Block or flag high-risk inputs before they reach the core agent
Output guardrails: Validate agent outputs before delivery
- Check against policies, ethical guidelines, brand safety rules, compliance requirements
- Generate a risk score; block or flag outputs whose score exceeds the risk threshold
- Apply content filtering or redaction as needed
Tool restrictions: Limit which tools an agent can access based on context
- Sandbox dangerous operations
- Require additional confirmation for destructive actions
Logging: Record every guardrail activation for monitoring and tuning
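
The input-guardrail, tool-restriction, and logging steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the regex patterns, the context names (support_chat, admin_console), the tool names, and the function names are all hypothetical, and a production system would typically pair rules like these with a trained classifier.

```python
import logging
import re
from enum import Enum

logger = logging.getLogger("guardrails")


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical patterns, for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
    re.compile(r";\s*drop\s+table", re.I),
]
SUSPICIOUS_PATTERNS = [re.compile(r"\b(password|api key|ssn)\b", re.I)]

# Hypothetical tool policy: the tools each context may call, and the
# destructive tools that need explicit confirmation.
ALLOWED_TOOLS = {
    "support_chat": {"search_kb", "create_ticket"},
    "admin_console": {"search_kb", "create_ticket", "delete_record"},
}
NEEDS_CONFIRMATION = {"delete_record"}


def classify_input(text: str) -> Risk:
    """Classify an incoming request as low, medium, or high risk."""
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return Risk.HIGH
    if any(p.search(text) for p in SUSPICIOUS_PATTERNS):
        return Risk.MEDIUM
    return Risk.LOW


def check_input(text: str) -> bool:
    """Return True if the input may proceed to the core agent."""
    risk = classify_input(text)
    if risk is not Risk.LOW:
        # Every activation is logged so thresholds can be tuned later.
        logger.warning("input guardrail activated: risk=%s", risk.name)
    return risk is not Risk.HIGH


def check_tool(context: str, tool: str, confirmed: bool = False) -> bool:
    """Return True if the tool call is permitted in this context."""
    if tool not in ALLOWED_TOOLS.get(context, set()):
        logger.warning("tool guardrail: %s not allowed in %s", tool, context)
        return False
    if tool in NEEDS_CONFIRMATION and not confirmed:
        logger.warning("tool guardrail: %s requires confirmation", tool)
        return False
    return True
```

In this sketch, medium-risk inputs are flagged in the log but still reach the agent, while high-risk inputs are blocked outright. Where to draw that line is a tuning decision that depends on how costly false positives are for the deployment.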

Example

Consider a digital talent handling client communications for an accounting firm. Input guardrails detect when a user tries to extract confidential data through social engineering. Output guardrails ensure the agent never provides specific tax advice (which requires a licensed professional); instead, it replies "I recommend consulting with your accountant on this specific question" and escalates.
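
A minimal sketch of the output guardrail in this example might look like the following. The signal phrases, their weights, the threshold value, and the names guard_output and GuardedReply are assumptions made for illustration; a real deployment would more likely rely on a policy classifier than on a handful of regular expressions.

```python
import re
from dataclasses import dataclass

ESCALATION_MESSAGE = (
    "I recommend consulting with your accountant on this specific question."
)

# Hypothetical signals that a draft reply contains specific tax advice,
# each with an assumed weight.
TAX_ADVICE_SIGNALS = {
    r"\byou (should|can|could) (deduct|claim|write off)\b": 0.6,
    r"\byou (owe|will owe)\b": 0.5,
}
RISK_THRESHOLD = 0.5  # assumed value; tune against logged activations


@dataclass
class GuardedReply:
    text: str       # what is actually sent to the client
    escalate: bool  # whether a human accountant should be looped in
    risk: float     # risk score of the original draft, 0.0 to 1.0


def risk_score(draft: str) -> float:
    """Sum the weights of all matching signals, capped at 1.0."""
    total = sum(
        (w for pat, w in TAX_ADVICE_SIGNALS.items()
         if re.search(pat, draft, re.I)),
        0.0,
    )
    return min(total, 1.0)


def guard_output(draft: str) -> GuardedReply:
    """Replace drafts that look like tax advice with the escalation message."""
    risk = risk_score(draft)
    if risk >= RISK_THRESHOLD:
        return GuardedReply(ESCALATION_MESSAGE, escalate=True, risk=risk)
    return GuardedReply(draft, escalate=False, risk=risk)
```

For instance, a draft such as "You should deduct your home office expenses." would be replaced by the escalation message and flagged for handoff, while a reply about office hours would pass through unchanged.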

Tradeoffs

Pro	Con
Prevents harmful or non-compliant outputs	Adds processing latency to every interaction
Protects against adversarial attacks	Over-aggressive guardrails block legitimate requests
Builds trust for enterprise and regulated use cases	Guardrail rules need ongoing maintenance and tuning
Creates compliance audit trail	False positives frustrate users

Factory Usage

Agent boundary enforcement: Each agent.md defines explicit "should NOT activate when" rules — a form of input guardrail that prevents scope creep.
Role Factory verification checklist: The deploy stage checks for naming conflicts and missing files and verifies quality scores; these checks act as output guardrails before committing.