Pattern: Guardrails and Safety

Category: Tool Use
Source: FOR-0012
Status: Documented

When to Use

Use this pattern when an agent operates in an environment where it could produce harmful, non-compliant, or off-brand outputs, or where adversarial inputs (prompt injection, data exfiltration attempts) are a concern. It is essential for any customer-facing digital talent or system that handles sensitive data.

How It Works

  • Input guardrails: Validate and sanitize incoming requests before processing
    • Check for injection attempts (prompt injection, SQL injection, etc.)
    • Classify input risk level (low, medium, high)
    • Block or flag high-risk inputs before they reach the core agent
  • Output guardrails: Validate agent outputs before delivery
    • Check against policies, ethical guidelines, brand safety rules, compliance requirements
    • Generate a risk score; block or flag outputs that exceed the risk threshold
    • Apply content filtering or redaction as needed
  • Tool restrictions: Limit which tools an agent can access based on context
    • Sandbox dangerous operations
    • Require additional confirmation for destructive actions
  • Logging: Record every guardrail activation for monitoring and tuning
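
The steps above can be sketched as a small pipeline. This is a minimal illustration, not a reference implementation: every name in it (check_input, check_output, authorize_tool, run_guarded), the regex patterns, the sensitive-term list, the 0.4 threshold, and the delete_record tool are assumptions made for the example. A production system would more likely use trained classifiers than keyword rules.

```python
import logging
import re
from dataclasses import dataclass
from typing import Callable, Set

logger = logging.getLogger("guardrails")

# All patterns, terms, thresholds, and tool names below are hypothetical.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
    re.compile(r";\s*drop\s+table", re.IGNORECASE),
]
SENSITIVE_TERMS = ("password", "account number")
RISK_THRESHOLD = 0.4
DESTRUCTIVE_TOOLS = {"delete_record"}
REFUSAL = "Sorry, I can't help with that request."


@dataclass
class Decision:
    allowed: bool
    reason: str


def _log_activation(stage: str, reason: str) -> None:
    # Record every guardrail activation for monitoring and tuning.
    logger.warning("guardrail activated stage=%s reason=%s", stage, reason)


def classify_input_risk(text: str) -> str:
    """Return 'low', 'medium', or 'high' for an incoming request."""
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return "high"
    if len(text) > 2000:  # unusually long inputs are flagged, not blocked
        return "medium"
    return "low"


def check_input(text: str) -> Decision:
    """Input guardrail: block high-risk requests, flag medium-risk ones."""
    risk = classify_input_risk(text)
    if risk == "high":
        _log_activation("input", "possible injection")
        return Decision(False, "high-risk input")
    if risk == "medium":
        _log_activation("input", "flagged for review")
    return Decision(True, f"{risk}-risk input")


def score_output_risk(text: str) -> float:
    """Fraction of sensitive terms found in the output (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(term in lowered for term in SENSITIVE_TERMS)
    return hits / len(SENSITIVE_TERMS)


def check_output(text: str) -> Decision:
    """Output guardrail: block drafts whose risk score exceeds the threshold."""
    risk = score_output_risk(text)
    if risk > RISK_THRESHOLD:
        _log_activation("output", f"risk score {risk:.2f}")
        return Decision(False, "output above risk threshold")
    return Decision(True, "output within risk threshold")


def authorize_tool(tool: str, allowed: Set[str], confirmed: bool = False) -> Decision:
    """Tool restriction: context allow-list plus confirmation for destructive tools."""
    if tool not in allowed:
        _log_activation("tool", f"{tool} not allowed in this context")
        return Decision(False, "tool not allowed")
    if tool in DESTRUCTIVE_TOOLS and not confirmed:
        _log_activation("tool", f"{tool} needs confirmation")
        return Decision(False, "confirmation required")
    return Decision(True, "tool allowed")


def run_guarded(user_input: str, agent: Callable[[str], str]) -> str:
    """Wrap the core agent with input and output guardrails."""
    if not check_input(user_input).allowed:
        return REFUSAL
    draft = agent(user_input)
    if not check_output(draft).allowed:
        return REFUSAL
    return draft
```

In this sketch the core agent is any callable that maps a request string to a draft reply, so run_guarded(request, agent) can wrap an existing agent without changing it. Keeping each check as a separate function that returns a Decision makes it easier to tune one guardrail from its logs without touching the others.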

Example

Consider a digital talent that handles client communications for an accounting firm. Input guardrails detect when a user tries to extract confidential data through social engineering. Output guardrails ensure the agent never provides specific tax advice (which requires a licensed professional); instead, the agent says "I recommend consulting with your accountant on this specific question" and escalates.
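
The output guardrail in this example might look roughly like the sketch below. The function name guard_tax_advice, the regex patterns, and the escalate callback are all hypothetical; matching on phrasing is only a stand-in for whatever policy classifier a real deployment would use.

```python
import re
from typing import Callable

# Hypothetical phrasings that suggest the draft gives specific tax advice.
TAX_ADVICE_PATTERNS = [
    re.compile(r"\byou (should|can|must) (deduct|claim|file|report)\b", re.IGNORECASE),
    re.compile(r"\byour tax (liability|bill) (is|will be)\b", re.IGNORECASE),
]
SAFE_REPLY = "I recommend consulting with your accountant on this specific question"


def guard_tax_advice(draft: str, escalate: Callable[[str], None]) -> str:
    """Replace drafts that look like specific tax advice and escalate them."""
    if any(p.search(draft) for p in TAX_ADVICE_PATTERNS):
        escalate(draft)  # hand the original draft to a human reviewer
        return SAFE_REPLY
    return draft


# Usage: collect escalations in a list (a stand-in for a real ticket queue).
escalated = []
reply = guard_tax_advice(
    "You should deduct your home office as a business expense.",
    escalated.append,
)
```

Passing the escalation step in as a callback keeps the guardrail independent of how the firm actually routes questions to a licensed professional.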

Tradeoffs

Pro | Con
Prevents harmful or non-compliant outputs | Adds processing latency to every interaction
Protects against adversarial attacks | Over-aggressive guardrails block legitimate requests
Builds trust for enterprise and regulated use cases | Guardrail rules need ongoing maintenance and tuning
Creates compliance audit trail | False positives frustrate users

Factory Usage

  • Agent boundary enforcement: Each agent.md defines explicit "should NOT activate when" rules — a form of input guardrail that prevents scope creep.
  • Role Factory verification checklist: The deploy stage checks for naming conflicts, missing files, and quality scores — output guardrails before committing.
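
As a rough illustration of the deploy-stage checklist, the sketch below collects problems before a role is committed. The source does not describe how the Role Factory implements these checks, so the function name verify_before_deploy, the required-file set, and the 0.8 quality threshold are assumptions made purely for the example.

```python
from typing import List, Set

# Hypothetical values: the source names agent.md but no other files or thresholds.
REQUIRED_FILES = {"agent.md"}
MIN_QUALITY_SCORE = 0.8


def verify_before_deploy(
    name: str,
    existing_names: Set[str],
    files: Set[str],
    quality_score: float,
) -> List[str]:
    """Return a list of problems; an empty list means the role may be committed."""
    problems = []
    if name in existing_names:
        problems.append(f"naming conflict: {name}")
    missing = sorted(REQUIRED_FILES - files)
    if missing:
        problems.append("missing files: " + ", ".join(missing))
    if quality_score < MIN_QUALITY_SCORE:
        problems.append(f"quality score {quality_score} below {MIN_QUALITY_SCORE}")
    return problems
```

Returning every problem at once, rather than stopping at the first failure, lets the author of a role fix all issues in a single pass.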