Intermediate

Defense Strategies

No single defense can completely prevent prompt injection. Effective security requires a layered approach combining input validation, output filtering, architectural decisions, and continuous monitoring.

Defense in Depth

The most effective protection against prompt injection is defense in depth — multiple independent layers of security, each catching attacks that slip through previous layers.

  1. Layer 1: Input Sanitization

    Clean and validate user input before it reaches the model. Strip suspicious patterns, normalize encodings, and detect known attack signatures.

  2. Layer 2: Prompt Hardening

    Design system prompts that are resistant to override attempts. Use delimiters, instruction repetition, and explicit boundary markers.

  3. Layer 3: Privilege Separation

    Limit what the model can do. Restrict tool access, implement least-privilege principles, and require human approval for sensitive actions.

  4. Layer 4: Output Filtering

    Validate model outputs before they reach the user. Check for policy violations, data leaks, and unexpected behaviors.

  5. Layer 5: Monitoring and Detection

    Continuously monitor for anomalous patterns that indicate active attacks, including unusual request patterns and output anomalies.

Input Sanitization

Python - Input Sanitization Pipeline
import re

class InputSanitizer:
    SUSPICIOUS_PATTERNS = [
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"system\s*prompt",
        r"you\s+are\s+now",
        r"new\s+instructions?",
        r"---\s*end\s*(of)?\s*(system)?",
        r"pretend\s+(you\s+are|to\s+be)",
    ]

    def sanitize(self, text: str) -> tuple[str, float]:
        risk_score = 0.0

        # Check for known injection patterns
        for pattern in self.SUSPICIOUS_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                risk_score += 0.3

        # Check for encoding obfuscation
        if self.has_base64_content(text):
            risk_score += 0.2

        # Check for invisible characters
        text = self.strip_invisible_chars(text)

        return text, min(risk_score, 1.0)

Prompt Hardening Techniques

Technique Description Effectiveness
XML/Delimiter Wrapping Wrap user input in clear delimiters: <user_input>...</user_input> Medium
Instruction Repetition Repeat critical instructions both before and after user input Medium
Canary Tokens Include secret strings in the system prompt; if they appear in output, injection detected Medium-High
Dual LLM Pattern Use a separate model to evaluate whether user input is an injection attempt High
Output Constraining Require structured output (JSON, specific format) that limits what the model can express Medium

Output Filtering

Even with strong input guards, the model may still produce problematic outputs. Output filtering provides a final checkpoint:

  • Content classifiers: Run model output through a safety classifier before returning to the user
  • Regex validation: Check for sensitive patterns like URLs, API keys, or internal system details
  • Schema enforcement: If the expected output has a defined structure, validate against it
  • Semantic analysis: Use a separate model to verify the response is relevant and appropriate
  • PII detection: Scan outputs for personal information and redact before delivery

Architectural Defenses

Least Privilege

Only give the model access to the minimum set of tools and data it needs. Never grant database write access if read-only suffices.

Human-in-the-Loop

Require human approval for high-stakes actions like sending emails, making purchases, or modifying data.

Data Isolation

Process retrieved data separately from user input. Sanitize external content before including it in the model's context.

Rate Limiting

Limit request frequency and token usage per user to prevent brute-force injection attempts and abuse.

Key Principle: Assume injection will eventually succeed. Design your system so that even a successful injection causes minimal damage. Limit blast radius through privilege separation, monitoring, and fail-safe defaults.