Intermediate

Defense Strategies

No single defense can completely prevent prompt injection. Effective security requires a layered approach combining input validation, output filtering, architectural decisions, and continuous monitoring.

Defense in Depth

The most effective protection against prompt injection is defense in depth — multiple independent layers of security, each catching attacks that slip through previous layers.

Layer 1: Input Sanitization

Clean and validate user input before it reaches the model. Strip suspicious patterns, normalize encodings, and detect known attack signatures.
Layer 2: Prompt Hardening

Design system prompts that are resistant to override attempts. Use delimiters, instruction repetition, and explicit boundary markers.
Layer 3: Privilege Separation

Limit what the model can do. Restrict tool access, implement least-privilege principles, and require human approval for sensitive actions.
Layer 4: Output Filtering

Validate model outputs before they reach the user. Check for policy violations, data leaks, and unexpected behaviors.
Layer 5: Monitoring and Detection

Continuously monitor for anomalous patterns that indicate active attacks, including unusual request patterns and output anomalies.

Input Sanitization

Python - Input Sanitization Pipeline

import re

class InputSanitizer:
    SUSPICIOUS_PATTERNS = [
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"system\s*prompt",
        r"you\s+are\s+now",
        r"new\s+instructions?",
        r"---\s*end\s*(of)?\s*(system)?",
        r"pretend\s+(you\s+are|to\s+be)",
    ]

    def sanitize(self, text: str) -> tuple[str, float]:
        risk_score = 0.0

        # Check for known injection patterns
        for pattern in self.SUSPICIOUS_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                risk_score += 0.3

        # Check for encoding obfuscation
        if self.has_base64_content(text):
            risk_score += 0.2

        # Check for invisible characters
        text = self.strip_invisible_chars(text)

        return text, min(risk_score, 1.0)

Prompt Hardening Techniques

Technique	Description	Effectiveness
XML/Delimiter Wrapping	Wrap user input in clear delimiters: <user_input>...</user_input>	Medium
Instruction Repetition	Repeat critical instructions both before and after user input	Medium
Canary Tokens	Include secret strings in the system prompt; if they appear in output, injection detected	Medium-High
Dual LLM Pattern	Use a separate model to evaluate whether user input is an injection attempt	High
Output Constraining	Require structured output (JSON, specific format) that limits what the model can express	Medium

Output Filtering

Even with strong input guards, the model may still produce problematic outputs. Output filtering provides a final checkpoint:

Content classifiers: Run model output through a safety classifier before returning to the user
Regex validation: Check for sensitive patterns like URLs, API keys, or internal system details
Schema enforcement: If the expected output has a defined structure, validate against it
Semantic analysis: Use a separate model to verify the response is relevant and appropriate
PII detection: Scan outputs for personal information and redact before delivery

Architectural Defenses

Least Privilege

Only give the model access to the minimum set of tools and data it needs. Never grant database write access if read-only suffices.

Human-in-the-Loop

Require human approval for high-stakes actions like sending emails, making purchases, or modifying data.

Data Isolation

Process retrieved data separately from user input. Sanitize external content before including it in the model's context.

Rate Limiting

Limit request frequency and token usage per user to prevent brute-force injection attempts and abuse.

✅

Key Principle: Assume injection will eventually succeed. Design your system so that even a successful injection causes minimal damage. Limit blast radius through privilege separation, monitoring, and fail-safe defaults.

← Previous Attack Types Next → Testing

Defense Strategies

Defense in Depth

Layer 1: Input Sanitization

Layer 2: Prompt Hardening

Layer 3: Privilege Separation

Layer 4: Output Filtering

Layer 5: Monitoring and Detection

Input Sanitization

Prompt Hardening Techniques

Output Filtering

Architectural Defenses

Least Privilege

Human-in-the-Loop

Data Isolation

Rate Limiting