Defense Strategies
No single defense can completely prevent prompt injection. Effective security requires a layered approach combining input validation, output filtering, architectural decisions, and continuous monitoring.
Defense in Depth
The most effective protection against prompt injection is defense in depth — multiple independent layers of security, each catching attacks that slip through previous layers.
-
Layer 1: Input Sanitization
Clean and validate user input before it reaches the model. Strip suspicious patterns, normalize encodings, and detect known attack signatures.
-
Layer 2: Prompt Hardening
Design system prompts that are resistant to override attempts. Use delimiters, instruction repetition, and explicit boundary markers.
-
Layer 3: Privilege Separation
Limit what the model can do. Restrict tool access, implement least-privilege principles, and require human approval for sensitive actions.
-
Layer 4: Output Filtering
Validate model outputs before they reach the user. Check for policy violations, data leaks, and unexpected behaviors.
-
Layer 5: Monitoring and Detection
Continuously monitor for anomalous patterns that indicate active attacks, including unusual request patterns and output anomalies.
Input Sanitization
import re class InputSanitizer: SUSPICIOUS_PATTERNS = [ r"ignore\s+(all\s+)?previous\s+instructions", r"system\s*prompt", r"you\s+are\s+now", r"new\s+instructions?", r"---\s*end\s*(of)?\s*(system)?", r"pretend\s+(you\s+are|to\s+be)", ] def sanitize(self, text: str) -> tuple[str, float]: risk_score = 0.0 # Check for known injection patterns for pattern in self.SUSPICIOUS_PATTERNS: if re.search(pattern, text, re.IGNORECASE): risk_score += 0.3 # Check for encoding obfuscation if self.has_base64_content(text): risk_score += 0.2 # Check for invisible characters text = self.strip_invisible_chars(text) return text, min(risk_score, 1.0)
Prompt Hardening Techniques
| Technique | Description | Effectiveness |
|---|---|---|
| XML/Delimiter Wrapping | Wrap user input in clear delimiters: <user_input>...</user_input> | Medium |
| Instruction Repetition | Repeat critical instructions both before and after user input | Medium |
| Canary Tokens | Include secret strings in the system prompt; if they appear in output, injection detected | Medium-High |
| Dual LLM Pattern | Use a separate model to evaluate whether user input is an injection attempt | High |
| Output Constraining | Require structured output (JSON, specific format) that limits what the model can express | Medium |
Output Filtering
Even with strong input guards, the model may still produce problematic outputs. Output filtering provides a final checkpoint:
- Content classifiers: Run model output through a safety classifier before returning to the user
- Regex validation: Check for sensitive patterns like URLs, API keys, or internal system details
- Schema enforcement: If the expected output has a defined structure, validate against it
- Semantic analysis: Use a separate model to verify the response is relevant and appropriate
- PII detection: Scan outputs for personal information and redact before delivery
Architectural Defenses
Least Privilege
Only give the model access to the minimum set of tools and data it needs. Never grant database write access if read-only suffices.
Human-in-the-Loop
Require human approval for high-stakes actions like sending emails, making purchases, or modifying data.
Data Isolation
Process retrieved data separately from user input. Sanitize external content before including it in the model's context.
Rate Limiting
Limit request frequency and token usage per user to prevent brute-force injection attempts and abuse.
Lilly Tech Systems