Incident Response & Recovery
Despite all guardrails, incidents will happen. When an AI agent causes infrastructure damage, a structured response is critical. This lesson provides the playbook for containing, recovering from, and learning from AI agent incidents.
Immediate Response: The First 5 Minutes
When you realize an AI agent has caused damage, speed matters. Follow these steps in order:
Stop the Agent Immediately
Kill the agent process. If you're using Claude Code, press Ctrl+C immediately; otherwise close the terminal or kill the process directly. Don't let it continue executing commands.
Revoke Agent Credentials
Immediately revoke or rotate the credentials the agent was using. This prevents further damage in case the agent session is somehow still active.
Assess What Happened
Check the agent's command history, terminal output, and audit logs to understand exactly what commands were executed.
Alert the Team
Notify the infrastructure team, on-call engineer, and your manager. Don't try to fix everything alone.
```bash
#!/bin/bash
# emergency-revoke.sh - Run immediately when agent incident detected

echo "🚨 EMERGENCY: Revoking agent credentials..."

# AWS: Deactivate the access key used by the agent
aws iam update-access-key \
  --user-name agent-user \
  --access-key-id AKIA... \
  --status Inactive

# AWS: Revoke all active sessions for the role via an explicit deny-all policy
aws iam put-role-policy \
  --role-name AgentRole \
  --policy-name EmergencyDeny \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*"
    }]
  }'

# Kubernetes: Delete the agent's service account token
kubectl delete secret agent-sa-token -n default

# Git: Revoke the agent's deploy key or PAT
gh api -X DELETE /repos/org/repo/keys/KEY_ID

echo "✅ Agent credentials revoked."
echo "   Review CloudTrail/audit logs for full damage assessment."
```
Assessment: Determining Blast Radius
Before attempting recovery, you need to understand the full scope of damage. Use audit logs to trace every action the agent took:
```bash
# Find all API calls made by the agent's credentials in the last hour
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA... \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[].{Time:EventTime,Name:EventName,Resource:Resources[0].ResourceName}' \
  --output table

# For more detail, search CloudTrail logs in S3
aws s3 ls s3://cloudtrail-bucket/AWSLogs/ --recursive \
  | grep $(date +%Y/%m/%d)

# Azure: Query Activity Log
az monitor activity-log list \
  --caller agent@company.com \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --query '[].{Time:eventTimestamp,Op:operationName.value,Status:status.value}'

# GCP: Query Audit Logs
gcloud logging read \
  'protoPayload.authenticationInfo.principalEmail="agent-sa@project.iam.gserviceaccount.com"' \
  --freshness=1h \
  --format='table(timestamp,protoPayload.methodName,protoPayload.resourceName)'
```
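Raw event lists can be long; a quick triage step is to count the destructive calls first. A minimal sketch, assuming the CloudTrail events have been exported to a local file (the file name `events.json` and its contents are illustrative):

```shell
# Sample export (in a real incident, redirect `aws cloudtrail lookup-events` here)
cat > events.json <<'EOF'
{"Events":[{"EventName":"DeleteBucket"},{"EventName":"DeleteBucket"},
{"EventName":"RunInstances"},{"EventName":"TerminateInstances"}]}
EOF

# Extract event names, keep destructive verbs, count by frequency
grep -o '"EventName": *"[^"]*"' events.json \
  | cut -d'"' -f4 \
  | grep -E '^(Delete|Terminate|Stop|Revoke|Modify)' \
  | sort | uniq -c | sort -rn
```

The frequency-sorted list tells you where to focus the restore effort before diving into the full event detail.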
Recovery from Backups and Snapshots
Once you understand the blast radius, begin recovery using your backup and snapshot strategy:
| Resource Type | Recovery Method | Typical RTO |
|---|---|---|
| RDS Database | Point-in-time restore from automated backups | 15-30 minutes |
| S3 Objects | Restore from versioning or cross-region replication | Minutes to hours |
| EC2 Instances | Launch from AMI backups or EBS snapshots | 5-15 minutes |
| DynamoDB Tables | Point-in-time recovery (PITR) or on-demand backup | Minutes |
| Kubernetes Resources | Re-apply from git (GitOps) or Velero backup | 5-10 minutes |
| Terraform State | Restore from state backup in S3 | Minutes |
| Git Repository | Recover from reflog, restore force-pushed commits | Minutes |
```bash
# Restore RDS from point-in-time
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier production-db \
  --target-db-instance-identifier production-db-restored \
  --restore-time "2026-03-20T10:30:00Z" \
  --db-instance-class db.r6g.large

# Restore S3 object from versioning
aws s3api list-object-versions \
  --bucket my-bucket \
  --prefix important-data/ \
  --query 'Versions[?IsLatest==`false`]'

# Recover deleted objects using delete markers
aws s3api delete-object \
  --bucket my-bucket \
  --key important-data/file.json \
  --version-id "DELETE_MARKER_VERSION_ID"

# Restore Kubernetes from Velero backup
velero restore create --from-backup daily-backup-20260320

# Git: Recover force-pushed branch
git reflog show origin/main
git push origin RECOVERED_SHA:main
```
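The git recovery path above can be rehearsed safely in a throwaway repository before you ever need it in anger. A minimal local-only sketch (in a real incident you would finish with `git push origin <recovered-sha>:main`):

```shell
# Rehearse reflog recovery in a throwaway repo
git init -q reflog-demo && cd reflog-demo
git config user.email drill@example.com && git config user.name drill

git commit -q --allow-empty -m "base"
git commit -q --allow-empty -m "important work"
good=$(git rev-parse HEAD)

git reset -q --hard HEAD~1          # simulate the agent discarding work

# The reflog still records the lost commit as the previous HEAD position
recovered=$(git rev-parse 'HEAD@{1}')
git reset -q --hard "$recovered"

[ "$(git rev-parse HEAD)" = "$good" ] && echo "recovered"
```

Running this drill once makes the `HEAD@{n}` syntax familiar, so the real recovery takes seconds instead of a frantic search through documentation.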
Post-Incident Review
After recovery, conduct a blameless post-incident review. The goal is to understand what happened and prevent recurrence:
```markdown
# AI Agent Incident Report

## Incident Summary
- **Date/Time:** 2026-03-20 10:30 UTC
- **Duration:** 45 minutes (detection to recovery)
- **Severity:** SEV-2
- **Agent:** Claude Code v1.x
- **Developer:** [name]

## What Happened
1. Developer asked agent to "clean up unused resources in staging"
2. Agent identified resources based on tags and last-activity timestamps
3. Agent ran `terraform destroy` targeting a module that included shared infrastructure used by both staging and production
4. Production API gateway was destroyed along with staging resources

## Impact
- Production API was down for 30 minutes
- ~2,000 API requests failed during the outage
- No data loss (databases were on separate infrastructure)

## Root Cause
- The Terraform module bundled staging and production resources together
- Agent had production-level credentials (not least-privilege)
- No guardrail script was in place to block `terraform destroy`

## What Went Well
- CloudTrail alert fired within 2 minutes
- RDS was unaffected due to deletion protection
- Recovery from Terraform state took only 15 minutes

## Action Items
- [ ] Separate staging and production into distinct Terraform modules
- [ ] Implement agent-specific IAM role with no delete permissions
- [ ] Add guardrail script to block `terraform destroy` commands
- [ ] Configure deletion protection on all production resources
- [ ] Add CLAUDE.md rules to forbid destroy commands
```
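One of the action items in a report like this — a guardrail script blocking `terraform destroy` — can be sketched as a thin wrapper installed earlier in the agent's PATH than the real binary. The wrapper file name and the `/usr/bin/terraform` location are assumptions for illustration:

```shell
# Write a wrapper that refuses the destroy subcommand
cat > terraform-wrapper.sh <<'EOF'
#!/bin/sh
# Block destroy; a fuller version should also catch `apply -destroy`
case "$1" in
  destroy)
    echo "BLOCKED: 'terraform destroy' requires a human operator." >&2
    exit 1
    ;;
esac
exec /usr/bin/terraform "$@"    # assumed location of the real binary
EOF
chmod +x terraform-wrapper.sh

# Demonstrate the blocked path (prints the BLOCKED message and exits 1)
./terraform-wrapper.sh destroy || echo "destroy was blocked"
```

Because the wrapper sits in front of the real binary, it needs no changes to the agent itself, and every other terraform subcommand passes through untouched.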
Building Runbooks for AI Agent Incidents
Pre-built runbooks ensure consistent, fast responses. Create runbooks for common agent incident scenarios:
| Runbook | Trigger | Key Steps |
|---|---|---|
| Agent Deleted Cloud Resources | CloudTrail alert for Delete* API calls | Revoke creds, assess blast radius, restore from backup/snapshot |
| Agent Force-Pushed to Main | Git webhook for force push events | Recover from reflog, notify team, revert |
| Agent Exposed Secrets | Secret scanning alert, commit with credentials | Rotate all exposed credentials, git filter-branch to remove |
| Agent Caused Cost Spike | AWS Budget alert threshold exceeded | Identify resources, terminate/scale down, review bill |
| Agent Modified Prod Config | Config drift detection alert | Revert config, restart services, verify health |
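For the "Agent Exposed Secrets" runbook, the first question is how deep the secret sits in history, since that determines how much rewriting `git filter-branch` must do. A sketch using `git log -S` in a throwaway repo (the token value is fake):

```shell
# Find every commit that added or removed an exposed token
git init -q secret-demo && cd secret-demo
git config user.email drill@example.com && git config user.name drill

echo 'token=AKIAFAKEFAKEFAKE' > config.env          # simulated leak
git add config.env && git commit -qm "add config"
echo 'token=ROTATED' > config.env
git add config.env && git commit -qm "rotate token"

# -S lists commits where the occurrence count of the string changed:
# here, both the commit that leaked it and the one that removed it
git log --all --oneline -S 'AKIAFAKEFAKEFAKE'
```

If the leaking commit is recent and unmerged, an interactive rebase may be enough; if it is buried under later merges, full history rewriting is required — and the credential must be rotated regardless.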
Communication Templates
Have pre-written templates ready for stakeholder communication during an incident:
```text
--- INITIAL NOTIFICATION (within 5 minutes) ---
Subject: [INCIDENT] Production impact from AI agent action

Team,

We've identified a production incident caused by an AI coding agent.

- **Impact:** [describe user-facing impact]
- **Status:** Investigating and containing
- **ETA for update:** 15 minutes

We have revoked the agent's credentials and are assessing blast radius.

--- STATUS UPDATE (every 15-30 minutes) ---
Subject: [INCIDENT UPDATE] [service] - Recovery in progress

- **Current status:** Recovery in progress
- **Actions taken:** [list actions]
- **ETA for resolution:** [time estimate]
- **Workarounds:** [if any]

--- RESOLUTION ---
Subject: [RESOLVED] Production incident from AI agent action

The incident has been resolved.

- **Duration:** [total time]
- **Root cause:** AI agent executed [command] which [effect]
- **Resolution:** Restored from [backup/snapshot/revert]
- **Follow-up:** Post-incident review scheduled for [date]
```
Lessons Learned from Real Incidents
Common patterns seen across AI agent incidents:
- **Ambiguous instructions are dangerous:** "Clean up old resources" is interpreted differently by every agent. Be specific: "Delete EC2 instances tagged env=test that have been stopped for 30+ days."
- **Agents don't understand blast radius:** An agent doesn't know that deleting "that one S3 bucket" means losing 5 TB of customer data. Always require explicit confirmation for delete operations.
- **Shared infrastructure is the biggest risk:** Resources shared between environments (staging/production) are the most common source of agent-caused outages.
- **Recovery time depends on preparation:** Teams with backups, snapshots, and deletion protection recovered in minutes; teams without took hours or days.
- **The cost of guardrails is always less than the cost of an outage:** Even a 30-minute production outage costs more than setting up proper agent safety.
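The first lesson — replace ambiguity with explicit criteria — also means the criteria can be encoded and reviewed before anything is deleted. A sketch over a hypothetical CSV export of instance metadata (column names and values are illustrative):

```shell
# Build a reviewable candidate list instead of letting the agent decide
# instances.csv: instance_id,env_tag,days_stopped (hypothetical export)
cat > instances.csv <<'EOF'
i-0aaa111,test,45
i-0bbb222,prod,90
i-0ccc333,test,10
EOF

# Only env=test instances stopped for 30+ days qualify for deletion
awk -F, '$2 == "test" && $3 >= 30 { print $1 }' instances.csv
```

Handing the agent a pre-filtered, human-reviewed list keeps the judgment call out of the agent's hands while still automating the cleanup itself.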
Lilly Tech Systems