Best Practices & Safety Checklist
This final lesson brings everything together into an actionable checklist, team policies, tool-specific configuration guides, and a maturity model for measuring your organization's AI agent safety posture.
Complete AI Agent Safety Checklist
Use this checklist before allowing AI agents to interact with any infrastructure:
Before Every Agent Session
- Verify the agent is using scoped, task-specific credentials (not personal admin keys)
- Confirm the agent is pointed at the correct environment (dev/staging, not production)
- Review CLAUDE.md or equivalent config file for safety rules
- Ensure guardrail scripts are active and in the `$PATH`
- Verify you have recent backups of any resources the agent might touch
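The pre-session checks above can be scripted so they run before every agent session. A minimal sketch in bash, assuming a staging profile named `agent-readonly` and a `staging-cluster` kubectl context (both hypothetical names; `KUBE_CONTEXT_OVERRIDE` exists only so the check can run where no cluster is reachable):

```shell
#!/usr/bin/env bash
# Pre-flight check before starting an agent session.
# The expected names below are examples; substitute your own.
preflight() {
  local expected_profile="agent-readonly"
  local expected_context="staging-cluster"

  if [ "${AWS_PROFILE:-}" != "$expected_profile" ]; then
    echo "FAIL: AWS_PROFILE is '${AWS_PROFILE:-unset}', expected '$expected_profile'" >&2
    return 1
  fi

  # Normally read from `kubectl config current-context`; the override
  # lets the check run where kubectl is unavailable.
  local context="${KUBE_CONTEXT_OVERRIDE:-$(kubectl config current-context 2>/dev/null)}"
  if [ "$context" != "$expected_context" ]; then
    echo "FAIL: kubectl context is '$context', expected '$expected_context'" >&2
    return 1
  fi

  echo "OK: pre-flight checks passed"
}
```

Call `preflight` at the top of any session script and abort on a nonzero exit.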
Infrastructure Safety
- Deletion protection enabled on all production databases (RDS, DynamoDB, Cloud SQL)
- S3/GCS bucket versioning enabled on all production buckets
- Azure resource locks applied to critical resource groups
- Terraform `prevent_destroy` lifecycle rule on critical resources
- SCPs or organization policies blocking destructive actions from agent accounts
- Budget alerts configured with hard caps on agent accounts
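Most of these protections are one-line CLI calls. A sketch using standard AWS and Azure CLI commands (all resource names are placeholders):

```shell
# Enable deletion protection on a production RDS instance
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --deletion-protection

# Enable deletion protection on a DynamoDB table
aws dynamodb update-table \
  --table-name prod-table \
  --deletion-protection-enabled

# Turn on versioning for a production S3 bucket
aws s3api put-bucket-versioning \
  --bucket prod-bucket \
  --versioning-configuration Status=Enabled

# Apply an Azure resource lock to a critical resource group
az lock create \
  --name prod-lock \
  --resource-group prod-rg \
  --lock-type CanNotDelete
```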
Code and Process Safety
- Branch protection rules enforced on main/production branches
- CODEOWNERS file requiring infrastructure team review for IaC changes
- CI/CD pipeline runs plan and posts output on every PR
- Destructive plan changes block PR merge automatically
- Production apply only happens through CI/CD pipeline, never directly
- Agent incident runbooks documented and accessible
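The "destructive plan changes block merge" item can be enforced with a small CI step. A sketch that parses the `terraform plan` summary line from stdin (the wording check is an assumption, not a standard Terraform interface; `terraform plan -detailed-exitcode` is an alternative if you only need any-change detection):

```shell
# CI gate: fail the job if the terraform plan summary reports destroys.
# Reads plan output on stdin, e.g.:  terraform plan -no-color | check_plan
check_plan() {
  local destroys
  destroys=$(grep -oE '[0-9]+ to destroy' | grep -oE '[0-9]+' || echo 0)
  if [ "${destroys:-0}" -gt 0 ]; then
    echo "BLOCKED: plan destroys ${destroys} resource(s); manual approval required" >&2
    return 1
  fi
  echo "OK: no destructive changes"
}
```

A nonzero exit from the gate fails the pipeline job, which in turn blocks merge when the job is a required status check.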
Team Policies for AI Agent Usage
Establish clear organizational policies for how AI agents are used:
```markdown
# AI Agent Usage Policy v1.0

## Scope
This policy applies to all team members using AI coding agents (Claude Code,
GitHub Copilot, Cursor, Windsurf, Codex, Aider) in any project that interacts
with cloud infrastructure.

## Required Practices
1. **Credentials:** Use dedicated agent IAM roles with read-only access by
   default. Write access requires team lead approval.
2. **Environment:** Agents must target dev/staging environments. Production
   changes go through the GitOps workflow only.
3. **CLAUDE.md:** All projects must have a CLAUDE.md with safety rules before
   agents are used for infrastructure tasks.
4. **Review:** All agent-generated infrastructure changes must be reviewed by
   a human before applying.
5. **Logging:** All agent sessions involving infrastructure must be logged
   (terminal recording or audit log).

## Prohibited Actions
- Running `terraform apply` or `terraform destroy` directly
- Giving agents production database credentials
- Using agents with personal admin AWS/Azure/GCP credentials
- Allowing agents to push directly to main/production branches
- Running agents in full-auto mode for infrastructure tasks

## Incident Reporting
Any unintended infrastructure change caused by an AI agent must be reported
to the infrastructure team within 15 minutes.
```
Training Recommendations
Ensure your team is prepared for safe AI agent usage:
| Audience | Training Focus | Frequency |
|---|---|---|
| All Developers | Agent permission models, CLAUDE.md best practices, when to use dry-run | At onboarding + quarterly |
| Infrastructure Team | Guardrail scripts, OPA policies, CI/CD safety patterns, incident response | Monthly |
| Team Leads | Agent usage policies, risk assessment, credential management | Quarterly |
| On-Call Engineers | Agent incident runbooks, recovery procedures, communication templates | At rotation start |
Tool-Specific Configuration Guides
Claude Code Safety Configuration
```markdown
# Project Safety Rules

## Environment
- This is a STAGING environment. Never target production.
- AWS Account: 123456789012 (staging only)
- K8s Context: staging-cluster (verify with `kubectl config current-context`)

## Forbidden Commands
- NEVER: terraform destroy, terraform apply -auto-approve
- NEVER: kubectl delete namespace, kubectl delete --all
- NEVER: git push --force, git push -f
- NEVER: rm -rf /, rm -rf *, rm -rf ~
- NEVER: DROP DATABASE, DROP TABLE, TRUNCATE
- NEVER: aws * delete-*, aws s3 rb, aws ec2 terminate-instances

## Required Patterns
- ALWAYS: terraform plan -out=plan.tfplan BEFORE terraform apply plan.tfplan
- ALWAYS: kubectl diff BEFORE kubectl apply
- ALWAYS: git checkout -b feature/* for changes (never commit to main)
- ALWAYS: Create a backup before modifying any database

## Guardrails
- Shell wrappers are in ./bin/ - add to PATH before running commands
- Pre-commit hooks are installed - do not bypass with --no-verify
```
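The "shell wrappers are in ./bin/" rule can be implemented with a small interception script. A minimal sketch for terraform (the blocked-argument list is illustrative, and `REAL_TF` is an assumed variable name pointing at the real binary):

```shell
#!/usr/bin/env bash
# ./bin/terraform guardrail wrapper (sketch): refuse destructive
# invocations, otherwise delegate to the real binary.
REAL_TF="${REAL_TF:-/usr/local/bin/terraform}"

guard_terraform() {
  for arg in "$@"; do
    case "$arg" in
      destroy|-auto-approve|--auto-approve)
        echo "BLOCKED by guardrail: terraform $*" >&2
        return 1
        ;;
    esac
  done
  "$REAL_TF" "$@"
}

# As an installed wrapper this file would end with: guard_terraform "$@"
```

Because the wrapper sits earlier in `$PATH`, the agent's `terraform destroy` hits the guard instead of the real binary.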
Cursor Configuration
```json
{
  "ai.agent.confirmTerminalCommands": true,
  "ai.agent.maxSteps": 25,
  "ai.agent.blockedCommands": [
    "terraform destroy",
    "terraform apply -auto-approve",
    "kubectl delete namespace",
    "git push --force",
    "rm -rf /"
  ],
  "terminal.integrated.env.linux": {
    "AWS_PROFILE": "agent-readonly",
    "KUBECONFIG": "~/.kube/staging-config"
  }
}
```
The Agent Safety Maturity Model
Assess and improve your organization's AI agent safety posture using this 5-level maturity model:
| Level | Name | Description | Key Characteristics |
|---|---|---|---|
| 1 | Ad Hoc | No formal agent safety practices | Agents use personal admin credentials; no guardrails; no policies |
| 2 | Aware | Basic awareness of risks | CLAUDE.md with basic rules; manual review of agent actions; verbal policies |
| 3 | Defined | Formal processes in place | Written policies; scoped credentials; dry-run enforcement; pre-commit hooks |
| 4 | Managed | Automated enforcement | Guardrail scripts; OPA policies; GitOps workflow; deletion protection; incident runbooks |
| 5 | Optimized | Continuous improvement | Automated policy testing; agent behavior monitoring; regular drills; feedback loops |
Future of AI Agent Safety
The field is rapidly evolving. Key trends to watch:
- Built-in agent sandboxing: Agent vendors are building sandboxing directly into their tools (e.g., Codex's network-disabled sandbox mode)
- Agent identity and auditing: Cloud providers will likely introduce agent-specific IAM identities that distinguish agent actions from human actions in audit logs
- Policy-as-code for agents: Expect more frameworks like OPA but specifically designed for AI agent governance
- Agent safety testing: Red-teaming and adversarial testing specifically for agent safety guardrails
- Multi-agent safety: As multi-agent systems become common, safety patterns for agent-to-agent interactions will emerge
- Regulatory frameworks: Governments will likely introduce regulations around autonomous AI systems interacting with critical infrastructure
Frequently Asked Questions
Should I completely ban AI agents from touching infrastructure?
No. AI agents can be tremendously productive for infrastructure work when properly guarded. The goal is not to prevent agent use but to make it safe. With proper guardrails (sandboxed environments, GitOps workflows, scoped credentials, and guardrail scripts), agents can safely write Terraform, create Kubernetes manifests, and debug infrastructure issues.
What's the minimum safety setup I need before using an agent with cloud resources?
At minimum: (1) Use scoped, read-only credentials for the agent, (2) Add a CLAUDE.md or equivalent with forbidden commands, (3) Enable deletion protection on all production databases and storage, (4) Never let the agent directly apply changes to production — use GitOps. This takes about 30 minutes to set up and prevents the most common agent-caused incidents.
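Step (1), a dedicated read-only identity, is the highest-leverage piece of that setup. A sketch using AWS CLI calls (role name, trust policy path, and account ID are placeholders):

```shell
# Create a dedicated agent role and attach AWS's managed read-only policy.
aws iam create-role \
  --role-name agent-readonly \
  --assume-role-policy-document file://trust-policy.json

aws iam attach-role-policy \
  --role-name agent-readonly \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

# Point the agent's shell at that role via a named profile.
aws configure set profile.agent-readonly.role_arn \
  arn:aws:iam::123456789012:role/agent-readonly
aws configure set profile.agent-readonly.source_profile default
export AWS_PROFILE=agent-readonly
```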
How do I convince my team to adopt agent safety practices?
Frame it in terms of risk and cost. A single agent-caused production outage can cost thousands in downtime, hours in recovery time, and significant reputation damage. The safety setup takes 1-2 days and provides ongoing protection. Share the incident examples from this course. Most teams adopt safety practices quickly after seeing what can go wrong.
Do guardrail scripts slow down development?
Minimally. Read operations pass through instantly. Write operations add a brief plan/preview step that takes seconds. The time investment is negligible compared to the hours or days of recovery from an unguarded agent incident. Most developers report that the preview step actually increases their confidence, making them more willing to use agents for complex tasks.
Can I use the same safety practices across all AI agents?
Yes, with tool-specific adaptations. The core principles (least privilege, dry-run, sandboxing, GitOps) apply universally. However, each agent has its own configuration format. CLAUDE.md works for Claude Code, .cursorrules for Cursor, and general CLAUDE.md patterns can inform Copilot usage. Shell wrappers and CI/CD safety patterns work regardless of which agent you use.
What if my agent needs production access for debugging?
Grant temporary, read-only production access through a just-in-time (JIT) access system, such as Azure AD Privileged Identity Management (PIM) or time-bound permission sets in AWS IAM Identity Center. Never grant write access to production for debugging. If the agent needs to make a production change as part of a fix, it should create a PR that goes through the standard review and deploy process.
How do I monitor agent behavior over time?
Log all agent commands to an audit trail (guardrail script logs, CloudTrail, terminal recordings). Create dashboards that show: (1) commands blocked by guardrails, (2) warning-level commands that were allowed, (3) credentials used, (4) environments targeted. Review these weekly. Patterns in blocked commands reveal where agents commonly try to do something dangerous.
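If the guardrail scripts emit one line per decision, the weekly review can start from a few greps. A sketch assuming a hypothetical log format of `<timestamp> BLOCKED|WARN|ALLOWED <command>`:

```shell
# Summarize a guardrail audit log for the weekly review.
summarize_log() {
  local log="$1"
  echo "blocked: $(grep -c ' BLOCKED ' "$log")"
  echo "warned:  $(grep -c ' WARN ' "$log")"
  echo "allowed: $(grep -c ' ALLOWED ' "$log")"
}
```

Feeding these counts into a dashboard makes week-over-week trends in blocked commands easy to spot.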
Lilly Tech Systems