Best Practices & Safety Checklist
This final lesson brings everything together into an actionable checklist, team policies, tool-specific configuration guides, and a maturity model for measuring your organization's AI agent safety posture.
Complete AI Agent Safety Checklist
Use this checklist before allowing AI agents to interact with any infrastructure:
Before Every Agent Session
- Verify the agent is using scoped, task-specific credentials (not personal admin keys)
- Confirm the agent is pointed at the correct environment (dev/staging, not production)
- Review CLAUDE.md or equivalent config file for safety rules
- Ensure guardrail scripts are active and in the `$PATH`
- Verify you have recent backups of any resources the agent might touch
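The pre-session checks above can be scripted so they run before every agent session. A minimal sketch in bash, assuming a staging profile named `agent-readonly` and a `staging-cluster` kubectl context (both hypothetical names; `KUBE_CONTEXT_OVERRIDE` exists only so the check can run where no cluster is reachable):

```shell
#!/usr/bin/env bash
# Pre-flight check before starting an agent session.
# The expected names below are examples; substitute your own.
preflight() {
  local expected_profile="agent-readonly"
  local expected_context="staging-cluster"

  if [ "${AWS_PROFILE:-}" != "$expected_profile" ]; then
    echo "FAIL: AWS_PROFILE is '${AWS_PROFILE:-unset}', expected '$expected_profile'" >&2
    return 1
  fi

  # Normally read from `kubectl config current-context`; the override
  # lets the check run where kubectl is unavailable.
  local context="${KUBE_CONTEXT_OVERRIDE:-$(kubectl config current-context 2>/dev/null)}"
  if [ "$context" != "$expected_context" ]; then
    echo "FAIL: kubectl context is '$context', expected '$expected_context'" >&2
    return 1
  fi

  echo "OK: pre-flight checks passed"
}
```

Call `preflight` at the top of any session script and abort on a nonzero exit.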
Infrastructure Safety
- Deletion protection enabled on all production databases (RDS, DynamoDB, Cloud SQL)
- S3/GCS bucket versioning enabled on all production buckets
- Azure resource locks applied to critical resource groups
- Terraform `prevent_destroy` lifecycle rule on critical resources
- SCPs or organization policies blocking destructive actions from agent accounts
- Budget alerts configured with hard caps on agent accounts
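Most of these protections are one-line CLI calls. A sketch using standard AWS and Azure CLI commands (all resource names are placeholders):

```shell
# Enable deletion protection on a production RDS instance
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --deletion-protection

# Enable deletion protection on a DynamoDB table
aws dynamodb update-table \
  --table-name prod-table \
  --deletion-protection-enabled

# Turn on versioning for a production S3 bucket
aws s3api put-bucket-versioning \
  --bucket prod-bucket \
  --versioning-configuration Status=Enabled

# Apply an Azure resource lock to a critical resource group
az lock create \
  --name prod-lock \
  --resource-group prod-rg \
  --lock-type CanNotDelete
```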
Code and Process Safety
- Branch protection rules enforced on main/production branches
- CODEOWNERS file requiring infrastructure team review for IaC changes
- CI/CD pipeline runs plan and posts output on every PR
- Destructive plan changes block PR merge automatically
- Production apply only happens through CI/CD pipeline, never directly
- Agent incident runbooks documented and accessible
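The "destructive plan changes block merge" item can be enforced with a small CI step. A sketch that parses the `terraform plan` summary line from stdin (the wording check is an assumption, not a standard Terraform interface; `terraform plan -detailed-exitcode` is an alternative if you only need any-change detection):

```shell
# CI gate: fail the job if the terraform plan summary reports destroys.
# Reads plan output on stdin, e.g.:  terraform plan -no-color | check_plan
check_plan() {
  local destroys
  destroys=$(grep -oE '[0-9]+ to destroy' | grep -oE '[0-9]+' || echo 0)
  if [ "${destroys:-0}" -gt 0 ]; then
    echo "BLOCKED: plan destroys ${destroys} resource(s); manual approval required" >&2
    return 1
  fi
  echo "OK: no destructive changes"
}
```

A nonzero exit from the gate fails the pipeline job, which in turn blocks merge when the job is a required status check.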
Team Policies for AI Agent Usage
Establish clear organizational policies for how AI agents are used:
```markdown
# AI Agent Usage Policy v1.0

## Scope
This policy applies to all team members using AI coding agents (Claude Code,
GitHub Copilot, Cursor, Windsurf, Codex, Aider) in any project that interacts
with cloud infrastructure.

## Required Practices
1. **Credentials:** Use dedicated agent IAM roles with read-only access by
   default. Write access requires team lead approval.
2. **Environment:** Agents must target dev/staging environments. Production
   changes go through the GitOps workflow only.
3. **CLAUDE.md:** All projects must have a CLAUDE.md with safety rules before
   agents are used for infrastructure tasks.
4. **Review:** All agent-generated infrastructure changes must be reviewed by
   a human before applying.
5. **Logging:** All agent sessions involving infrastructure must be logged
   (terminal recording or audit log).

## Prohibited Actions
- Running `terraform apply` or `terraform destroy` directly
- Giving agents production database credentials
- Using agents with personal admin AWS/Azure/GCP credentials
- Allowing agents to push directly to main/production branches
- Running agents in full-auto mode for infrastructure tasks

## Incident Reporting
Any unintended infrastructure change caused by an AI agent must be reported
to the infrastructure team within 15 minutes.
```
Training Recommendations
Ensure your team is prepared for safe AI agent usage:
| Audience | Training Focus | Frequency |
|---|---|---|
| All Developers | Agent permission models, CLAUDE.md best practices, when to use dry-run | At onboarding + quarterly |
| Infrastructure Team | Guardrail scripts, OPA policies, CI/CD safety patterns, incident response | Monthly |
| Team Leads | Agent usage policies, risk assessment, credential management | Quarterly |
| On-Call Engineers | Agent incident runbooks, recovery procedures, communication templates | At rotation start |
Tool-Specific Configuration Guides
Claude Code Safety Configuration
```markdown
# Project Safety Rules

## Environment
- This is a STAGING environment. Never target production.
- AWS Account: 123456789012 (staging only)
- K8s Context: staging-cluster (verify with `kubectl config current-context`)

## Forbidden Commands
- NEVER: terraform destroy, terraform apply -auto-approve
- NEVER: kubectl delete namespace, kubectl delete --all
- NEVER: git push --force, git push -f
- NEVER: rm -rf /, rm -rf *, rm -rf ~
- NEVER: DROP DATABASE, DROP TABLE, TRUNCATE
- NEVER: aws * delete-*, aws s3 rb, aws ec2 terminate-instances

## Required Patterns
- ALWAYS: terraform plan -out=plan.tfplan BEFORE terraform apply plan.tfplan
- ALWAYS: kubectl diff BEFORE kubectl apply
- ALWAYS: git checkout -b feature/* for changes (never commit to main)
- ALWAYS: Create a backup before modifying any database

## Guardrails
- Shell wrappers are in ./bin/ - add to PATH before running commands
- Pre-commit hooks are installed - do not bypass with --no-verify
```
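The "shell wrappers are in ./bin/" rule can be implemented with a small interception script. A minimal sketch for terraform (the blocked-argument list is illustrative, and `REAL_TF` is an assumed variable name pointing at the real binary):

```shell
#!/usr/bin/env bash
# ./bin/terraform guardrail wrapper (sketch): refuse destructive
# invocations, otherwise delegate to the real binary.
REAL_TF="${REAL_TF:-/usr/local/bin/terraform}"

guard_terraform() {
  for arg in "$@"; do
    case "$arg" in
      destroy|-auto-approve|--auto-approve)
        echo "BLOCKED by guardrail: terraform $*" >&2
        return 1
        ;;
    esac
  done
  "$REAL_TF" "$@"
}

# As an installed wrapper this file would end with: guard_terraform "$@"
```

Because the wrapper sits earlier in `$PATH`, the agent's `terraform destroy` hits the guard instead of the real binary.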
Cursor Configuration
```json
{
  "ai.agent.confirmTerminalCommands": true,
  "ai.agent.maxSteps": 25,
  "ai.agent.blockedCommands": [
    "terraform destroy",
    "terraform apply -auto-approve",
    "kubectl delete namespace",
    "git push --force",
    "rm -rf /"
  ],
  "terminal.integrated.env.linux": {
    "AWS_PROFILE": "agent-readonly",
    "KUBECONFIG": "~/.kube/staging-config"
  }
}
```
The Agent Safety Maturity Model
Assess and improve your organization's AI agent safety posture using this 5-level maturity model:
| Level | Name | Description | Key Characteristics |
|---|---|---|---|
| 1 | Ad Hoc | No formal agent safety practices | Agents use personal admin credentials; no guardrails; no policies |
| 2 | Aware | Basic awareness of risks | CLAUDE.md with basic rules; manual review of agent actions; verbal policies |
| 3 | Defined | Formal processes in place | Written policies; scoped credentials; dry-run enforcement; pre-commit hooks |
| 4 | Managed | Automated enforcement | Guardrail scripts; OPA policies; GitOps workflow; deletion protection; incident runbooks |
| 5 | Optimized | Continuous improvement | Automated policy testing; agent behavior monitoring; regular drills; feedback loops |
Future of AI Agent Safety
The field is rapidly evolving. Key trends to watch:
- Built-in agent sandboxing: Agent vendors are building sandboxing directly into their tools (e.g., Codex's network-disabled sandbox mode)
- Agent identity and auditing: Cloud providers will likely introduce agent-specific IAM identities that distinguish agent actions from human actions in audit logs
- Policy-as-code for agents: Expect more frameworks like OPA but specifically designed for AI agent governance
- Agent safety testing: Red-teaming and adversarial testing specifically for agent safety guardrails
- Multi-agent safety: As multi-agent systems become common, safety patterns for agent-to-agent interactions will emerge
- Regulatory frameworks: Governments will likely introduce regulations around autonomous AI systems interacting with critical infrastructure
Frequently Asked Questions
Should I completely ban AI agents from touching infrastructure?
No. AI agents can be tremendously productive for infrastructure work when properly guarded. The goal is not to prevent agent use but to make it safe. With proper guardrails (sandboxed environments, GitOps workflows, scoped credentials, and guardrail scripts), agents can safely write Terraform, create Kubernetes manifests, and debug infrastructure issues.
What's the minimum safety setup I need before using an agent with cloud resources?
At minimum: (1) Use scoped, read-only credentials for the agent, (2) Add a CLAUDE.md or equivalent with forbidden commands, (3) Enable deletion protection on all production databases and storage, (4) Never let the agent directly apply changes to production — use GitOps. This takes about 30 minutes to set up and prevents the most common agent-caused incidents.
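Step (1), a dedicated read-only identity, is the highest-leverage piece of that setup. A sketch using AWS CLI calls (role name, trust policy path, and account ID are placeholders):

```shell
# Create a dedicated agent role and attach AWS's managed read-only policy.
aws iam create-role \
  --role-name agent-readonly \
  --assume-role-policy-document file://trust-policy.json

aws iam attach-role-policy \
  --role-name agent-readonly \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess

# Point the agent's shell at that role via a named profile.
aws configure set profile.agent-readonly.role_arn \
  arn:aws:iam::123456789012:role/agent-readonly
aws configure set profile.agent-readonly.source_profile default
export AWS_PROFILE=agent-readonly
```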
How do I convince my team to adopt agent safety practices?
Frame it in terms of risk and cost. A single agent-caused production outage can cost thousands in downtime, hours in recovery time, and significant reputation damage. The safety setup takes 1-2 days and provides ongoing protection. Share the incident examples from this course. Most teams adopt safety practices quickly after seeing what can go wrong.
Do guardrail scripts slow down development?
Minimally. Read operations pass through instantly. Write operations add a brief plan/preview step that takes seconds. The time investment is negligible compared to the hours or days of recovery from an unguarded agent incident. Most developers report that the preview step actually increases their confidence, making them more willing to use agents for complex tasks.
Can I use the same safety practices across all AI agents?
Yes, with tool-specific adaptations. The core principles (least privilege, dry-run, sandboxing, GitOps) apply universally. However, each agent has its own configuration format. CLAUDE.md works for Claude Code, .cursorrules for Cursor, and general CLAUDE.md patterns can inform Copilot usage. Shell wrappers and CI/CD safety patterns work regardless of which agent you use.
What if my agent needs production access for debugging?
Grant temporary, read-only production access through a just-in-time (JIT) access system, such as Azure AD Privileged Identity Management (PIM) or time-bound permission sets in AWS IAM Identity Center. Never grant write access to production for debugging. If the agent needs to make a production change as part of a fix, it should create a PR that goes through the standard review and deploy process.
How do I monitor agent behavior over time?
Log all agent commands to an audit trail (guardrail script logs, CloudTrail, terminal recordings). Create dashboards that show: (1) commands blocked by guardrails, (2) warning-level commands that were allowed, (3) credentials used, (4) environments targeted. Review these weekly. Patterns in blocked commands reveal where agents commonly try to do something dangerous.
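If the guardrail scripts emit one line per decision, the weekly review can start from a few greps. A sketch assuming a hypothetical log format of `<timestamp> BLOCKED|WARN|ALLOWED <command>`:

```shell
# Summarize a guardrail audit log for the weekly review.
summarize_log() {
  local log="$1"
  echo "blocked: $(grep -c ' BLOCKED ' "$log")"
  echo "warned:  $(grep -c ' WARN ' "$log")"
  echo "allowed: $(grep -c ' ALLOWED ' "$log")"
}
```

Feeding these counts into a dashboard makes week-over-week trends in blocked commands easy to spot.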
Lilly Tech Systems