Best Practices & Checklist

This final lesson consolidates everything into actionable checklists, organization-level policies, and answers to the most common questions about safely using AI agents with cloud infrastructure.

Comprehensive Checklist for Safe AI Agent Cloud Usage

💡 AI Agent Cloud Safety Checklist:
  • AI agents use dedicated service accounts, never personal or admin credentials
  • All agent credentials are time-limited (1 hour max session duration)
  • Explicit Deny policies block all delete/terminate/destroy operations
  • Permission boundaries are applied to all agent IAM roles
  • Human approval is required for any infrastructure modification in production
  • CI/CD pipelines with approval gates are used instead of direct agent execution
  • Terraform prevent_destroy is set on all critical resources
  • Cloud-native deletion protection is enabled on all production resources
  • S3 Object Lock / Azure Immutable Blob / GCS Retention is enabled for critical data
  • Real-time alerts are configured for all destructive API calls by agent accounts
  • Agent activity dashboards are monitored by the operations team
  • Incident response playbook is documented and tested quarterly
  • Agent auto-approve/YOLO mode is disabled for all cloud CLI operations
  • State files (Terraform/Pulumi) are stored remotely with versioning and locking
  • Emergency kill switch procedures are documented and accessible
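The explicit-deny checklist item can be implemented with a policy along these lines. This is a minimal AWS sketch: the action list is illustrative, not exhaustive, and would need to be extended to cover every service your agent actually touches.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDestructiveActions",
      "Effect": "Deny",
      "Action": [
        "ec2:TerminateInstances",
        "rds:DeleteDBInstance",
        "s3:DeleteBucket",
        "s3:DeleteObject",
        "dynamodb:DeleteTable",
        "cloudformation:DeleteStack"
      ],
      "Resource": "*"
    }
  ]
}
```

Because an explicit Deny overrides any Allow, attaching this policy to the agent's role blocks these calls even if another attached policy grants them.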

Organization-Level Policies

| Policy Area | Requirement | Enforcement |
| --- | --- | --- |
| Agent Onboarding | All AI agent tools must be approved by the security team before use | Software allow-list, procurement controls |
| Credential Management | No long-lived credentials for agent accounts; 1-hour max sessions | AWS SCP, Azure Policy, GCP Organization Policy |
| Environment Isolation | Agent credentials are scoped to a single environment (dev/staging/prod) | Separate AWS accounts, Azure subscriptions, and GCP projects per environment |
| Audit Requirements | All agent actions must be logged and retained for 90 days minimum | CloudTrail, Azure Activity Logs, and GCP Cloud Audit Logs with retention policies |
| Incident Reporting | Any agent-caused incident must be reported within 1 hour | PagerDuty/OpsGenie integration, Slack alerting |
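The 1-hour session cap can be enforced on AWS by lowering the role's maximum session duration, as sketched below. The role name is an assumption; 3600 seconds is also the minimum AWS allows for this setting, so the cap cannot be tightened further at the role level.

```shell
# Cap assumed-role sessions for the agent role at 1 hour (3600 s).
# "ai-agent-deployer" is an illustrative role name.
aws iam update-role \
  --role-name ai-agent-deployer \
  --max-session-duration 3600
```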

Team Training and Awareness

  1. Mandatory Onboarding Training

    Every developer who uses AI coding agents must complete this course before being granted agent-compatible cloud credentials. Include a practical exercise where they configure least-privilege policies.

  2. Monthly Safety Reviews

    Review agent activity logs monthly. Identify patterns of risky behavior, near-misses, and successful safety interventions. Share findings in team retrospectives.

  3. Incident Simulations

    Run quarterly tabletop exercises where the team simulates an AI agent accidentally deleting production resources. Practice the full incident response workflow from detection to recovery.

Regular Access Reviews

AI agent permissions should be reviewed more frequently than human permissions because the risk profile is different:

  • Weekly: Review failed authorization attempts from agent accounts (indicates the agent tried something it should not)
  • Monthly: Audit all permissions granted to agent service accounts against actual usage
  • Quarterly: Full access review with security team sign-off on all agent IAM policies
  • After every incident: Immediate review and tightening of affected agent permissions
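The weekly review of failed authorization attempts can be partly automated. The sketch below assumes you have already exported the agent's events (for example via `aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue=ai-agent-deployer`) into a simple text log; the file format, path, and user name are all illustrative.

```shell
#!/bin/sh
# Illustrative export: one event per line (timestamp, API call, result).
cat <<'EOF' > /tmp/agent-events.txt
2024-05-01T10:02:11Z TerminateInstances AccessDenied
2024-05-01T10:05:42Z DescribeInstances Success
2024-05-02T08:19:03Z DeleteBucket AccessDenied
EOF

# Surface only the denied calls -- each one is an action the agent
# attempted but was blocked from performing, and worth investigating.
grep 'AccessDenied' /tmp/agent-events.txt
```

Here the triage would flag both `TerminateInstances` and `DeleteBucket` attempts while skipping the permitted read-only call.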

Testing Agents in Sandboxed Environments

Before allowing an AI agent to interact with any real environment, test it in an isolated sandbox:

Setting up an isolated sandbox for agent testing
# AWS: Create a dedicated sandbox account via AWS Organizations
# This account has no connectivity to production accounts
aws organizations create-account \
  --email ai-agent-sandbox@company.com \
  --account-name "AI Agent Sandbox"

# Apply an SCP that prevents any cross-account access.
# Note: SCPs cannot cap spending -- pair this with an AWS Budget
# alert to keep sandbox costs bounded.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCrossAccountAccess",
      "Effect": "Deny",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalAccount": "${aws:ResourceAccount}"
        }
      }
    }
  ]
}

# Let the agent do whatever it wants in the sandbox
# Monitor its behavior, then create appropriate policies
# for staging/production based on observed needs
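Since SCPs cannot cap spending, the sandbox also needs a budget alert. A minimal sketch using AWS Budgets follows; the account ID, budget name, threshold, and email address are all placeholders.

```shell
# Create a $100/month cost budget for the sandbox account with an
# email notification at 80% of the limit. All values are illustrative.
aws budgets create-budget \
  --account-id 111111111111 \
  --budget '{
    "BudgetName": "ai-agent-sandbox-monthly",
    "BudgetLimit": {"Amount": "100", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [
      {"SubscriptionType": "EMAIL", "Address": "cloud-team@company.com"}
    ]
  }]'
```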

Emergency Kill Switches

Emergency Procedures: Every team using AI agents with cloud access must have documented and tested kill switch procedures:
  • Immediate: Disable the agent's IAM access key or service account
  • Quick: Apply an emergency SCP/Policy that denies all actions for agent principals
  • Terminal: Kill the agent's terminal session or container
Emergency kill switch scripts
#!/bin/bash
# save as: kill-agent-access.sh
# Run this immediately when an agent is behaving dangerously

# AWS: Deactivate the agent's access key
aws iam update-access-key \
  --user-name ai-agent-deployer \
  --access-key-id AKIA... \
  --status Inactive

# Azure: Disable the service principal
az ad sp update \
  --id "ai-agent-sp-object-id" \
  --set accountEnabled=false

# GCP: Disable the service account
gcloud iam service-accounts disable \
  ai-agent@project.iam.gserviceaccount.com

echo "Agent access has been revoked across all clouds."
echo "Proceed to assess damage and begin recovery."
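The "Quick" option above -- an emergency policy denying all actions for agent principals -- could look like this on AWS. This is a sketch: the role ARN pattern assumes your agent roles share an `ai-agent-` naming prefix, which you would adjust to match your own convention.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "QuarantineAgentPrincipals",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/ai-agent-*"
        }
      }
    }
  ]
}
```

Attaching this SCP to the affected account freezes every agent role at once without touching human access, which makes it safer to apply under pressure than ad-hoc permission edits.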

Frequently Asked Questions

Can I trust AI agents to manage production infrastructure?

AI agents can be valuable for infrastructure tasks, but they should never have unsupervised access to production. Use them for generating plans and code, but always require human approval before applying changes to production. Think of AI agents as powerful assistants that need guardrails, not autonomous operators.

What if my AI agent needs delete permissions for legitimate tasks?

Some tasks genuinely require delete permissions (cleaning up development resources, rotating secrets). For these cases, use time-limited credential elevation: the agent requests elevated permissions, a human approves, and the permissions automatically expire after the task window (15-60 minutes). Never grant permanent delete permissions to agent accounts.
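On AWS, that elevation pattern can be sketched with a short-lived assumed role. The role name and duration below are illustrative, and the human approval step would live in whatever workflow tool gates the assume-role call.

```shell
# Assume a separate, elevated role for a 15-minute cleanup window.
# Credentials expire automatically -- nothing needs to be revoked.
aws sts assume-role \
  --role-arn arn:aws:iam::111111111111:role/ai-agent-cleanup-elevated \
  --role-session-name cleanup-window \
  --duration-seconds 900
```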

How do I handle AI agents in CI/CD pipelines?

AI agents in CI/CD should generate the infrastructure code or commands, but the actual execution should happen through the pipeline's own service account with appropriate environment-specific gates. Use GitHub Environments, GitLab Protected Environments, or similar features to require manual approval for production deployments.

Is it safe to use AI agents with Terraform?

Yes, with proper controls. AI agents are excellent at writing Terraform code. The key rules: (1) Never let agents run terraform apply or terraform destroy directly, (2) Set prevent_destroy on all critical resources, (3) Use remote state with locking, (4) Require PR reviews for all Terraform changes, and (5) Use terraform plan output as a review step before any apply.
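Rules (1) and (5) combine into a simple gated workflow: the agent produces a saved plan, a human reviews exactly what it would change, and only that reviewed plan file is applied (a sketch, assuming a standard Terraform setup):

```shell
# Agent (or CI) generates a saved plan -- no changes are made yet.
terraform plan -out=tfplan

# Human reviews the exact changes the saved plan would make.
terraform show tfplan

# Only after approval: apply the reviewed plan file, nothing else.
terraform apply tfplan
```

Applying the saved plan file (rather than re-running `terraform apply` directly) guarantees that what was reviewed is what gets executed, even if the code or remote state has since changed.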

What about costs? Can AI agents accidentally run up cloud bills?

Yes, cost is a real risk. AI agents might create expensive resources (GPU instances, large databases) without understanding the cost implications. Set up billing alerts, use AWS Budgets / Azure Cost Management / GCP Budget Alerts, and add cost guardrails to your agent policy (e.g., deny creation of instance types above a certain size).

How often should I review and update agent permissions?

Review weekly for anomalies, monthly for permission right-sizing, and quarterly for a full security audit. Additionally, review immediately after any incident or near-miss. Use cloud-native tools like AWS IAM Access Analyzer, Azure AD Access Reviews, and GCP IAM Recommender to identify unused permissions that should be removed.

What is the minimum viable safety setup for a small team?

At minimum: (1) Create a dedicated IAM role for agent use with no delete permissions, (2) Enable deletion protection on your production database and critical storage, (3) Set up a single email alert for any destructive API call in your account, and (4) Keep the agent's shell approval prompts enabled (never use auto-approve mode). These four steps take less than an hour and prevent the majority of agent-caused incidents.
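Step (2) is often a single command per resource. For example, on an RDS database (the instance identifier below is illustrative):

```shell
# Enable deletion protection on a production RDS instance.
# Any DeleteDBInstance call will now fail until protection is
# explicitly removed -- a deliberate extra step for humans and agents alike.
aws rds modify-db-instance \
  --db-instance-identifier prod-primary-db \
  --deletion-protection \
  --apply-immediately
```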

Course Complete: You now have a comprehensive understanding of AI agent cloud safety. Start by implementing the minimum viable safety setup (dedicated credentials + deletion protection + alerts), then progressively add more controls as your team's use of AI agents grows. The key principle: AI agents are powerful tools that need guardrails, not replacements for human judgment on destructive operations.