Building a Complete Agent
This guide walks through building a fully functional AI agent from scratch, covering requirements, LLM selection, tool design, memory, testing, deployment, monitoring, and safety guardrails.
Step 1: Requirements Gathering
Before writing code, define what your agent needs to do:
- Goal: What task(s) should the agent accomplish?
- Scope: What are the boundaries? What should it NOT do?
- Tools needed: What external systems must it interact with?
- User interaction: Fully autonomous or human-in-the-loop?
- Quality bar: How good does it need to be? What is the cost of errors?
- Volume: How many tasks per day? Concurrent users?
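The answers to these questions can be captured up front in a small requirements record. The sketch below uses a hypothetical `AgentSpec` dataclass (the field names and defaults are illustrative, not a standard):

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    """Illustrative requirements record for an agent project."""
    goal: str
    out_of_scope: list = field(default_factory=list)   # things the agent must NOT do
    tools: list = field(default_factory=list)          # external systems it touches
    human_in_the_loop: bool = True                     # autonomous vs. supervised
    max_cost_per_task_usd: float = 0.50                # quality/cost bar
    expected_tasks_per_day: int = 1000                 # volume planning


spec = AgentSpec(
    goal="Answer research questions with cited sources",
    out_of_scope=["financial advice", "medical advice"],
    tools=["web_search", "read_url", "save_report"],
)
```

Writing the spec down this way makes scope and budget decisions explicit and reviewable before any agent code exists.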
Step 2: Choosing the LLM
| Requirement | Recommended Model |
|---|---|
| Complex reasoning, high accuracy | Claude Opus 4, o3 |
| Good balance, most use cases | Claude Sonnet 4, GPT-4o |
| High volume, cost-sensitive | GPT-4o mini, Gemini Flash |
| Privacy-critical, self-hosted | Llama 3.3 70B, Qwen 2.5 72B |
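In code, this table can become a simple routing lookup so the model choice lives in one place. This is a sketch; the entries mirror the table above, and the strings are display names, not exact API model identifiers:

```python
# Routing table mirroring the recommendations above.
# Keys are requirement profiles; values list models in order of preference.
MODEL_BY_REQUIREMENT = {
    "complex_reasoning": ["Claude Opus 4", "o3"],
    "balanced": ["Claude Sonnet 4", "GPT-4o"],
    "high_volume": ["GPT-4o mini", "Gemini Flash"],
    "self_hosted": ["Llama 3.3 70B", "Qwen 2.5 72B"],
}


def pick_model(requirement: str, fallback: str = "Claude Sonnet 4") -> str:
    """Return the first recommended model for a requirement profile."""
    return MODEL_BY_REQUIREMENT.get(requirement, [fallback])[0]
```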
Step 3: Complete Agent Implementation
Python - Complete Research Agent
```python
from datetime import datetime

import anthropic


class ResearchAgent:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.model = "claude-sonnet-4-20250514"
        self.max_steps = 15
        self.tools = self._define_tools()
        self.system_prompt = (
            "You are a research agent. Given a research question, use your "
            "tools to find information, analyze it, and produce a report. "
            "Always cite your sources. If you cannot find reliable "
            "information, say so rather than speculating."
        )

    def _define_tools(self):
        return [
            {
                "name": "web_search",
                "description": "Search the web for information",
                "input_schema": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
            {
                "name": "read_url",
                "description": "Read content from a URL",
                "input_schema": {
                    "type": "object",
                    "properties": {"url": {"type": "string"}},
                    "required": ["url"],
                },
            },
            {
                "name": "save_report",
                "description": "Save the final research report",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "content": {"type": "string"},
                    },
                    "required": ["title", "content"],
                },
            },
        ]

    def run(self, question):
        """Execute the research agent loop."""
        messages = [{"role": "user", "content": f"Research this: {question}"}]

        for step in range(self.max_steps):
            response = self.client.messages.create(
                model=self.model,
                max_tokens=4096,
                system=self.system_prompt,
                tools=self.tools,
                messages=messages,
            )

            # Add assistant response to history
            messages.append({"role": "assistant", "content": response.content})

            # If no tool use, the agent is done
            if response.stop_reason == "end_turn":
                return self._extract_text(response)

            # Execute each tool call
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = self._execute(block)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})

        return "Max steps reached"

    def _extract_text(self, response):
        """Concatenate the text blocks of the final response."""
        return "\n".join(b.text for b in response.content if b.type == "text")

    def _execute(self, block):
        """Dispatch a tool call. These are placeholder stubs; replace them
        with real search, fetch, and storage integrations."""
        if block.name == "web_search":
            return "(stub) search results for: " + block.input["query"]
        if block.name == "read_url":
            return "(stub) content of: " + block.input["url"]
        if block.name == "save_report":
            return f"Report '{block.input['title']}' saved at {datetime.now().isoformat()}"
        return f"Unknown tool: {block.name}"


# Usage
agent = ResearchAgent()
report = agent.run("Latest advances in AI agents")
print(report)
```
Step 4: Testing Agents
Agent testing is harder than traditional software testing because behavior is non-deterministic:
- Unit test tools: Test each tool function independently with known inputs and expected outputs
- Scenario testing: Run the agent against a set of predefined scenarios and check outcomes
- Regression testing: Keep a suite of tasks the agent should handle and verify after changes
- Adversarial testing: Try edge cases, ambiguous inputs, and adversarial prompts
- Cost monitoring: Track token usage per task to catch cost regressions
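Scenario testing can be sketched as a small harness that runs the agent over predefined tasks and checks outcomes. In this sketch, `run_agent` is a stand-in for your real agent entry point (here it is faked so the harness itself is runnable); the scenario fields are illustrative:

```python
# Stand-in for the real agent entry point; a fake keeps the harness runnable.
def run_agent(task: str) -> dict:
    if "refund" in task:
        return {"status": "escalated", "steps": 3}
    return {"status": "completed", "steps": 5}


# Each scenario pairs an input task with the expected outcome and a step cap.
SCENARIOS = [
    {"task": "Summarize this report", "expect_status": "completed", "max_steps": 10},
    {"task": "Process a refund of $10,000", "expect_status": "escalated", "max_steps": 10},
]


def run_scenarios(scenarios):
    """Run each scenario and collect pass/fail results."""
    results = []
    for s in scenarios:
        out = run_agent(s["task"])
        passed = (out["status"] == s["expect_status"]
                  and out["steps"] <= s["max_steps"])
        results.append({"task": s["task"], "passed": passed})
    return results
```

Because agent behavior is non-deterministic, it often helps to run each scenario several times and require a pass rate rather than a single pass.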
Step 5: Deployment
- API server: Wrap the agent in a REST API (FastAPI, Flask) for integration
- Queue-based: For async tasks, use a job queue (Redis, SQS) to process requests
- Containerized: Docker containers for consistent deployment
- Serverless: AWS Lambda or Google Cloud Functions for event-driven agents
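The queue-based pattern can be sketched with the standard library, using an in-memory `queue.Queue` and a worker thread as a stand-in for Redis or SQS; `handle_task` is a placeholder for the real agent call:

```python
import queue
import threading


def handle_task(task: str) -> str:
    """Placeholder for the real agent invocation."""
    return f"report for: {task}"


def run_worker(jobs: queue.Queue, results: dict):
    """Drain the job queue until a None sentinel arrives."""
    while True:
        task = jobs.get()
        if task is None:      # sentinel: shut down
            break
        results[task] = handle_task(task)
        jobs.task_done()


jobs: queue.Queue = queue.Queue()
results: dict = {}
worker = threading.Thread(target=run_worker, args=(jobs, results))
worker.start()

for t in ["task-1", "task-2"]:
    jobs.put(t)
jobs.put(None)                # stop the worker
worker.join()
```

In production the in-memory queue becomes a durable broker, and the worker becomes a separate process or container so agent runs survive restarts and can scale horizontally.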
Step 6: Safety Guardrails
Essential safety measures for production agents:
- Action whitelists: Explicitly define allowed actions. Deny by default.
- Rate limits: Cap the number of actions per minute and per task
- Budget limits: Set maximum token/cost budget per task
- Sandboxing: Execute code in sandboxed environments (Docker, gVisor)
- Human approval: Require human confirmation for high-risk actions (delete, send, purchase)
- Audit logging: Log every action for review and debugging
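Several of these measures can live in one gatekeeper that every tool call passes through. The sketch below combines a deny-by-default whitelist, a per-minute rate limit, a cost budget, human approval for high-risk actions, and an audit log; the limits and the high-risk action names are illustrative defaults, not recommendations:

```python
import time


class Guardrails:
    """Minimal sketch of a deny-by-default action gatekeeper."""

    HIGH_RISK = frozenset({"delete", "send", "purchase"})

    def __init__(self, allowed_actions, max_actions_per_minute=30, max_cost_usd=1.00):
        self.allowed = set(allowed_actions)
        self.max_rate = max_actions_per_minute
        self.max_cost = max_cost_usd
        self.spent = 0.0
        self.timestamps = []   # recent allowed-action times for rate limiting
        self.audit_log = []    # every decision is logged, allowed or not

    def check(self, action: str, cost_usd: float = 0.0, approved: bool = False) -> bool:
        now = time.monotonic()
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if action not in self.allowed:
            verdict = "denied: not whitelisted"
        elif len(self.timestamps) >= self.max_rate:
            verdict = "denied: rate limit"
        elif self.spent + cost_usd > self.max_cost:
            verdict = "denied: budget exceeded"
        elif action in self.HIGH_RISK and not approved:
            verdict = "denied: needs human approval"
        else:
            verdict = "allowed"
            self.spent += cost_usd
            self.timestamps.append(now)
        self.audit_log.append((action, verdict))
        return verdict == "allowed"
```

Routing every action through a single `check` call keeps the policy auditable in one place instead of scattered across tool implementations.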
Step 7: Monitoring
In production, monitor these metrics:
- Task completion rate: Percentage of tasks successfully completed
- Average steps per task: Efficiency metric; fewer steps is better
- Cost per task: Token usage and API costs
- Error rate: Tool failures, LLM errors, timeout rates
- Latency: Time from request to completion
- User satisfaction: Ratings, feedback, escalation rates
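Most of these signals can be computed from one record per finished task. The sketch below is a minimal in-memory aggregator (in production these numbers would feed a metrics backend rather than a Python dict):

```python
class AgentMetrics:
    """Sketch of an in-memory aggregator for the metrics above."""

    def __init__(self):
        self.tasks = []   # one record per finished task

    def record(self, completed: bool, steps: int, cost_usd: float,
               latency_s: float, errors: int = 0):
        self.tasks.append({"completed": completed, "steps": steps,
                           "cost": cost_usd, "latency": latency_s,
                           "errors": errors})

    def summary(self) -> dict:
        n = len(self.tasks)
        if n == 0:
            return {}
        return {
            "completion_rate": sum(t["completed"] for t in self.tasks) / n,
            "avg_steps": sum(t["steps"] for t in self.tasks) / n,
            "avg_cost_usd": sum(t["cost"] for t in self.tasks) / n,
            "error_rate": sum(t["errors"] > 0 for t in self.tasks) / n,
            "avg_latency_s": sum(t["latency"] for t in self.tasks) / n,
        }
```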