Tool-Using Agent Architectures

Patterns for agents that use external tools. ReAct, plan-then-execute, tool selection, sandboxing, and error handling.

Last updated: December 2024

Beyond Pure Generation

Language models excel at generating text but struggle with tasks requiring precise computation, current information, or real-world actions. Tool use extends model capabilities by connecting them to external systems: calculators, search engines, databases, APIs, code executors.

This isn't a new idea. Karpas et al. (2022) described MRKL systems—modular architectures combining language models with symbolic tools. Schick et al. (2023) showed that models can learn when and how to call tools through self-supervision in Toolformer. Yao et al. (2023) introduced ReAct, demonstrating that interleaving reasoning and action improves both.

The key insight: language models are good at deciding what to do; external tools are good at actually doing it. Effective architectures combine these strengths while managing the complexity of coordination.

Agent Architecture Patterns

Tool-using agents follow several common patterns, each with different trade-offs between planning depth, execution speed, and error recovery.

ReAct: Reasoning and Acting

The ReAct pattern interleaves reasoning steps with tool invocations. The model thinks about what to do, executes an action, observes the result, and reasons again. This tight loop enables reactive adaptation but limits lookahead.

ReAct works well for exploratory tasks where the next step depends heavily on previous results. It's less suited for tasks requiring coordinated multi-step execution—each step is planned in isolation.
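The loop can be sketched in a few lines. This is a minimal illustration, not any framework's API: the `llm` callable, the `Action: tool[input]` line format, and the `Final:` marker are all assumptions made for the sketch.

```python
def react_loop(goal, llm, tools, max_steps=5):
    """Minimal ReAct loop: think, act, observe, repeat until a final answer."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # Model emits "Thought: ...\nAction: tool[input]" or "Final: answer" (assumed format).
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        # Parse "Action: tool[input]" and invoke the named tool.
        name, _, arg = step.split("Action:")[-1].strip().partition("[")
        observation = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return transcript  # step budget exhausted; return the trace for inspection
```

Note the limitation the text describes: each call to `llm` plans only the next step, with no visibility into steps beyond it.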

Plan-Then-Execute

Separate planning from execution: first generate a complete plan, then execute it step by step. This enables optimization across steps, parallel execution of independent tasks, and clearer progress tracking.

Opinion: Plan-then-execute is underrated for complex tasks. Most agent frameworks default to ReAct-style step-by-step execution, losing the opportunity to optimize holistically. A plan with explicit dependencies enables parallel execution; ReAct is inherently sequential.

┌─────────────────────────────────────────────────────────────────┐
│ PLAN-THEN-EXECUTE WORKFLOW                                      │
├─────────────────────────────────────────────────────────────────┤
│ 1. PLANNING PHASE                                               │
│    ├── Goal decomposition                                       │
│    ├── Task dependency analysis                                 │
│    ├── Resource estimation                                      │
│    └── Rollback strategy                                        │
│                                                                  │
│ 2. APPROVAL PHASE                                               │
│    └── Human or automated review                                │
│                                                                  │
│ 3. EXECUTION PHASE                                              │
│    ├── Task dispatch (sequential or parallel)                   │
│    ├── Progress tracking                                        │
│    ├── Error handling and retry                                 │
│    └── Rollback if needed                                       │
└─────────────────────────────────────────────────────────────────┘
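A plan with explicit dependencies, as the diagram's planning phase produces, might be represented like this. The field names are illustrative, but note that each step carries its rollback strategy from the start:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    id: str
    action: str
    depends_on: list[str] = field(default_factory=list)
    rollback: str | None = None  # how to undo this step; planned upfront, not an afterthought

@dataclass
class Plan:
    goal: str
    steps: list[PlanStep]

    def ready_steps(self, completed: set[str]) -> list[PlanStep]:
        """Steps whose dependencies are all satisfied: dispatchable, possibly in parallel."""
        return [s for s in self.steps
                if s.id not in completed and set(s.depends_on) <= completed]
```

Because dependencies are explicit, everything `ready_steps` returns at once can safely execute in parallel, which is exactly the holistic optimization ReAct's step-at-a-time structure forgoes.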

Hybrid Approaches

Combine planning with reactive adaptation: plan the overall strategy, but allow re-planning when execution reveals unexpected conditions. This captures benefits of both approaches at the cost of complexity.

Shinn et al. (2023) introduced Reflexion—agents that learn from failures through verbal self-critique. Failed attempts inform subsequent plans, enabling adaptation without full re-planning.

Tool Registry Design

Agents need to know what tools are available, what they can do, and how to use them. The tool registry is the catalog that enables tool selection.

Schema-Based Registration

Each tool is registered with a schema describing its interface: required parameters, optional parameters, return types, and constraints. Schemas enable type checking, documentation generation, and model prompting.

Common schema formats include JSON Schema, OpenAPI, and custom specifications. The choice matters less than consistency—models need reliable descriptions to generate valid tool calls.
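A registry entry might pair a function with a JSON-Schema-style parameter description, enabling the checks described above. This is a sketch with assumed names (`register_tool`, `call_tool`), not a standard API:

```python
TOOL_REGISTRY: dict = {}

def register_tool(name: str, description: str, parameters: dict):
    """Register a function together with a JSON-Schema-style interface description."""
    def decorator(fn):
        TOOL_REGISTRY[name] = {"fn": fn, "description": description,
                               "parameters": parameters}
        return fn
    return decorator

@register_tool(
    name="calculator",
    description="Evaluate a basic arithmetic expression.",
    parameters={
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
)
def calculator(expression: str) -> float:
    # Illustration only; a real tool would use a proper expression parser, not eval.
    return float(eval(expression, {"__builtins__": {}}))

def call_tool(name: str, args: dict):
    """Check required parameters against the schema before invoking the tool."""
    spec = TOOL_REGISTRY[name]
    missing = [p for p in spec["parameters"].get("required", []) if p not in args]
    if missing:
        raise ValueError(f"{name}: missing required parameters {missing}")
    return spec["fn"](**args)
```

The same schema dict can be serialized into the model's prompt, which is what makes consistent descriptions matter more than the specific format.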

Capability Categories

Tools can be categorized by capability type:

  • Information retrieval — Search, database queries, API calls
  • Computation — Calculators, code execution, data processing
  • State modification — File writes, database updates, API mutations
  • External interaction — Email, messaging, webhook triggers

Opinion: Capability categories should inform permission models. Information retrieval is generally safe; state modification requires more scrutiny. Treating all tools equally ignores meaningful risk differences.
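A permission model keyed on these categories can be very small. The specific policy below (read-only capabilities run freely, side-effecting ones need review) is one illustrative choice:

```python
from enum import Enum

class Capability(Enum):
    RETRIEVAL = "information_retrieval"
    COMPUTATION = "computation"
    STATE_MODIFICATION = "state_modification"
    EXTERNAL_INTERACTION = "external_interaction"

# Illustrative policy: scrutiny scales with side effects.
APPROVAL_REQUIRED = {
    Capability.RETRIEVAL: False,
    Capability.COMPUTATION: False,
    Capability.STATE_MODIFICATION: True,
    Capability.EXTERNAL_INTERACTION: True,
}

def needs_approval(capability: Capability) -> bool:
    return APPROVAL_REQUIRED[capability]
```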

Dynamic vs. Static Registries

Static registries define available tools at system startup. Dynamic registries allow tools to be added, removed, or modified at runtime. Dynamic registries enable extensibility but complicate reasoning about system capabilities.

Tool Selection

Given a task, which tool should handle it? Selection involves matching task requirements to tool capabilities.

Pattern-Based Routing

Route based on input patterns: code-related tasks to code tools, search queries to search tools, calculations to calculators. Pattern matching is fast but requires manual rule maintenance.
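In its simplest form this is an ordered list of regex rules, first match wins. The patterns and tool names here are made up for illustration, and real rule sets grow large, which is exactly the maintenance burden noted above:

```python
import re

ROUTES = [  # (pattern, tool name), checked in order; first match wins
    (re.compile(r"\b(calculate|compute|\d+\s*[-+*/]\s*\d+)", re.I), "calculator"),
    (re.compile(r"\b(search|find|look up|latest)\b", re.I), "search"),
    (re.compile(r"\b(def |class |function|refactor)\b"), "code_tool"),
]

def route(task: str, default: str = "llm") -> str:
    for pattern, tool in ROUTES:
        if pattern.search(task):
            return tool
    return default  # no rule matched: fall through to the general-purpose model
```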

Capability Matching

Match task requirements to tool capabilities through semantic comparison. More flexible than pattern matching but computationally expensive.

Task-Aware Selection

Different tasks benefit from different tools even within the same category. For code generation, language matters—Python tasks might route to one model, TypeScript tasks to another. Complex tasks might route to more capable (and more expensive) tools.

Opinion: Task-aware selection provides significant quality improvements. A single "best" tool for all tasks is rarely optimal. The selection logic should consider task type, complexity, and domain when routing.

Execution Strategies

Once tools are selected, how should execution proceed?

Sequential Execution

Execute tasks one at a time, in order. Simple and predictable, but slow for independent tasks that could run in parallel.

Parallel Execution

Execute independent tasks simultaneously. Requires dependency analysis to identify which tasks can safely run in parallel. Faster but more complex to coordinate.

Dependency-Based Scheduling

Build a dependency graph, then schedule tasks as their dependencies complete. This generalizes sequential and parallel execution—sequential execution is a linear chain, full parallelism is a graph with no edges.

Dependency-based scheduling requires explicit dependency declaration in plans. Tasks specify which other tasks they depend on; the scheduler ensures dependencies complete before dependent tasks start.
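Python's standard library covers the graph bookkeeping via `graphlib.TopologicalSorter`. A sketch (the batches are run sequentially here for simplicity, but everything in one `get_ready()` batch is mutually independent and could be dispatched in parallel):

```python
from graphlib import TopologicalSorter

def run_plan(tasks: dict, deps: dict) -> dict:
    """Run callables in dependency order. `deps` maps task id -> set of prerequisite ids."""
    ts = TopologicalSorter(deps)
    ts.prepare()  # raises CycleError if the declared dependencies form a cycle
    results = {}
    while ts.is_active():
        batch = ts.get_ready()  # all tasks in this batch are independent of each other
        for task_id in batch:
            results[task_id] = tasks[task_id]()
            ts.done(task_id)  # unblocks tasks that depended on task_id
    return results
```

The `prepare()` cycle check is a cheap plan-validation step: a plan whose dependencies are circular can be rejected before any tool runs.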

Isolated Execution Environments

For tasks that modify state (file writes, code execution), isolation prevents interference between parallel tasks. Git worktrees, containers, or virtual environments provide isolation with different trade-offs.

Opinion: Isolation is essential for reliable parallel execution. Shared mutable state causes subtle bugs when tasks run concurrently. The overhead of isolation is usually worth the reliability improvement.

Validation and Safety

Tool outputs require validation before use. Generated code might have syntax errors; API responses might be malformed; actions might violate constraints.

Multi-Layer Validation

Apply multiple validation checks at different levels:

  • Syntax validation — Is the output well-formed?
  • Type validation — Does it match expected types?
  • Semantic validation — Does it make sense in context?
  • Safety validation — Does it violate any constraints?

Different validations have different blocking behavior. Syntax errors should block execution; style warnings might just be logged.
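For generated Python, the syntax and safety layers can be built on the standard `ast` module. The specific checks below (an `eval`/`exec` denylist, a line-length style rule) are illustrative stand-ins for a real policy:

```python
import ast
from dataclasses import dataclass

@dataclass
class Issue:
    layer: str
    message: str
    blocking: bool

def validate_python(source: str) -> list:
    """Layered checks over generated Python: syntax and safety block, style only warns."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [Issue("syntax", str(e), blocking=True)]  # nothing else is worth checking
    issues = []
    for node in ast.walk(tree):  # safety layer: illustrative denylist of dangerous calls
        if isinstance(node, ast.Call) and getattr(node.func, "id", None) in {"eval", "exec"}:
            issues.append(Issue("safety", f"dangerous call: {node.func.id}", blocking=True))
    for i, line in enumerate(source.splitlines(), 1):  # style layer: warn, don't block
        if len(line) > 99:
            issues.append(Issue("style", f"line {i} exceeds 99 chars", blocking=False))
    return issues

def is_blocked(issues) -> bool:
    return any(i.blocking for i in issues)
```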

Security Scanning

For code generation, scan for common vulnerabilities: SQL injection patterns, command injection, hardcoded secrets, dangerous function calls. These checks are heuristic but catch common issues before they reach production.

Opinion: Validation should distinguish blocking issues from warnings. Syntax errors prevent execution; style violations are worth noting but shouldn't block. Treating all issues as blocking creates friction; treating all issues as warnings allows problems through.

Rollback Planning

Actions that modify state should have rollback plans. If execution fails partway through, the system should know how to restore the previous state. Rollback plans should be part of the initial planning phase, not an afterthought.

Error Handling

Tool execution fails. Networks time out, APIs return errors, code throws exceptions. Robust agents handle failures gracefully.

Retry Strategies

Transient failures often succeed on retry. Exponential backoff with jitter prevents thundering herds. Maximum retry limits prevent infinite loops.
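A minimal sketch of that policy, using "full jitter" (sleep a uniformly random amount up to the exponential cap); the defaults are arbitrary:

```python
import random
import time

def retry(fn, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    """Call fn with exponential backoff and full jitter; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the real error
            # Jitter spreads retries out so many clients don't hammer a recovering service at once.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

In practice this should be gated on error classification (discussed below): retrying only transient failures, never fatal ones.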

Fallback Tools

When a primary tool fails, try alternatives. If one search API is down, try another. If a complex model times out, try a simpler one. Fallback chains increase reliability at the cost of potential quality degradation.
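A fallback chain is just ordered iteration with error collection. One sketch, with an assumed `(name, callable)` tool format:

```python
def with_fallbacks(task: str, tools: list):
    """Try each (name, tool) pair in order; return the first success and its source."""
    errors = []
    for name, tool in tools:
        try:
            return name, tool(task)
        except Exception as e:
            errors.append(f"{name}: {e}")
    # Fallbacks exhausted: surface every failure rather than a generic error.
    raise RuntimeError("all tools failed: " + "; ".join(errors))
```

Returning the source name alongside the result lets callers account for quality degradation, since a backup tool's answer may warrant lower confidence.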

Graceful Degradation

When tools fail and fallbacks are exhausted, degrade gracefully rather than failing completely. Report partial results, explain what couldn't be done, suggest manual alternatives.

Error Classification

Different errors require different responses:

  • Transient — Retry after delay (network timeout)
  • Persistent — Try fallback or escalate (invalid credentials)
  • Fatal — Abort and report (missing required resource)

Classifying errors enables appropriate responses. Retrying fatal errors wastes resources; immediately failing on transient errors loses easy wins.

Confidence and Feedback

Tool execution produces outputs, but how confident should we be in them?

Confidence Estimation

Estimate confidence based on execution signals: Did the tool report success? Did validation pass? How many warnings were generated? For multi-model coordination, did participants reach consensus?

Opinion: Consensus-based confidence is more reliable than single-source confidence. When multiple models or tools agree, confidence increases. Disagreement signals uncertainty even if individual sources report high confidence.
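For discrete answers, the consensus signal reduces to majority vote plus agreement rate. A sketch (comparing answers by exact string match, which real systems would relax to semantic equivalence):

```python
from collections import Counter

def consensus_confidence(answers: list) -> tuple:
    """Return the majority answer and the fraction of sources that agree with it."""
    if not answers:
        raise ValueError("no answers to aggregate")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)
```

A 3-of-3 agreement yields confidence 1.0; a 2-of-3 split yields 0.67, flagging the disagreement even if each individual source reported high confidence.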

Progress Tracking

For multi-step execution, track progress explicitly: which tasks are pending, in progress, completed, or failed. Progress tracking enables resumption after interruption and provides visibility into execution state.

Outcome Recording

Record execution outcomes for learning. Which tools succeeded? Which failed? What errors occurred? Over time, this data informs tool selection and retry strategies.

Task Types and Operations

Different task types require different handling. A common taxonomy:

┌─────────────────────────────────────────────────────────────────┐
│ TASK TYPE             │ DESCRIPTION                             │
├─────────────────────────────────────────────────────────────────┤
│ create_file           │ Create new file from scratch            │
│ modify_file           │ Edit existing file                      │
│ delete_file           │ Remove file                             │
│ create_test           │ Generate test file                      │
│ install_dependency    │ Add package dependency                  │
│ database_migration    │ Create schema change                    │
│ config_change         │ Modify configuration                    │
└─────────────────────────────────────────────────────────────────┘

Each task type has different validation requirements, rollback strategies, and risk levels. Modifications are riskier than reads; deletions are riskier than creations. The taxonomy enables appropriate handling per task type.

Implementation Trade-offs

Planning Depth vs. Latency

Deeper planning catches more issues upfront but delays execution. Shallow planning starts faster but may require more replanning. The optimal depth depends on task complexity and cost of mistakes.

Tool Specialization vs. Generality

Specialized tools excel at narrow tasks; general tools handle more situations adequately. Specialization improves quality but requires more tools and more complex selection logic.

Autonomy vs. Oversight

More autonomous agents complete tasks faster but with less human oversight. More supervised agents are safer but slower. The right balance depends on task risk and trust in the system.

Opinion: Autonomy should scale with demonstrated reliability. New tools should require more oversight; well-tested tools can operate more independently. Trust is earned through track record, not assumed.

Open Questions

  1. How do we compose tools safely? — Individual tools may be safe, but combinations might not be. Composition analysis remains hard.
  2. What's the right granularity for tools? — Too coarse loses flexibility; too fine creates coordination overhead.
  3. Can tool use be learned end-to-end? — Current systems mostly engineer tool use; can it emerge from training?
  4. How do we handle tool evolution? — Tools change over time; how do agents adapt to new capabilities and deprecated features?
  5. What's the right interface between reasoning and acting? — ReAct interleaves tightly; plan-then-execute separates cleanly. Are there better intermediate points?

Conclusion

Tool-using agents extend language model capabilities into domains requiring precise computation, current information, or real-world action. The architecture choices—ReAct vs. plan-then-execute, static vs. dynamic registries, sequential vs. parallel execution—shape system behavior and reliability.

The key insight is separation of concerns: language models decide what to do; specialized tools do it; validation confirms correctness; error handling manages failures. Each component does what it's good at, and the architecture coordinates their collaboration.

As agents take on more complex tasks, tool-using architecture becomes more important. The patterns established in systems like ReAct, Toolformer, and MRKL provide foundations; the challenge is adapting them to specific domains while maintaining reliability and safety.

Further Reading

Foundational Papers

  • Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models — Interleaved reasoning and action
  • Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools — Learned tool use
  • Karpas, E., et al. (2022). MRKL Systems: A Modular, Neuro-Symbolic Architecture — LLMs with symbolic tools

Agent Frameworks

  • Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior — Persistent agent simulation
  • Wang, G., et al. (2023). Voyager: An Open-Ended Embodied Agent with LLMs — Autonomous skill acquisition
  • Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning — Learning from failures

Planning and Reasoning

  • Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Structured reasoning
  • Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Reasoning traces

Benchmarks and Evaluation

  • Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents — Multi-task agent evaluation
  • Zhou, S., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents — Real-world task environments

Safety and Alignment

  • Anthropic. Core Views on AI Safety — Risks of capable agents
  • OpenAI. Our Approach to AI Safety — Safety considerations for tool use