Multi-Agent Coordination
Patterns for multiple AI agents working together. Orchestrator-worker topologies, debate structures, ensemble voting, and hierarchical coordination.
The Coordination Challenge
When multiple agents participate in a task, coordination becomes the central challenge. Unlike single-agent systems where planning and execution happen within one context, multi-agent systems must address how agents communicate, when they synchronize, how conflicts resolve, and who holds authority over final decisions.
The literature frames this through decentralized partially observable Markov decision processes (Dec-POMDPs), where agents have local observations and must coordinate without full state visibility. While the theoretical framework applies broadly, practical implementations tend toward structured topologies that make coordination tractable—trading generality for reliability.
This document examines coordination patterns that emerge in practice, the trade-offs each pattern makes, and architectural choices that affect system behavior.
Coordination Topologies
Multi-agent systems organize into distinct topologies, each suited to different task characteristics. The choice of topology determines communication patterns, failure modes, and scalability properties.
Orchestrator-Worker
A central orchestrator decomposes tasks, assigns work to specialist agents, and synthesizes results. This pattern provides clear authority and simplifies coordination at the cost of creating a single point of failure and potential bottleneck.
The orchestrator's responsibilities include task decomposition, worker selection, result aggregation, and conflict resolution. Workers operate semi-autonomously on assigned subtasks, reporting results back to the orchestrator. This mirrors classical manager-worker patterns from distributed systems.
Opinion: Orchestrator-worker works well when task decomposition is straightforward and workers have clearly differentiated capabilities. The pattern struggles when tasks require iterative refinement across workers or when the orchestrator lacks domain knowledge to properly decompose and evaluate work.
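The orchestrator loop can be sketched in a few lines. This is an illustrative skeleton, not a real framework API: `decompose`, the worker registry, and the string-joining aggregation step are all placeholders for what would be model calls in practice.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    kind: str      # which specialist should handle this
    payload: str   # the work description

def decompose(task: str) -> list[Subtask]:
    # Placeholder decomposition: split research and writing.
    # In practice this is itself a model call made by the orchestrator.
    return [Subtask("research", f"gather facts for: {task}"),
            Subtask("writing", f"draft prose for: {task}")]

def orchestrate(task: str, workers: dict[str, Callable[[str], str]]) -> str:
    results = []
    for sub in decompose(task):
        worker = workers.get(sub.kind)
        if worker is None:
            # Capability gap: the orchestrator must handle missing specialists.
            raise KeyError(f"no worker for {sub.kind!r}")
        results.append(worker(sub.payload))
    # Aggregation placeholder; a real orchestrator synthesizes with a model.
    return "\n".join(results)

workers = {
    "research": lambda p: f"[research] {p}",
    "writing":  lambda p: f"[writing] {p}",
}
print(orchestrate("summarize topic X", workers))
```

Note that the orchestrator holds all control flow: workers never talk to each other, which is exactly what makes this topology simple to monitor and a bottleneck at scale.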
Pipeline Architecture
Agents arrange into sequential stages, each transforming the output of the previous. Classical pipeline stages include routing (selecting appropriate downstream handlers), aggregation (combining information sources), generation (producing outputs), and validation (checking quality constraints).
Pipelines provide predictable latency and clear data flow. Each stage has well-defined inputs and outputs, making debugging and monitoring straightforward. The pattern naturally supports different model sizes at different stages—lightweight models for routing, capable models for generation.
Opinion: Pipelines excel when processing has natural sequential dependencies. They're less suited for tasks requiring iteration or backtracking, where earlier stages must incorporate feedback from later ones.
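The four stages named above can be expressed as functions threaded over a shared state. The stage implementations here are stubs; in a real system each would wrap a model call, with cheaper models at routing and validation.

```python
# Each stage is a function from state to state; the pipeline is a fold.
def route(query: str) -> dict:
    # Lightweight model territory: pick a downstream handler.
    return {"query": query, "handler": "general"}

def aggregate(state: dict) -> dict:
    # Combine information sources (stubbed).
    state["context"] = f"context for {state['query']}"
    return state

def generate(state: dict) -> dict:
    # Capable-model territory: produce the actual output.
    state["draft"] = f"answer({state['query']}) using {state['context']}"
    return state

def validate(state: dict) -> dict:
    # Check quality constraints; failing here stops error propagation.
    if not state["draft"]:
        raise ValueError("empty draft")
    return state

def run_pipeline(query: str) -> dict:
    state = route(query)
    for stage in (aggregate, generate, validate):
        state = stage(state)
    return state
```

The strictly forward data flow is what gives pipelines their predictable latency, and also why the pattern cannot express backtracking without adding explicit loops.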
Debate Structures
Multiple agents argue for different positions, with structured rounds of critique and response. Debate can operate at multiple levels: internal (within an agent's reasoning), inter-agent (between peer agents), and synthesis (combining positions into final output).
Research on AI debate suggests that structured disagreement can surface errors that single agents miss. The key insight is that it's often easier to critique a position than to generate a correct one from scratch—debate leverages this asymmetry.
Opinion: Multi-level debate architectures work well for complex queries where different perspectives genuinely add value. The overhead isn't justified for straightforward tasks. Effective debate requires careful prompt design to ensure agents engage substantively rather than agreeing superficially.
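A minimal inter-agent debate loop might look like the following sketch. The stub agents are deterministic stand-ins for model calls; the snapshot-then-revise structure ensures all agents critique the same set of positions within a round.

```python
def debate(question: str, agents: dict, rounds: int = 2) -> dict:
    # Opening positions, formed without seeing any peer output.
    positions = {name: fn(question, None) for name, fn in agents.items()}
    for _ in range(rounds):
        snapshot = dict(positions)  # everyone critiques the same snapshot
        for name, fn in agents.items():
            others = {k: v for k, v in snapshot.items() if k != name}
            positions[name] = fn(question, others)
    return positions

# Deterministic stub agents standing in for model calls.
def make_stub(name: str):
    def agent(question, others):
        seen = 0 if others is None else len(others)
        return f"{name} on {question!r} (saw {seen} peers)"
    return agent

result = debate("Is X true?", {"pro": make_stub("pro"), "con": make_stub("con")})
```

The synthesis step (combining final positions) is deliberately left out here; as discussed later, it deserves its own design rather than being folded into the debate loop.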
Ensemble Voting
Multiple agents process the same input independently, then vote on outputs. This pattern draws from ensemble methods in machine learning, where combining multiple weak learners produces stronger predictions than any individual learner achieves alone.
Voting mechanisms range from simple majority to weighted schemes based on historical accuracy. The pattern provides natural redundancy—individual agent failures don't necessarily cause system failures if the majority produces correct outputs.
Opinion: Parallel voting works well when agents are genuinely independent (different architectures or training) and the task has clear correctness criteria. It's less effective when agents share common failure modes or when "correctness" requires nuanced judgment rather than verifiable answers.
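Simple majority voting reduces to counting identical answers. This sketch breaks ties in favor of the first-seen answer; a weighted scheme would replace the raw counts with per-agent accuracy weights.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Counter.most_common breaks ties by insertion order,
    # so the first-seen answer wins a tie.
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner

def weighted_vote(answers: list[str], weights: list[float]) -> str:
    # Weighted variant: sum each agent's weight behind its answer.
    totals: dict[str, float] = {}
    for answer, weight in zip(answers, weights):
        totals[answer] = totals.get(answer, 0.0) + weight
    return max(totals, key=totals.get)
```

Note that exact-string matching only works for tasks with verifiable, canonical answers; free-form outputs need semantic clustering before any vote is meaningful.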
Strategy Selection
Different queries warrant different coordination intensities. A factual lookup doesn't benefit from multi-agent debate, while a complex analysis might require extensive coordination. Systems benefit from adaptive strategy selection based on query characteristics.
Strategy Spectrum
Coordination strategies exist on a spectrum from minimal to intensive:
- Direct: Single agent, no coordination overhead. Appropriate for simple, well-defined tasks.
- Fast consensus: Two agents, single round. Quick verification without extensive debate.
- Balanced: Three agents, two rounds. Moderate coordination for typical complex queries.
- Deep consensus: Five or more agents, multiple rounds. Intensive coordination for high-stakes decisions.
- Specialist: Domain-specific agents selected based on query content. Expertise-driven rather than quantity-driven.
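The spectrum above can be captured as a plain configuration table. The field names are illustrative; the agent and round counts mirror the list.

```python
# Coordination strategies as configuration. "specialist" selects agents by
# domain rather than by count, so its agent count is decided at query time.
STRATEGIES = {
    "direct":         {"agents": 1, "rounds": 0},
    "fast_consensus": {"agents": 2, "rounds": 1},
    "balanced":       {"agents": 3, "rounds": 2},
    "deep_consensus": {"agents": 5, "rounds": 3},
    "specialist":     {"agents": None, "rounds": 2},  # chosen per query
}
```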
Automatic Strategy Selection
Rather than requiring users to specify coordination intensity, systems can analyze queries to select appropriate strategies. Factors include query complexity, domain sensitivity, apparent ambiguity, and historical patterns of similar queries.
Opinion: Automatic strategy selection is valuable but imperfect. Complexity estimation based on query text is inherently limited—some simple-looking questions have complex answers. Systems should err toward more coordination when uncertain, with explicit escalation paths when initial strategies prove insufficient.
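A selector along these lines can start as simple heuristics. The keyword lists and thresholds below are assumptions for illustration; note the default deliberately errs toward some coordination when signals are ambiguous, per the point above.

```python
def select_strategy(query: str) -> str:
    q = query.lower()
    # Domain sensitivity: high-stakes domains get intensive coordination.
    high_stakes = any(w in q for w in ("legal", "medical", "financial"))
    # Crude complexity signals: long queries or explicit comparison asks.
    looks_complex = len(q.split()) > 30 or "compare" in q or "trade-off" in q
    if high_stakes:
        return "deep_consensus"
    if looks_complex:
        return "balanced"
    if "?" in q and len(q.split()) <= 8:
        return "direct"          # short, direct question
    return "fast_consensus"      # uncertain: default to some coordination
```

In production this heuristic would be one input among several (historical patterns, a learned classifier), with an escalation path when the initial strategy proves insufficient.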
Convergence and Termination
Multi-round coordination raises the question of when to stop. Fixed round counts are simple but wasteful—some debates converge quickly while others need more rounds. Adaptive termination based on convergence detection can improve efficiency.
Convergence Detection
Convergence occurs when additional rounds produce diminishing changes. Detection approaches include semantic similarity between rounds (are agents repeating themselves?), agreement metrics (are positions becoming more aligned?), and novelty measures (are new arguments still emerging?).
Convergence doesn't guarantee correctness—agents can converge on wrong answers. But persistent divergence often signals genuine ambiguity or complexity that warrants flagging for human review rather than forced resolution.
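One cheap convergence check compares each agent's position across consecutive rounds. This sketch uses token-level Jaccard similarity; a real system would likely use embedding similarity, and the 0.9 threshold is an assumption that needs tuning.

```python
def jaccard(a: str, b: str) -> float:
    # Token-set overlap as a rough "are they repeating themselves?" signal.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def has_converged(prev_round: list[str], this_round: list[str],
                  threshold: float = 0.9) -> bool:
    # Converged when every agent's position barely changed since last round.
    return all(jaccard(p, c) >= threshold
               for p, c in zip(prev_round, this_round))
```

The same machinery supports early termination: if first-round positions are already near-identical across agents, skip the remaining rounds entirely.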
Early Termination
When agents agree strongly from the first round, additional rounds add latency without value. Terminating early on high initial agreement can significantly reduce average coordination time while preserving extended debate for the cases that genuinely need it.
Opinion: Convergence-based termination works well in practice but requires careful threshold tuning. Too aggressive termination misses cases where initial agreement masks underlying issues. Too conservative termination wastes resources on already-resolved debates.
Synthesis and Aggregation
After coordination rounds complete, results must be synthesized into coherent output. Synthesis is a distinct phase from debate—it combines positions rather than arguing for one. The synthesis agent must identify areas of consensus, acknowledge remaining disagreements, and produce output that reflects the collective deliberation.
Consensus Extraction
Effective synthesis explicitly identifies what agents agreed on versus where they diverged. This transparency helps users understand confidence levels and areas of uncertainty. Simply averaging positions or picking a winner loses valuable information about the deliberation process.
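If agent positions are expressed as sets of discrete claims, consensus extraction is set intersection and divergence is the remainder. This is a sketch: real systems must match claims semantically rather than by exact identity, which is where most of the difficulty lives.

```python
def extract_consensus(positions: dict[str, set[str]]):
    """Split claims into (consensus, divergent) across all agents."""
    if not positions:
        return set(), set()
    claim_sets = list(positions.values())
    consensus = set.intersection(*claim_sets)   # claimed by every agent
    divergent = set.union(*claim_sets) - consensus  # claimed by only some
    return consensus, divergent

positions = {
    "agent_a": {"X causes Y", "Z is unrelated"},
    "agent_b": {"X causes Y", "Z is a confounder"},
}
consensus, divergent = extract_consensus(positions)
```

Surfacing both sets to the synthesis prompt, rather than only a merged answer, is what preserves the confidence information the section above argues for.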
Model Selection for Synthesis
Synthesis is primarily an aggregation task rather than a generation task. This means smaller, faster models often suffice—the creative work happened during debate, and synthesis just needs to organize and combine results. Using lightweight models for synthesis can significantly reduce overall latency and cost.
Opinion: The synthesis phase is often under-designed. Systems focus on the debate mechanics and treat synthesis as an afterthought. But synthesis quality directly determines user-visible output quality. Investing in synthesis prompts that explicitly extract consensus/divergence produces noticeably better final outputs.
Communication Protocols
How agents communicate affects coordination efficiency and quality. Key design choices include message format, visibility rules, and turn-taking protocols.
Structured vs. Natural Language
Agents can communicate in structured formats (explicit fields for claims, evidence, confidence) or natural language. Structured formats enable automated processing and metrics but constrain expression. Natural language is flexible but harder to parse reliably.
Hybrid approaches use natural language for substantive content with structured metadata for coordination signals (confidence levels, explicit agreements/disagreements, requests for clarification).
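One possible shape for such a hybrid message: free-form content plus structured coordination metadata. The field names here are assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMessage:
    sender: str
    content: str                        # free-form natural language
    confidence: float = 0.5             # self-reported, in [0, 1]
    agrees_with: list[str] = field(default_factory=list)   # message ids
    disputes: list[str] = field(default_factory=list)      # message ids
    needs_clarification: bool = False   # explicit coordination signal

msg = AgentMessage(
    sender="critic",
    content="The second premise seems unsupported.",
    confidence=0.8,
    disputes=["msg-003"],
)
```

The structured fields let the orchestrator compute agreement metrics and route clarification requests without parsing prose, while the substantive argument stays in natural language.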
Full vs. Partial Visibility
Should all agents see all messages, or should communication be filtered? Full visibility enables rich cross-referencing but creates noise as agent count grows. Partial visibility (agents only see messages directed to them) reduces noise but risks missing relevant context.
Opinion: Full visibility works for small agent counts (3-5). Beyond that, some form of attention or filtering mechanism becomes necessary. The orchestrator/synthesizer role can serve as an information bottleneck that manages visibility without explicit filtering rules.
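Partial visibility reduces to a filter over the message log. In this sketch, `to=None` marks a broadcast visible to everyone; the dict-based message shape is purely illustrative.

```python
def visible_messages(messages: list[dict], agent: str) -> list[dict]:
    # An agent sees broadcasts plus messages addressed to it.
    return [m for m in messages
            if m.get("to") is None or m["to"] == agent]

log = [
    {"sender": "orchestrator", "to": None,      "content": "task posted"},
    {"sender": "worker_a",     "to": "worker_b", "content": "need your output"},
    {"sender": "worker_b",     "to": "worker_a", "content": "here it is"},
]
```

Routing everything through an orchestrator, as suggested above, amounts to setting every `to` field to the orchestrator and letting it decide what to rebroadcast.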
Failure Modes and Robustness
Multi-agent systems introduce failure modes beyond single-agent systems. Understanding these failures informs defensive design.
Groupthink
Agents may converge on incorrect positions, especially when initialized with similar prompts or when early agents' outputs bias later ones. Mitigation includes diverse agent configurations, explicit devil's advocate roles, and independent parallel processing before sharing.
Cascading Failures
If early pipeline stages produce poor output, later stages inherit and potentially amplify errors. Validation stages can catch some issues, but subtle errors may propagate undetected.
Coordination Overhead
Multi-agent coordination adds latency and cost. If coordination doesn't improve output quality proportionally, simpler single-agent approaches may be preferable. Systems should monitor coordination value-add and fall back to simpler patterns when overhead exceeds benefit.
Opinion: The biggest practical failure mode is coordination overhead that doesn't justify itself. Many tasks that seem to need multi-agent coordination can be handled adequately by a single capable agent with good prompting. Multi-agent systems should be reserved for genuinely complex tasks where the overhead demonstrably improves outcomes.
Trade-off Summary
Orchestrator-Worker
Strengths: Clear authority, simple coordination, easy monitoring.
Weaknesses: Single point of failure, bottleneck risk, orchestrator must understand all domains.
Pipeline
Strengths: Predictable latency, clear data flow, stage-appropriate model sizing.
Weaknesses: No backtracking, error propagation, inflexible ordering.
Debate
Strengths: Surfaces errors, handles ambiguity, multiple perspectives.
Weaknesses: High overhead, can stall on irreconcilable positions, requires careful prompt design.
Ensemble Voting
Strengths: Fault tolerance, independent processing, simple aggregation.
Weaknesses: Shared failure modes, majority not always right, cost scales linearly.
Open Questions
- How should credit assignment work when multiple agents contribute to an outcome?
- Can agents learn to coordinate better over time, or is coordination structure necessarily fixed?
- What's the right balance between agent homogeneity (simpler coordination) and heterogeneity (diverse perspectives)?
- How do you maintain coordination efficiency as agent count scales?
- When should systems escalate from automatic coordination to human oversight?
Further Reading
- Bernstein, D. et al. (2002). "The Complexity of Decentralized Control of Markov Decision Processes." Mathematics of Operations Research.
- Foerster, J. et al. (2016). "Learning to Communicate with Deep Multi-Agent Reinforcement Learning." NeurIPS.
- Du, Y. et al. (2023). "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv:2305.14325.
- Liang, T. et al. (2023). "Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate." arXiv:2305.19118.
- Hong, S. et al. (2023). "MetaGPT: Meta Programming for Multi-Agent Collaborative Framework." arXiv:2308.00352.
- Wu, Q. et al. (2023). "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv:2308.08155.
Part of the Research Notes series. These notes represent work in progress—ideas being developed in public.