Multi-Agent Chat Flow | Rafael Pérez

Chat interfaces look simple from the outside, but once multiple agents and tools are involved, the architecture becomes much more interesting. A good multi-agent system depends on clear intent detection, reliable routing, safe tool execution, and enough structure to keep the experience natural for the user.

Core concepts

Intent: The underlying goal behind a user’s message.
Utterance: The exact text the user types into the chat to express an intent. Many different utterances can map to the same intent.
Slot: A specific piece of extracted data needed to complete the request.
Agent tool: A callable capability an agent uses to read data, trigger an action, or interact with another system.
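
To make these terms concrete, the sketch below models them as plain Ruby structs. The names Slot, Intent, and ParsedMessage, along with the sample intent and slot values, are hypothetical and chosen only for illustration; they do not come from any particular framework.

```ruby
# Hypothetical value objects; none of these names come from a specific framework.
Slot   = Struct.new(:name, :value, keyword_init: true)
Intent = Struct.new(:name, :required_slots, keyword_init: true)

# An utterance paired with the intent and the slots extracted from it.
ParsedMessage = Struct.new(:utterance, :intent, :slots, keyword_init: true) do
  # Names of required slots that have not been filled yet.
  def missing_slots
    intent.required_slots - slots.map(&:name)
  end
end

check_stock = Intent.new(name: "check_stock", required_slots: %i[product warehouse])

parsed = ParsedMessage.new(
  utterance: "How many blue chairs do we have?",
  intent: check_stock,
  slots: [Slot.new(name: :product, value: "blue chairs")]
)

parsed.missing_slots # the :warehouse slot is still unfilled
```

A non-empty missing_slots list is one natural trigger for a follow-up question before any tool is called.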

Basic chat workflow

A simple multi-agent chat flow can be represented like this:

user input from chat interface
-> platform discovers user intent
-> platform calls the right agent
-> agent processes request
-> agent calls the tool
-> tool executes and returns its response
-> agent takes the tool result and replies back to the user
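
The same steps can be sketched as a toy Ruby module. Everything here is a hypothetical stand-in: the keyword check replaces a real intent classifier, and the tool returns canned data instead of querying a real system.

```ruby
# Toy end-to-end flow. Every name here is a hypothetical stand-in.
module ChatFlowDemo
  module_function

  # Tool: returns canned data instead of querying a real system.
  def stock_lookup(product)
    { product: product, quantity: 12 }
  end

  # Agent: calls its tool and turns the result into a reply.
  def inventory_agent(_message)
    result = stock_lookup("chairs")
    "We have #{result[:quantity]} #{result[:product]} in stock."
  end

  # Agent used when no specialized agent matches.
  def fallback_agent(_message)
    "Sorry, could you rephrase that?"
  end

  # Intent discovery: a keyword check standing in for a real classifier.
  def detect_intent(message)
    message.downcase.include?("stock") ? :inventory : :unknown
  end

  # Platform entry point: discover the intent, then call the matching agent.
  def handle_chat(message)
    case detect_intent(message)
    when :inventory then inventory_agent(message)
    else fallback_agent(message)
    end
  end
end

puts ChatFlowDemo.handle_chat("How many chairs are in stock?")
```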

This flow looks straightforward, but one step carries most of the architectural risk: discovering the user’s intent. If the platform gets that part wrong, the wrong agent receives the request, the wrong tool may be called, and the final answer can become irrelevant or even dangerous.

The intent problem

In a multi-agent system, intent discovery is not a nice-to-have feature; it is the routing layer that determines whether the entire interaction succeeds. A natural language interface should understand what the user means without forcing them to manually pick an agent, because making users choose an internal system component breaks the conversational model.

From the developer’s perspective, this raises practical questions such as:

How should free-text messages be classified into valid agent domains?
What should happen when a message could belong to more than one agent?
How should the system behave when the confidence score is low?
Should routing rely on rules, embeddings, an LLM, or a hybrid approach?
How can routing decisions be logged, evaluated, and improved over time?
When should the platform ask a clarification question instead of guessing?
How can the platform prevent an agent from using the wrong tool after a bad routing decision?
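
One possible answer to the confidence and clarification questions is a small decision policy that sits between the classifier and the agents. The sketch below is an assumption, not a prescribed design: the RoutingPolicy name, both thresholds, and the shape of the returned hash are illustrative only.

```ruby
module RoutingPolicy
  MIN_CONFIDENCE = 0.6 # assumed cutoff below which the router refuses to guess
  MIN_MARGIN     = 0.1 # assumed minimum lead over the runner-up

  module_function

  # scores: Hash of agent key => confidence in 0.0..1.0
  def decide(scores)
    ranked = scores.sort_by { |_, score| -score }
    best_key, best_score = ranked[0]
    _, runner_up_score = ranked[1]

    # Nothing scored high enough: do not guess.
    return { action: :fallback } if best_score.nil? || best_score < MIN_CONFIDENCE

    # Two agents are nearly tied: ask the user instead of picking one.
    if runner_up_score && (best_score - runner_up_score) < MIN_MARGIN
      return { action: :clarify, candidates: ranked.first(2).map(&:first) }
    end

    { action: :route, agent: best_key }
  end
end

RoutingPolicy.decide({ billing: 0.91, support: 0.40 }) # routes to :billing
RoutingPolicy.decide({ billing: 0.72, support: 0.69 }) # asks the user to choose
RoutingPolicy.decide({ billing: 0.30, support: 0.25 }) # falls back
```

Returning a plain hash also makes each decision easy to log and evaluate later.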

From static routing to semantic routing

A common first implementation is a static router with hardcoded assumptions. That approach often works for a demo, but it quickly breaks once users start asking for different kinds of help in plain language.

For example, if every request falls back to an inventory-related agent, a message like “How do I reset my password?” could be interpreted as a warehouse or product lookup. The root problem is not the downstream agent; it is the routing strategy.

The better approach is to move from a static router to a dynamic semantic router. Two common patterns are especially useful: the LLM Supervisor and the Semantic Vector Router.

Pattern 1: The LLM Supervisor

The LLM Supervisor pattern uses a small, fast language model whose only responsibility is to classify the user’s message into one of the available agent categories. Instead of hardcoding routing logic in application code, the platform delegates intent detection to a constrained model call that returns a valid routing key.

This pattern is especially helpful when the agent domains are semantically close and require reasoning to distinguish. It also works well when the router needs to handle messy real-world phrasing, indirect requests, or sentences with missing context.

Example: anonymized Ruby router

require "json"

# Routes a chat message to a specialized agent based on its detected intent.
class AgentRouter
  def initialize(message:)
    @message = message.to_s
  end

  def call
    agent_key = discover_intent(@message)

    case agent_key
    when "inventory"
      InventoryAgent.handle(message: @message).generate_now!
    when "support"
      SupportAgent.handle(message: @message).generate_now!
    when "billing"
      BillingAgent.handle(message: @message).generate_now!
    else
      GeneralFallbackAgent.handle(message: @message).generate_now!
    end
  end

  private

  def discover_intent(message)
    response = LLMClient.new.chat(
      parameters: {
        model: "small-fast-model",
        response_format: { type: "json_object" },
        messages: [
          {
            role: "system",
            content: "You are a routing assistant. Classify the user message into one of these categories: 'inventory', 'support', or 'billing'. Respond only with JSON like {\"category\":\"inventory\"}."
          },
          {
            role: "user",
            content: message
          }
        ]
      }
    )

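    # Parse the model's JSON reply. An unrecognized category is handled by
    # the else branch in #call, which routes to the fallback agent.
    JSON.parse(response.dig("choices", 0, "message", "content"))["category"]
  rescue StandardError
    # Any API or parsing failure also routes to the fallback agent.
    "fallback"
  end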
end

In this pattern, the model acts like a semantic dispatcher. It does not solve the user’s problem directly; it only decides which specialized agent should handle the next step.

Pattern 2: The Semantic Vector Router

The Semantic Vector Router avoids a chat-completion call for routing and instead relies on embeddings and similarity search. Each message still needs one embedding call, but this is often faster and cheaper than a completion, especially when the available agent domains are clearly different from each other.

What are embeddings?

Embeddings are numeric vector representations of text that encode semantic meaning, so phrases with similar meaning tend to be located close to each other in vector space. In practice, this means the system can compare a user’s message to pre-defined example phrases and choose the nearest semantic match.
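
A tiny worked example may help. The three-dimensional vectors below are hand-written toy values, not output from a real embedding model (real embeddings typically have hundreds or thousands of dimensions), and the variable names are hypothetical. The comparison uses cosine similarity, one common way to measure closeness in vector space.

```ruby
# Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
def cosine_similarity(a, b)
  dot  = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Hand-written toy vectors, not real embeddings.
reset_password = [0.9, 0.1, 0.0]
forgot_login   = [0.8, 0.2, 0.1]
show_invoice   = [0.1, 0.1, 0.9]

close = cosine_similarity(reset_password, forgot_login) # similar meaning, high score
far   = cosine_similarity(reset_password, show_invoice) # different topic, low score

puts close > far
```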

Example: anonymized routing approach

class SemanticRouter
  def initialize(embedding_client:)
    @embedding_client = embedding_client
  end

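  # Returns the best-matching agent key, or "fallback" when no example is
  # similar enough. The 0.5 default for min_score is an illustrative
  # assumption; tune it against real traffic.
  def call(message, min_score: 0.5)
    message_vector = embed(message)

    scores = {
      inventory: max_similarity(message_vector, inventory_examples),
      support: max_similarity(message_vector, support_examples),
      billing: max_similarity(message_vector, billing_examples)
    }

    agent_key, best_score = scores.max_by { |_, score| score }
    best_score >= min_score ? agent_key.to_s : "fallback"
  end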

  private

  def embed(text)
    @embedding_client.embed(text)
  end

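  # Example vectors are memoized so each phrase is embedded once per router
  # instance instead of on every call.
  def inventory_examples
    @inventory_examples ||= [
      embed("where is the stock"),
      embed("create a warehouse"),
      embed("list available products")
    ]
  end

  def support_examples
    @support_examples ||= [
      embed("login issue"),
      embed("error on screen"),
      embed("reset my password")
    ]
  end

  def billing_examples
    @billing_examples ||= [
      embed("show my invoice"),
      embed("payment failed"),
      embed("what is my current balance")
    ]
  end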

  def max_similarity(message_vector, example_vectors)
    example_vectors.map { |vector| cosine_similarity(message_vector, vector) }.max
  end

  def cosine_similarity(a, b)
    denominator = magnitude(a) * magnitude(b)
    return 0.0 if denominator.zero? # a zero vector would otherwise yield NaN

    dot_product(a, b) / denominator
  end

  def dot_product(a, b)
    a.zip(b).sum { |x, y| x * y }
  end

  def magnitude(vector)
    Math.sqrt(vector.sum { |value| value**2 })
  end
end

This pattern works best when the semantic boundaries between agents are distinct enough that nearest-neighbor matching is reliable. It is also a strong option when routing latency matters.

When to use each pattern

The LLM Supervisor is usually the better choice when the system needs reasoning, contextual understanding, or cleaner handling of ambiguous requests. The Semantic Vector Router is a better fit when the domains are clearly separated and the main goal is low-cost, low-latency intent classification.

In practice, many production systems combine both. For example, a vector router can handle obvious cases quickly, while low-confidence matches can be escalated to an LLM supervisor for a second opinion.
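
A minimal sketch of that combination follows. It assumes the vector router is adapted to return both an agent key and its similarity score, and that the supervisor returns only an agent key. The HybridRouter name, the two callables, and the 0.75 cutoff are all hypothetical.

```ruby
class HybridRouter
  # vector_router: responds to #call(message) and returns [agent_key, score]
  # supervisor:    responds to #call(message) and returns an agent_key
  # min_score:     assumed cutoff; tune it against logged routing decisions
  def initialize(vector_router:, supervisor:, min_score: 0.75)
    @vector_router = vector_router
    @supervisor = supervisor
    @min_score = min_score
  end

  def call(message)
    agent_key, score = @vector_router.call(message)
    return agent_key if score >= @min_score # obvious case: skip the LLM call

    @supervisor.call(message) # low confidence: ask for a second opinion
  end
end

# Stubs standing in for real routers.
vector_router = ->(message) { message.include?("invoice") ? ["billing", 0.92] : ["support", 0.41] }
supervisor    = ->(_message) { "inventory" }

router = HybridRouter.new(vector_router: vector_router, supervisor: supervisor)
router.call("show my invoice")      # handled by the vector router
router.call("the thing is missing") # escalated to the supervisor
```

Because both routers are injected, the cutoff and either implementation can be swapped without touching the calling code.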

Multi-agent complexity in the real world

Multi-agent development becomes complex very quickly because each agent is not only responsible for reasoning, but also for selecting tools, managing state, and deciding when to act. Once several agents and tools are connected to the same chat interface, the architecture must handle intent routing, permission boundaries, retries, tool failures, observability, and conversational continuity at the same time.

The difficulty grows further when an agent's tools mix read-only and update-capable operations. Reading data is usually safe, but writing data changes system state, which means the platform must be much more careful about validation and execution.
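
One way to encode that distinction is to mark each tool as mutating or not and let a single executor enforce the rule. The sketch below is illustrative only: the Tool struct, the ToolExecutor class, and the in-memory stock hash are hypothetical names, and a real system would also need authentication and audit logging.

```ruby
# Hypothetical tool descriptor: `mutating` marks tools that change system state.
Tool = Struct.new(:name, :mutating, :action, keyword_init: true)

class ToolExecutor
  # Read-only tools run immediately. Mutating tools run only once the
  # caller passes confirmed: true (for example after the user says "yes").
  def run(tool, args, confirmed: false)
    if tool.mutating && !confirmed
      return { status: :needs_confirmation, tool: tool.name }
    end

    { status: :ok, result: tool.action.call(args) }
  end
end

stock = { "chairs" => 12 } # in-memory stand-in for a real data store

read_stock = Tool.new(name: "read_stock", mutating: false,
                      action: ->(args) { stock.fetch(args[:product]) })
set_stock  = Tool.new(name: "set_stock", mutating: true,
                      action: ->(args) { stock[args[:product]] = args[:quantity] })

executor = ToolExecutor.new
executor.run(read_stock, { product: "chairs" })                              # runs right away
executor.run(set_stock, { product: "chairs", quantity: 5 })                  # waits for confirmation
executor.run(set_stock, { product: "chairs", quantity: 5 }, confirmed: true) # applies the change
```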

Complexities of mixing read-only and update tools

A misclassified intent can trigger a write action when the user only wanted information.
A read-only tool may be safe to run automatically, but an update tool may require confirmation first.
Two agents can read the same state and then perform conflicting updates based on stale data.
Retries become dangerous when a failed request may have partially completed a write operation.
Audit logs become essential because the system must explain not only what it answered, but also what it changed.
Access control becomes more granular because some agents may be allowed to read records but not modify them.
Rollback strategies become necessary when one tool in a multi-step chain succeeds and another fails afterward.
Human approval may be needed for sensitive operations such as payments, deletions, account changes, or inventory adjustments.
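
Two of these risks, stale reads and unsafe retries, can be illustrated with a small in-memory sketch. It combines a version check (often called optimistic locking) with idempotency keys. The InventoryStore class and its method names are hypothetical, and a production system would persist this state in a database instead of instance variables.

```ruby
class InventoryStore
  StaleWriteError = Class.new(StandardError)

  def initialize(quantity)
    @quantity = quantity
    @version = 0
    @applied = {} # idempotency key => result of the write it already applied
  end

  def read
    { quantity: @quantity, version: @version }
  end

  # Applies a change only if the caller saw the latest version.
  # A retry with the same idempotency key returns the stored result
  # instead of applying the change a second time.
  def adjust(delta:, expected_version:, idempotency_key:)
    return @applied[idempotency_key] if @applied.key?(idempotency_key)

    unless expected_version == @version
      raise StaleWriteError, "expected version #{expected_version}, found #{@version}"
    end

    @quantity += delta
    @version += 1
    @applied[idempotency_key] = read
  end
end

store = InventoryStore.new(10)
snapshot = store.read

store.adjust(delta: -2, expected_version: snapshot[:version], idempotency_key: "req-1")
store.adjust(delta: -2, expected_version: snapshot[:version], idempotency_key: "req-1") # safe retry
# A second agent still holding the old snapshot would now get StaleWriteError.
```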

Final thoughts

A strong multi-agent chat system is not defined only by the intelligence of its agents, but by the quality of its orchestration. Intent discovery, routing strategy, and safe tool usage are what make the difference between a chat interface that feels smart and one that feels unpredictable.