
Multi-Agent Architecture, Part 9: Evaluation — Scoring and Improving Agent Conversations

March 21, 2026 · 8 min read

This is Part 9 of the Multi-Agent Architecture series. In Part 8 we covered observability — how to see what your system is doing. Today we close the loop: how do you know if what it is doing is actually good?


The Problem Nobody Wants to Talk About

You shipped your multi-agent system. Agents are routing, handoffs are flowing, tools are executing. The observability dashboard is green. But here is the question that keeps me up at night: are the agents actually helping?

Observability tells you the system is running. Evaluation tells you the system is working. There is a meaningful difference. A conversation can complete without errors, hit every checkpoint, resolve in under two minutes, and still leave the customer worse off than when they started — because the agent gave a confident, well-formatted, completely wrong answer.

Traditional software has test suites. You write assertions, run them, and get a binary pass/fail. Multi-agent conversations do not work that way. The output is natural language. The quality depends on context, tone, accuracy, and whether the agent actually solved the user's problem. You cannot assertEqual your way through that.

RoomKit approaches this with a combination of structured observations, task tracking, phase audit trails, and feedback loops that let you score conversations continuously — not just in a test harness, but in production.

Observations: Structured Quality Judgments

The core primitive for evaluation in RoomKit is the Observation model. An observation is a structured finding about a conversation — created by an evaluator (human or AI), attached to a specific room, and stored alongside the conversation data it describes.

from roomkit.models import Observation

# A supervisor agent evaluates a support agent's response
observation = Observation(
    room_id="support-case-7742",
    category="response_quality",
    confidence=0.87,
    content="Agent provided correct billing info but missed the refund eligibility window",
    metadata={
        "agent_id": "support-agent",
        "severity": "medium",
        "suggestion": "Check refund_policy tool before confirming timelines",
    },
)

# Store the observation via the conversation store
await kit.store.add_observation(observation)

Each observation has three fields that matter. The category is a string you define — "response_quality", "hallucination_check", "tone_assessment", "policy_compliance", whatever dimensions matter for your system. The confidence is a float between 0 and 1, expressing how certain the evaluator is about the finding. And the content string plus metadata dict carry the structured context: which agent was evaluated, what the issue was, and what should change.

Observations are not metrics. They are not logs. They are judgment records — structured opinions about conversation quality that accumulate over time and feed back into the system.

LLM-as-Judge: Automated Quality Scoring

Creating observations manually does not scale. The pattern that works in production is LLM-as-judge: you dedicate a supervisor agent to evaluate other agents' responses in real time.

In RoomKit, the supervisor channel receives every event in the conversation, even when it is not the active responder. This makes it the natural home for evaluation logic. The supervisor watches the conversation, judges each response, and creates observations via tool calls.

# Supervisor agent's system prompt includes evaluation instructions
supervisor_prompt = """You are a quality supervisor. You observe every message
in this conversation but do NOT respond to the user directly.

After each agent response, evaluate it on these dimensions:
- Accuracy: Did the agent provide correct information?
- Completeness: Did the agent address the full question?
- Tone: Was the response appropriate for the context?
- Policy: Did the agent follow company guidelines?

Use the create_observation tool to record your findings."""

# The supervisor's tool handler for creating observations
import json

async def supervisor_tools(tool_name: str, arguments: dict) -> str:
    if tool_name == "create_observation":
        observation = Observation(
            room_id=current_room_id,  # the supervised room, from the enclosing scope
            category=arguments["category"],
            confidence=arguments.get("confidence", 1.0),
            content=arguments["content"],
            metadata=arguments.get("metadata", {}),
        )
        await kit.store.add_observation(observation)
        return json.dumps({"status": "recorded"})
    return json.dumps({"error": f"Unknown tool: {tool_name}"})

The supervisor is just another AI channel in the room, but its purpose is purely evaluative. It does not answer user questions. It watches, scores, and records. For every response the support agent gives, the supervisor creates an observation with a category, a confidence score, and structured data explaining the judgment.

Over hundreds of conversations, these observations build a quality profile for each agent. You can aggregate by category to find patterns: maybe the billing agent scores well on accuracy but poorly on tone. Maybe the tech support agent is great at simple issues but hallucinates on edge cases. The observations tell you where to focus your prompt engineering.
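The aggregation itself is straightforward. Here is a minimal sketch using plain dicts in place of stored Observation objects (in practice the records would come back from the store, e.g. via list_observations):

```python
from collections import defaultdict

def quality_profile(observations: list[dict]) -> dict[tuple[str, str], float]:
    """Average confidence per (agent_id, category) pair."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for obs in observations:
        agent_id = obs.get("metadata", {}).get("agent_id")
        if agent_id:
            buckets[(agent_id, obs["category"])].append(obs["confidence"])
    return {key: sum(s) / len(s) for key, s in buckets.items()}

observations = [
    {"category": "response_quality", "confidence": 0.9,
     "metadata": {"agent_id": "billing-agent"}},
    {"category": "response_quality", "confidence": 0.7,
     "metadata": {"agent_id": "billing-agent"}},
    {"category": "tone_assessment", "confidence": 0.4,
     "metadata": {"agent_id": "billing-agent"}},
]
profile = quality_profile(observations)
# billing-agent averages 0.8 on response_quality but only 0.4 on tone_assessment
```

A profile like this is exactly the "accuracy good, tone poor" pattern described above, made queryable.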

Hooks as Evaluation Checkpoints

Not every evaluation needs an LLM. Some quality signals are structural — they come from how the conversation flows, not what was said. RoomKit's hook system fires at key moments where quality can be measured directly.

from roomkit.models.enums import HookTrigger
from roomkit.models import Observation
from roomkit.models.event import RoomEvent
from roomkit.models.context import RoomContext
from roomkit.orchestration.state import get_conversation_state

@kit.hook(HookTrigger.ON_PHASE_TRANSITION)
async def track_escalations(event: RoomEvent, context: RoomContext) -> None:
    # Extract conversation state from room metadata
    state = get_conversation_state(context.room)
    if not state:
        return

    # Flag conversations that escalate repeatedly
    escalations = [p for p in state.phase_history if p.to_phase == "escalation"]
    handling = [p for p in state.phase_history if p.to_phase == "handling"]

    if len(escalations) > 1:
        await kit.store.add_observation(Observation(
            room_id=context.room.id,
            category="escalation_pattern",
            confidence=0.95,
            content="Multiple escalations indicate agent unable to resolve",
            metadata={
                "escalation_count": len(escalations),
                "handling_attempts": len(handling),
            },
        ))

@kit.hook(HookTrigger.ON_HANDOFF)
async def track_handoff_quality(event: RoomEvent, context: RoomContext) -> None:
    # Record context about every agent transfer
    await kit.store.add_observation(Observation(
        room_id=context.room.id,
        category="handoff_analysis",
        confidence=0.90,
        content=f"Handoff from {event.metadata.get('source_agent', 'unknown')} to {event.metadata.get('target_agent', 'unknown')}",
        metadata={
            "reason": event.metadata.get("reason", "unspecified"),
        },
    ))

@kit.hook(HookTrigger.ON_HANDOFF_REJECTED)
async def track_rejected_handoffs(event: RoomEvent, context: RoomContext) -> None:
    # Rejected handoffs are a strong quality signal
    await kit.store.add_observation(Observation(
        room_id=context.room.id,
        category="handoff_rejected",
        confidence=0.99,
        content="Routing attempted handoff to unavailable or inappropriate agent",
        metadata={
            "attempted_target": event.metadata.get("target_agent", "unknown"),
            "rejection_reason": event.metadata.get("reason", "unknown"),
        },
    ))

Three hooks, three different quality signals — none requiring an LLM call. The ON_PHASE_TRANSITION hook extracts ConversationState from room metadata and examines phase_history to detect problematic patterns like repeated escalations. Each PhaseTransition record has from_phase, to_phase, reason, and a timestamp, giving you a precise audit trail. The ON_HANDOFF hook records context about every agent transfer. The ON_HANDOFF_REJECTED hook catches routing failures. Together, they build a structural quality profile of every conversation.

Task Completion as an Evaluation Metric

Observations capture qualitative judgments. But some evaluation is quantitative: did the agent actually finish what it was supposed to do?

RoomKit's Task model tracks discrete work items that agents perform during a conversation. Each task has a status — pending, completed, or failed — and task completion rates become a natural evaluation metric.

If a support agent creates a task to "process refund for order #4481" and that task stays in pending after the conversation ends, something went wrong. If an agent marks a task completed but the supervisor's observation flags inaccurate information, you have a discrepancy worth investigating. The combination of tasks and observations catches both omission errors (agent did not finish the work) and commission errors (agent finished but did it wrong).
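Both checks are a few lines each. A sketch over plain dicts rather than the actual Task model (the field names here are illustrative):

```python
def completion_rate(tasks: list[dict]) -> float:
    """Fraction of tasks that reached 'completed'; an empty list counts as fully done."""
    if not tasks:
        return 1.0
    return sum(1 for t in tasks if t["status"] == "completed") / len(tasks)

def unresolved(tasks: list[dict]) -> list[str]:
    """Tasks still pending after the conversation ended: omission errors."""
    return [t["title"] for t in tasks if t["status"] == "pending"]

tasks = [
    {"title": "process refund for order #4481", "status": "pending"},
    {"title": "update shipping address", "status": "completed"},
    {"title": "verify account identity", "status": "failed"},
]
rate = completion_rate(tasks)  # 1/3 of the work actually finished
stuck = unresolved(tasks)      # the refund never happened
```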

Feedback Loops: Observations Driving Routing

Observations become powerful when they feed back into the system. The simplest feedback loop: use accumulated observation scores to influence routing decisions.

from roomkit.orchestration import ConversationRouter, RoutingRule, RoutingConditions

async def build_quality_aware_router(room):
    # Pull recent observations for each agent from the store
    observations = await kit.store.list_observations(room.id)
    agent_scores = {}
    for obs in observations:
        if obs.category == "response_quality":
            agent_id = obs.metadata.get("agent_id")
            if agent_id:
                agent_scores.setdefault(agent_id, []).append(obs.confidence)

    # Calculate average quality score per agent
    avg_scores = {
        agent: sum(scores) / len(scores)
        for agent, scores in agent_scores.items()
    }

    # Build routing rules that prefer higher-scoring agents
    rules = []
    for agent_id, score in sorted(avg_scores.items(), key=lambda x: x[1], reverse=True):
        # Lower priority number = higher preference
        priority = int((1 - score) * 100)
        rules.append(RoutingRule(
            agent_id=agent_id,
            priority=max(priority, 1),
            conditions=RoutingConditions(),
        ))

    return ConversationRouter(rules=rules)

This is a basic version, but the principle scales. Agents with consistently high observation scores get lower priority numbers (higher preference). Agents with low scores get deprioritized. The router effectively learns which agents perform well and sends traffic accordingly.

You can make this more sophisticated: weight recent observations more heavily, factor in task completion rates, consider category-specific scores (an agent might score well on accuracy but poorly on tone — route formal requests to it, informal ones elsewhere). The observation data is structured, so the routing logic can be as nuanced as you need.
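Recency weighting, for instance, is a few lines of exponential decay. A sketch, assuming timestamps are plain epoch floats (how your store records observation time may differ):

```python
def weighted_score(observations: list[dict], now: float,
                   half_life_s: float = 86_400.0) -> float:
    """Confidence average where an observation's weight halves every half_life_s."""
    num = den = 0.0
    for obs in observations:
        weight = 0.5 ** ((now - obs["timestamp"]) / half_life_s)
        num += weight * obs["confidence"]
        den += weight
    return num / den if den else 0.0

history = [
    {"confidence": 0.9, "timestamp": 1_000.0},             # fresh: full weight
    {"confidence": 0.3, "timestamp": 1_000.0 - 86_400.0},  # a day old: half weight
]
score = weighted_score(history, now=1_000.0)
# (1.0 * 0.9 + 0.5 * 0.3) / 1.5 = 0.7 -- the recent observation dominates
```

Swapping this in for the plain average in the router above means an agent that improved last week is not held back by a bad month.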

Regression Tracking

Individual conversation evaluation matters, but the real value is longitudinal. Are your agents getting better or worse over time?

Because observations are stored per-room with timestamps, you can track quality trends across your entire system. A prompt change that improves average accuracy scores by 5% shows up in the observation data. A provider model update that causes a regression in tone scores shows up too — and you catch it before customers start complaining.

The phase history adds another dimension. If average time-in-escalation is climbing, that is a systemic quality signal even if individual observation scores look fine. The circuit breaker state on your delivery channels tells a reliability story: if a channel is spending more time in open or half-open states, delivery failures are affecting the user experience regardless of how good the agent responses are.
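Time-in-phase falls out of the phase history directly. A sketch over ordered transition records shaped like the PhaseTransition fields (from_phase, to_phase, timestamp), here as plain dicts:

```python
def seconds_in_phase(transitions: list[dict], phase: str) -> float:
    """Total time spent in `phase`, from an ordered list of transition records."""
    total, entered = 0.0, None
    for t in transitions:
        if t["to_phase"] == phase:
            entered = t["timestamp"]
        elif entered is not None and t["from_phase"] == phase:
            total += t["timestamp"] - entered
            entered = None
    return total

transitions = [
    {"from_phase": "handling",   "to_phase": "escalation", "timestamp": 100.0},
    {"from_phase": "escalation", "to_phase": "handling",   "timestamp": 160.0},
    {"from_phase": "handling",   "to_phase": "escalation", "timestamp": 200.0},
    {"from_phase": "escalation", "to_phase": "resolved",   "timestamp": 230.0},
]
escalation_time = seconds_in_phase(transitions, "escalation")
# 60s + 30s = 90 seconds in escalation; average this across rooms, then watch the trend
```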

Delivery results via DeliveryResult close the last gap. Every message sent through a channel returns a result with success/error status and the provider's message ID. A perfectly evaluated conversation means nothing if 20% of the responses never reached the user. Tracking delivery success rates alongside observation scores gives you the full picture.
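Pairing the two signals is a one-screen check. A sketch that flags exactly the misleading case described above, with result dicts standing in for DeliveryResult objects and the 0.8/0.95 thresholds chosen for illustration:

```python
def full_picture(avg_observation_score: float, deliveries: list[dict],
                 min_delivery_rate: float = 0.95) -> dict:
    """Pair the conversation's quality score with its delivery success rate."""
    rate = (sum(1 for d in deliveries if d["success"]) / len(deliveries)
            if deliveries else 1.0)
    return {
        "quality": avg_observation_score,
        "delivery_rate": rate,
        # high-scoring conversations the user partly never saw
        "misleading": avg_observation_score >= 0.8 and rate < min_delivery_rate,
    }

deliveries = [{"success": True}] * 8 + [{"success": False}] * 2
report = full_picture(avg_observation_score=0.9, deliveries=deliveries)
# quality 0.9 but delivery_rate 0.8: a "good" conversation the user half-missed
```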

Designing Good Test Cases

Evaluation in production is essential, but you also need repeatable test conversations to validate changes before they ship. A good evaluation test case is not a unit test — it is a scripted conversation with expected quality outcomes.

The approach I use with RoomKit: define a conversation scenario (a sequence of user messages), run it through the system, collect the observations the supervisor creates, and compare them against expected thresholds. If the accuracy observation drops below 0.8 after a prompt change, the change does not ship. If the escalation count for a scenario increases, investigate before deploying.
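One way to structure such a test case, as a sketch (the Scenario shape and threshold keys are mine, not a RoomKit API):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    messages: list[str]            # scripted user turns
    thresholds: dict[str, float]   # observation category -> minimum average score

def failing_categories(scenario: Scenario, scores: dict[str, float]) -> list[str]:
    """Categories whose average observation score fell below the scenario's bar."""
    return [cat for cat, minimum in scenario.thresholds.items()
            if scores.get(cat, 0.0) < minimum]

scenario = Scenario(
    name="refund request",
    messages=["Hi, I'd like a refund for my last order."],
    thresholds={"response_quality": 0.8, "policy_compliance": 0.9},
)
# scores would come from the supervisor's observations after replaying the scenario
failures = failing_categories(scenario, {"response_quality": 0.75,
                                         "policy_compliance": 0.95})
# response_quality is below its 0.8 bar, so the change does not ship
```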

This is not binary pass/fail. It is threshold-based scoring with human review for edge cases. The observations provide the scores. The phase history provides the behavioral trace. Together, they give you enough signal to make deployment decisions with confidence.


Closing the Series

This is Part 9, the final pillar. If you have followed the series from the introduction, we have walked the full path: from how users enter the system (interaction), through how agents are chosen (orchestration), what they know (knowledge), what they remember (storage), how they think (agents), what they can do (integration, external tools), how you watch them (observability), and now — how you judge and improve them (evaluation).

The through-line across all nine pillars is that multi-agent systems are engineering problems, not AI problems. The LLM call is one function. Everything around it — routing, state, tools, evaluation — is what determines whether your system works in production or just works in a demo.

RoomKit was built on the premise that conversations are the organizing primitive. Rooms scope state. Channels separate concerns. Hooks give you control points. Observations close the feedback loop. Each pillar maps to a concrete abstraction in the framework, and each abstraction composes with the others without coupling.

Build systems that actually work. Then prove that they do.


This article is the final part of a 9-part series on production-ready multi-agent architecture. Read the series introduction for the full roadmap.

Series: Introduction · Part 1: User Interaction · Part 2: Orchestration · Part 3: Knowledge · Part 4: Storage · Part 5: Agents · Part 6: Integration · Part 7: External Tools · Part 8: Observability · Part 9: Evaluation