
Multi-Agent Architecture, Part 8: Observability — Seeing Inside Your Multi-Agent System

March 21, 2026 · 8 min read

This is Part 8 of the Multi-Agent Architecture series. In Part 7 we covered external tools — how agents execute actions in the outside world. Today we tackle the question that keeps coming up once your system is running: what is actually happening inside it?


The Black Box Problem

A user calls your voice assistant. The STT provider transcribes their speech. An inbound pipeline normalizes the message. The orchestrator routes it to one of three agents. That agent calls an LLM, which invokes two tools, which triggers a broadcast back through TTS to the user. Total wall time: 2.3 seconds. The user heard a response. Everything looks fine.

Except next Tuesday, the same flow takes 11 seconds. The user hangs up. Your team opens a ticket: "voice assistant is slow." Now what? Which of the six components in that chain got slower? Was it the STT provider? The LLM? A tool call that timed out and retried? Without observability, you are guessing. And in a multi-agent system with multiple LLM providers, multiple tool integrations, and conversations that span agents — guessing does not scale.

I have debugged enough production systems to know that the answer is never "add more logging." The answer is structured telemetry: spans with parent-child relationships, attributes with standardized names, and metrics that let you slice by session, agent, provider, and operation type. That is what RoomKit's telemetry layer provides out of the box.

TelemetryProvider: Three Tiers for Three Stages

RoomKit defines a TelemetryProvider ABC with three implementations, each designed for a different stage of your project lifecycle.

NoopTelemetryProvider is the default. It does nothing — zero overhead, zero output. When you are prototyping and do not care about traces yet, you pay nothing for the telemetry infrastructure being there. Every span call is a no-op.

ConsoleTelemetryProvider is for development. It prints spans and metrics to stderr in a human-readable format. When you are building locally and want to see the flow of a request without setting up a collector, this is what you reach for.

from roomkit import RoomKit
from roomkit.telemetry import ConsoleTelemetryProvider

kit = RoomKit(
    telemetry=ConsoleTelemetryProvider(),
)

# Every span now prints to stderr:
# [SPAN] VOICE_SESSION session_id=sess-a1b2 room_id=room-7742 duration_ms=4312
#   [SPAN] STT_TRANSCRIBE provider=deepgram ttfb_ms=180 duration_ms=620
#   [SPAN] INBOUND_PIPELINE frame_count=1 bytes_processed=2048
#     [SPAN] LLM_GENERATE provider=openai model=gpt-4o input_tokens=340 output_tokens=89 ttfb_ms=410
#       [SPAN] LLM_TOOL_CALL tool=lookup_order duration_ms=120
#   [SPAN] BROADCAST targets=3 duration_ms=12
#     [SPAN] DELIVERY channel=voice-out duration_ms=45
#     [SPAN] TTS_SYNTHESIZE provider=elevenlabs ttfb_ms=95 duration_ms=310

That output alone tells you more about a single request than most logging setups will tell you in a week. You can see the full span hierarchy, every provider involved, and exactly where the time went.

OpenTelemetryProvider is for production. It bridges RoomKit's internal telemetry to the OpenTelemetry SDK, which means your spans flow directly into whatever backend you already use — Jaeger, Grafana Tempo, Datadog, Honeycomb. No custom exporters, no proprietary format. Standard OTel.

from roomkit import RoomKit
from roomkit.telemetry.opentelemetry import OpenTelemetryProvider
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Standard OTel setup
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger:4317"))
)

# Bridge RoomKit telemetry to OTel
kit = RoomKit(
    telemetry=OpenTelemetryProvider(tracer_provider=tracer_provider),
)

# Every RoomKit span is now an OTel span
# session_id, room_id, and all attributes propagate automatically
# View traces in Jaeger, Grafana, or any OTel-compatible backend

The key design decision here is that RoomKit does not depend on the OpenTelemetry SDK. The OpenTelemetryProvider is the bridge. If you never import it, you never pull in the OTel dependency. If you do, you get full compatibility with zero adapter code.
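That optional-dependency pattern is worth stealing for your own integrations. A minimal sketch of the idea — the helper name and the `roomkit[otel]` extra are illustrative assumptions, not RoomKit's actual internals:

```python
import importlib


def lazy_import(module_name: str, install_hint: str):
    """Import a heavy optional dependency only when it is actually requested.

    The core package never imports the module at top level, so users who do
    not need the bridge never pay for the dependency. If it is missing, we
    raise an actionable error instead of a bare ImportError.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is required for this provider; "
            f"install it with: {install_hint}"
        ) from exc


# A bridge module would resolve the SDK at construction time, e.g.:
# trace_sdk = lazy_import("opentelemetry.sdk.trace", "pip install roomkit[otel]")
```

The only cost of the pattern is a slightly later failure: the missing dependency surfaces when you construct the bridge, not when you import the package.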

SpanKind: A Vocabulary for Multi-Agent Tracing

Generic spans with string names are barely better than log lines. What makes RoomKit's telemetry useful is SpanKind — a set of constants that categorize every operation in the system: VOICE_SESSION, STT_TRANSCRIBE, TTS_SYNTHESIZE, INBOUND_PIPELINE, HOOK_SYNC, LLM_GENERATE, LLM_TOOL_CALL, BROADCAST, and DELIVERY. Each span has a kind, and that kind tells you exactly which layer of the architecture you are looking at.

These are not arbitrary strings. They form a hierarchy. A VOICE_SESSION span contains STT_TRANSCRIBE, INBOUND_PIPELINE, LLM_GENERATE, BROADCAST, and TTS_SYNTHESIZE as children. A BROADCAST contains one DELIVERY per target channel. An LLM_GENERATE may contain one or more LLM_TOOL_CALL spans. The hierarchy mirrors the actual execution flow, so when you open a trace in Jaeger, you see the architecture of the request, not a flat list of events.

VOICE_SESSION (4312ms)
├── STT_TRANSCRIBE (620ms, ttfb=180ms)
├── INBOUND_PIPELINE (1840ms)
│   ├── HOOK_SYNC: validate_content (3ms)
│   └── LLM_GENERATE (1420ms, tokens_in=340, tokens_out=89)
│       ├── LLM_TOOL_CALL: lookup_order (120ms)
│       └── LLM_TOOL_CALL: check_inventory (95ms)
└── BROADCAST (357ms)
    ├── DELIVERY: voice-out (45ms)
    └── TTS_SYNTHESIZE (310ms, ttfb=95ms)

Attributes That Answer Real Questions

Every span carries typed attributes. These are not free-form key-value pairs you hope someone remembers to populate — they are standardized fields that RoomKit sets automatically on every span of the appropriate kind.

DURATION_MS is on every span. TTFB_MS (time-to-first-byte) is on LLM, STT, and TTS spans — this is the metric that matters most for perceived responsiveness. A TTS call that takes 800ms total but streams the first audio chunk at 95ms feels instant. One that buffers for 800ms before sending anything feels broken. TTFB separates the two.
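The distinction is cheap to instrument. A sketch of how a streaming call can record both metrics — function and variable names here are illustrative, not RoomKit API:

```python
import time


def measure_stream(stream):
    """Consume a streaming response, recording TTFB and total duration.

    TTFB is the gap before the *first* chunk only; duration_ms covers the
    whole stream. A fast TTFB with a long total duration still feels
    responsive, because playback starts while the rest is in flight.
    """
    start = time.monotonic()
    ttfb_ms = None
    chunks = []
    for chunk in stream:
        if ttfb_ms is None:  # first chunk: record time-to-first-byte once
            ttfb_ms = (time.monotonic() - start) * 1000.0
        chunks.append(chunk)
    duration_ms = (time.monotonic() - start) * 1000.0
    return chunks, ttfb_ms, duration_ms
```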

LLM_INPUT_TOKENS and LLM_OUTPUT_TOKENS are on every LLM_GENERATE span. This gives you token-level cost attribution per session, per agent, per room. When your monthly LLM bill spikes, you can query your trace backend and find out exactly which agent, which conversation, and which prompt is responsible.
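With those attributes on every span, the cost query is a plain aggregation over your trace backend's results. A sketch, assuming spans exported as dicts and made-up per-1K-token rates (substitute your provider's real pricing):

```python
from collections import defaultdict

# Hypothetical rates per 1K tokens — NOT real provider pricing.
RATES = {"input": 0.0025, "output": 0.010}


def cost_by_agent(llm_spans):
    """Sum estimated LLM spend per agent from LLM_GENERATE span attributes."""
    totals: dict[str, float] = defaultdict(float)
    for span in llm_spans:
        attrs = span["attributes"]
        cost = (
            attrs["llm.input_tokens"] / 1000 * RATES["input"]
            + attrs["llm.output_tokens"] / 1000 * RATES["output"]
        )
        totals[attrs["agent"]] += cost
    return dict(totals)
```

The same shape of query works per session or per room, since every LLM span carries those IDs too.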

For realtime sessions, you get uptime_s and turn_count. For pipelines, frame_count and bytes_processed. For deliveries, duration_ms per channel. Every metric is scoped to a span, and every span is scoped to a session and room.

Context Propagation with ContextVars

In an async Python system, correlating operations to a session is not trivial. RoomKit uses contextvars.ContextVar to propagate session_id and room_id through the entire call stack. When a voice session starts, these values are set once. Every span created within that async context — STT, LLM, hooks, broadcasts — automatically inherits them. No manual threading of IDs through function signatures. No middleware that might miss a code path.
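The mechanism is plain stdlib contextvars. A stripped-down sketch of the pattern — this is the general technique, not RoomKit's actual internals:

```python
import asyncio
import contextvars

# Set once per session; inherited by every coroutine in that async context.
session_id_var = contextvars.ContextVar("session_id", default=None)
room_id_var = contextvars.ContextVar("room_id", default=None)


def span_context():
    """What any span constructor can read — no IDs in any signature."""
    return {"session_id": session_id_var.get(), "room_id": room_id_var.get()}


async def deep_in_the_stack():
    # Several calls down, with no session_id parameter threaded through.
    await asyncio.sleep(0)
    return span_context()


async def run_session(session_id, room_id):
    session_id_var.set(session_id)
    room_id_var.set(room_id)
    return await deep_in_the_stack()


async def main():
    # Two concurrent sessions: each task gets its own context copy,
    # so the IDs never bleed between sessions.
    return await asyncio.gather(
        run_session("sess-a", "room-1"),
        run_session("sess-b", "room-2"),
    )
```

The isolation comes for free: asyncio copies the current context into each task at creation, so a `set()` inside one session's task is invisible to the other.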

This means you can filter every trace in your backend by session or room and get the complete picture: every operation, every provider call, every hook execution, every delivery. One query. Full context.

FrameworkEvent: When Things Go Wrong

Spans tell you what happened during normal operation. FrameworkEvent emissions tell you what happened when things broke. RoomKit emits structured events for the failure conditions that matter in production.

Blocked events are especially important in multi-agent systems. When a BEFORE_BROADCAST hook blocks a message, the telemetry records status=BLOCKED, the name of the hook that blocked it, and the reason string. This creates an audit trail. If a customer reports that "the bot never responded," you can search for blocked events on their session and find out exactly which hook stopped the message and why.
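The audit query itself is trivial once the events are structured. A sketch with a hypothetical event shape — the field names here are illustrative, not RoomKit's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class FrameworkEvent:
    """Hypothetical event record for illustration; real fields may differ."""
    session_id: str
    status: str                     # e.g. "OK" or "BLOCKED"
    attributes: dict = field(default_factory=dict)


def blocked_in_session(events, session_id):
    """Answer 'why did the bot never respond?' for one session:
    which hook blocked each message, and for what stated reason."""
    return [
        (e.attributes.get("hook"), e.attributes.get("reason"))
        for e in events
        if e.session_id == session_id and e.status == "BLOCKED"
    ]
```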

Testing with MockTelemetryProvider

Observability is not just for production. RoomKit includes MockTelemetryProvider specifically for testing. It captures every span in memory so you can assert on them in your test suite.

from roomkit import RoomKit
from roomkit.telemetry import MockTelemetryProvider, SpanKind

async def test_llm_token_tracking():
    telemetry = MockTelemetryProvider()
    kit = RoomKit(telemetry=telemetry)

    # ... run a conversation that triggers LLM inference ...

    # Assert on captured spans
    llm_spans = telemetry.get_spans(kind=SpanKind.LLM_GENERATE)
    assert len(llm_spans) == 1
    assert llm_spans[0].attributes["llm.input_tokens"] > 0
    assert llm_spans[0].attributes["llm.output_tokens"] > 0
    assert llm_spans[0].attributes["ttfb_ms"] < 5000

    # Assert on the span hierarchy
    session_spans = telemetry.get_spans(kind=SpanKind.VOICE_SESSION)
    assert llm_spans[0].parent_id == session_spans[0].id

This is the kind of test that catches regressions before they hit production. If a refactor accidentally breaks token tracking or changes the span hierarchy, your tests fail. Observability becomes a contract, not a best-effort annotation.

What You Can Answer

With RoomKit's telemetry in place, you can answer the questions that actually come up in production multi-agent systems. Which component in the chain got slower this week — STT, the LLM, or a tool call? Which agent, which conversation, and which prompt are driving the token spend? Which hook blocked a message, and why? Where did the time go in that one 11-second request?

None of this requires custom instrumentation. You set a TelemetryProvider when you create your RoomKit instance, and every component in the framework reports its telemetry through it. Swap ConsoleTelemetryProvider for OpenTelemetryProvider when you move to staging. The traces get richer, but the code stays identical.


This article is part of a 9-part series on production-ready multi-agent architecture. Next up: Part 9: Evaluation.

Series: Introduction · Part 1: User Interaction · Part 2: Orchestration · Part 3: Knowledge · Part 4: Storage · Part 5: Agents · Part 6: Integration · Part 7: External Tools · Part 8: Observability · Part 9: Evaluation