Realtime Voice¶
Speech-to-speech AI conversations using providers like Google Gemini Live, OpenAI Realtime, and xAI Grok Realtime. Audio flows directly between the client and the AI provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.
RealtimeVoiceChannel¶

```python
RealtimeVoiceChannel(channel_id, *, provider, transport, system_prompt=None, voice=None, tools=None, temperature=None, input_sample_rate=16000, output_sample_rate=24000, transport_sample_rate=None, emit_transcription_events=True, tool_handler=None, mute_on_tool_call=False, tool_result_max_length=16384, pipeline=None, recording=None, skills=None, script_executor=None)
```
Bases: RealtimeToolsMixin, RealtimeTranscriptionMixin, RealtimeSpeechMixin, RealtimeAudioMixin, RealtimeResponseMixin, VoicePipelineMixin, Channel
Real-time voice channel using speech-to-speech AI providers.
Wraps APIs like OpenAI Realtime and Gemini Live as a first-class RoomKit channel. Audio flows directly between the user's browser and the provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.
Category is TRANSPORT so that:
- on_event() receives broadcasts (for text injection from supervisors)
- deliver() is called but returns empty (customer is on voice)
Example

```python
from roomkit.voice.realtime.mock import MockRealtimeProvider, MockRealtimeTransport

provider = MockRealtimeProvider()
transport = MockRealtimeTransport()

channel = RealtimeVoiceChannel(
    "realtime-1",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful agent.",
)
kit.register_channel(channel)
```
Initialize realtime voice channel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `channel_id` | `str` | Unique channel identifier. | *required* |
| `provider` | `RealtimeVoiceProvider` | The realtime voice provider (OpenAI, Gemini, etc.). | *required* |
| `transport` | `VoiceBackend` | The audio transport (WebSocket, etc.). | *required* |
| `system_prompt` | `str \| None` | Default system prompt for the AI. | `None` |
| `voice` | `str \| None` | Default voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any] \| Any] \| None` | Tool definitions as dicts, or `Tool` objects with attached handlers. | `None` |
| `temperature` | `float \| None` | Default sampling temperature. | `None` |
| `input_sample_rate` | `int` | Default input audio sample rate (Hz). | `16000` |
| `output_sample_rate` | `int` | Default output audio sample rate (Hz). | `24000` |
| `transport_sample_rate` | `int \| None` | Sample rate of audio from the transport (Hz). When set and different from the provider rates, enables automatic resampling. | `None` |
| `emit_transcription_events` | `bool` | If True, emit final transcriptions as RoomEvents so other channels see them. | `True` |
| `tool_handler` | `ToolHandler \| None` | Async callable to execute tool calls. Signature: `(tool_name, arguments) -> str` (a JSON string). | `None` |
| `mute_on_tool_call` | `bool` | If True, mute the transport microphone during tool execution to prevent barge-in that causes providers (e.g. Gemini) to silently drop the tool result. | `False` |
| `tool_result_max_length` | `int` | Maximum character length of tool results before truncation. Large results (e.g. SVG payloads) can overflow the provider's context window. | `16384` |
| `pipeline` | `AudioPipelineConfig \| None` | Optional audio pipeline configuration. | `None` |
| `recording` | `Any \| None` | Optional recording configuration. | `None` |
| `skills` | `SkillRegistry \| None` | Optional skill registry. | `None` |
| `script_executor` | `ScriptExecutor \| None` | Optional script executor. | `None` |
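The `tool_result_max_length` clamp can be sketched as follows. This is a hedged illustration of the documented behavior, not RoomKit's actual code; the function name and truncation marker are made up.

```python
# Illustrative sketch: oversized tool results are clamped before being
# submitted to the provider, so they cannot overflow its context window.
TOOL_RESULT_MAX_LENGTH = 16384  # the documented default

def truncate_tool_result(result: str, max_length: int = TOOL_RESULT_MAX_LENGTH) -> str:
    """Clamp a tool result string to at most max_length characters."""
    if len(result) <= max_length:
        return result
    return result[:max_length] + "...[truncated]"
```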
wait_idle async ¶
Wait until all sessions in the room are idle (not speaking).
An idle session has finished its last response and all audio has been forwarded to the transport.
set_framework ¶
Set the framework reference for event routing.
Called automatically when the channel is registered with RoomKit.
on_trace ¶
Register a trace observer and bridge to the transport.
configure ¶
Update channel defaults for future sessions.
Active sessions are not affected — use reconfigure_session
for those.
inject_text async ¶
Inject a text turn into the provider session.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active voice session. | *required* |
| `text` | `str` | Text to inject. | *required* |
| `role` | `str` | Role for the text (`'user'` or `'system'`). | `'user'` |
| `silent` | `bool` | If True, add to conversation context without requesting a response. The agent sees the text on its next turn but does not react immediately. | `False` |
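The `silent` flag's semantics can be mimicked with a small self-contained sketch (illustrative only, not RoomKit internals): injected text always joins the conversation context, but a response is requested only when `silent` is False.

```python
# Sketch of silent injection: context always grows; responses are only
# requested for non-silent injections.
context: list[tuple[str, str]] = []
responses_requested = 0

def inject_text_sketch(text: str, role: str = "user", silent: bool = False) -> None:
    global responses_requested
    context.append((role, text))
    if not silent:
        responses_requested += 1

inject_text_sketch("Offer 20% discount", role="system", silent=True)  # context only
inject_text_sketch("Hello?")  # also requests an immediate response
```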
start_session async ¶
Start a new realtime voice session.
Connects both the transport (client audio) and the provider (AI service), then fires a framework event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `room_id` | `str` | The room to join. | *required* |
| `participant_id` | `str` | The participant's ID. | *required* |
| `connection` | `Any` | Protocol-specific connection (e.g. WebSocket). | *required* |
| `metadata` | `dict[str, Any] \| None` | Optional session metadata. May include overrides for system_prompt, voice, tools, temperature. | `None` |

Returns:

| Type | Description |
|---|---|
| `VoiceSession` | The created VoiceSession. |
end_session async ¶
End a realtime voice session.
Disconnects both provider and transport, fires framework event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to end. | *required* |
reconfigure_session async ¶

```python
reconfigure_session(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)
```
Reconfigure an active session with new agent parameters.
Used during agent handoff to switch the AI personality, voice, and tools. Providers with session resumption (e.g. Gemini Live) preserve conversation history across the reconfiguration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session to reconfigure. | *required* |
| `system_prompt` | `str \| None` | New system instructions for the AI. | `None` |
| `voice` | `str \| None` | New voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | New tool/function definitions. | `None` |
| `temperature` | `float \| None` | New sampling temperature. | `None` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration overrides. | `None` |
connect_session async ¶
Accept a realtime voice session via process_inbound.
Delegates to `start_session`, which handles provider/transport connection, resampling, and framework events.
disconnect_session async ¶
Clean up realtime sessions on remote disconnect.
update_binding ¶
Update cached bindings for all sessions in a room.
Called by the framework after mute/unmute/set_access so the
audio gate in _pipeline_on_audio_received (pipeline path)
or _forward_client_audio (direct path) sees the new state.
handle_inbound async ¶
Not used directly — audio flows via start_session.
on_event async ¶
React to events from other channels — TEXT INJECTION.
When a supervisor or other channel sends a message, extract the text and inject it into the provider session so the AI incorporates it. Skips events from this channel (self-loop prevention).
deliver async ¶
No-op delivery — customer is on voice, can't see text.
ToolHandler¶
Async callback for executing tool/function calls from the AI provider. Receives (tool_name, arguments) and returns a JSON string.
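A ToolHandler-shaped callable can be sketched as below. The `get_weather` tool and its return values are illustrative only; the shape that matters is async, `(tool_name, arguments)` in, JSON string out.

```python
import asyncio
import json

# Illustrative ToolHandler: receives (tool_name, arguments), returns a
# JSON-serialized result string.
async def handle_tool(tool_name: str, arguments: dict) -> str:
    if tool_name == "get_weather":
        return json.dumps({"city": arguments["city"], "temp_c": 21})
    return json.dumps({"error": f"unknown tool: {tool_name}"})

result = asyncio.run(handle_tool("get_weather", {"city": "NYC"}))
```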
Provider ABC¶
RealtimeVoiceProvider ¶
Bases: ABC
Abstract base class for speech-to-speech AI providers.
Wraps APIs like OpenAI Realtime and Gemini Live that handle audio-in → audio-out with built-in AI, VAD, and transcription.
The provider manages a bidirectional audio/event stream with the AI service. Callbacks are registered for events the provider emits.
Example
```python
provider = OpenAIRealtimeProvider(api_key="sk-...", model="gpt-realtime-1.5")

provider.on_audio(handle_audio)
provider.on_transcription(handle_transcription)
provider.on_tool_call(handle_tool_call)

await provider.connect(session, system_prompt="You are a helpful agent.")
await provider.send_audio(session, audio_bytes)
await provider.disconnect(session)
```
connect abstractmethod async ¶

```python
connect(session, *, system_prompt=None, voice=None, tools=None, temperature=None, input_sample_rate=16000, output_sample_rate=24000, server_vad=True, provider_config=None)
```
Connect a session to the provider's AI service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The realtime session to connect. | *required* |
| `system_prompt` | `str \| None` | System instructions for the AI. | `None` |
| `voice` | `str \| None` | Voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | Tool/function definitions the AI can call. | `None` |
| `temperature` | `float \| None` | Sampling temperature. | `None` |
| `input_sample_rate` | `int` | Sample rate of input audio (Hz). | `16000` |
| `output_sample_rate` | `int` | Desired sample rate for output audio (Hz). | `24000` |
| `server_vad` | `bool` | Whether to use server-side voice activity detection. | `True` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration options. Each provider documents which keys it accepts. | `None` |
send_audio abstractmethod async ¶
Send audio data to the provider for processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `audio` | `bytes` | Raw PCM audio bytes. | *required* |
inject_text abstractmethod async ¶
Inject text into the conversation (e.g. supervisor guidance).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `text` | `str` | Text to inject. | *required* |
| `role` | `str` | Role for the injected text (`'user'` or `'system'`). | `'user'` |
| `silent` | `bool` | If True, add to conversation context without requesting a response. The agent sees the text on its next turn but does not react immediately. | `False` |
submit_tool_result abstractmethod async ¶
Submit a tool call result back to the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `call_id` | `str` | The tool call ID from the on_tool_call callback. | *required* |
| `result` | `str` | JSON-serialized result string. | *required* |
interrupt abstractmethod async ¶
Interrupt the current AI response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
disconnect abstractmethod async ¶
Disconnect a session from the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to disconnect. | *required* |
reconfigure async ¶

```python
reconfigure(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)
```
Reconfigure a session with new parameters.
Used during agent handoff to switch the AI personality, voice, and tools. The default implementation disconnects and reconnects; providers with session resumption (e.g. Gemini Live) should override to preserve conversation context.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session to reconfigure. | *required* |
| `system_prompt` | `str \| None` | New system instructions. | `None` |
| `voice` | `str \| None` | New voice ID. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | New tool/function definitions. | `None` |
| `temperature` | `float \| None` | New sampling temperature. | `None` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration overrides. | `None` |
send_event async ¶
Send a raw provider-specific event to the underlying service.
This is an escape hatch for sending protocol-level messages that
are not covered by the standard provider API (e.g. OpenAI's
session.update or input_audio_buffer.commit).
The default implementation raises `NotImplementedError`. Providers that support raw events should override this.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `event` | `dict[str, Any]` | A JSON-serializable dict that will be sent verbatim to the provider's underlying connection. | *required* |
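A raw event might look like the following. The `session.update` message type is named in the text above; the `"instructions"` field is illustrative, and the exact payload shape depends on OpenAI's Realtime protocol version.

```python
# Hypothetical raw event for OpenAI's Realtime protocol, sent verbatim
# through the escape hatch described above.
event = {"type": "session.update", "session": {"instructions": "Answer briefly."}}
# await provider.send_event(session, event)  # hypothetical usage
```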
is_responding ¶
Check if the provider is actively generating a response.
Returns True between `response.created` and `response.done`.
on_audio ¶
Register callback for audio output from the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeAudioCallback` | Called with (session, audio_bytes) when the provider produces audio output. | *required* |
on_transcription ¶
Register callback for transcription events.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeTranscriptionCallback` | Called with (session, text, role, is_final). | *required* |
on_speech_start ¶
Register callback for speech start detection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeSpeechStartCallback` | Called with (session) when user speech is detected. | *required* |
on_speech_end ¶
Register callback for speech end detection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeSpeechEndCallback` | Called with (session) when user speech ends. | *required* |
on_tool_call ¶
Register callback for tool/function calls from the AI.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeToolCallCallback` | Called with (session, call_id, name, arguments). | *required* |
on_response_start ¶
Register callback for when the AI starts generating a response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeResponseStartCallback` | Called with (session). | *required* |
on_response_end ¶
Register callback for when the AI finishes a response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeResponseEndCallback` | Called with (session). | *required* |
on_error ¶
Register callback for provider errors.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeErrorCallback` | Called with (session, code, message). | *required* |
send_activity_start async ¶
Signal that user speech activity has started.
Used in manual VAD mode: local VAD detects speech and the channel
calls this to inform the provider. The provider translates this
into its protocol's activity signal (e.g. Gemini ActivityStart).
Default: no-op. Override when the provider supports manual mode.
send_activity_end async ¶
Signal that user speech activity has ended.
Used in manual VAD mode: local VAD detects silence and the channel
calls this to inform the provider. The provider translates this
into its protocol's activity signal (e.g. Gemini ActivityEnd).
Default: no-op. Override when the provider supports manual mode.
Transport ABC¶
All realtime transports extend VoiceBackend.
Session & State¶
See VoiceSession and VoiceSessionState.
Events¶
RealtimeTranscriptionEvent dataclass ¶

```python
RealtimeTranscriptionEvent(session, text, role, is_final, was_barge_in=False, item_id=None, timestamp=_utcnow())
```
Transcription produced by the realtime provider.
Fired through ON_TRANSCRIPTION hooks. For final transcriptions, the channel emits a RoomEvent so other channels see the conversation.
was_barge_in class-attribute instance-attribute ¶
True if this transcription resulted from a barge-in (user interrupted the AI while it was speaking).
item_id class-attribute instance-attribute ¶
Provider-specific item ID for correlation.
timestamp class-attribute instance-attribute ¶
When the transcription was received.
RealtimeSpeechEvent dataclass ¶
Speech activity detected by the realtime provider's server-side VAD.
RealtimeToolCallEvent dataclass ¶
Deprecated — use `roomkit.models.tool_call.ToolCallEvent` instead.
Kept for backward compatibility. The unified ON_TOOL_CALL hook fires `ToolCallEvent` from both AIChannel and RealtimeVoiceChannel.
timestamp class-attribute instance-attribute ¶
When the tool call was received.
RealtimeErrorEvent dataclass ¶
Error from the realtime provider.
timestamp class-attribute instance-attribute ¶
When the error occurred.
Callback Types¶
| Callback | Signature |
|---|---|
| `RealtimeAudioCallback` | `(RealtimeSession, bytes) -> Any` |
| `RealtimeTranscriptionCallback` | `(RealtimeSession, str, str, bool) -> Any` |
| `RealtimeSpeechStartCallback` | `(RealtimeSession) -> Any` |
| `RealtimeSpeechEndCallback` | `(RealtimeSession) -> Any` |
| `RealtimeToolCallCallback` | `(RealtimeSession, str, str, dict) -> Any` |
| `RealtimeResponseStartCallback` | `(RealtimeSession) -> Any` |
| `RealtimeResponseEndCallback` | `(RealtimeSession) -> Any` |
| `RealtimeErrorCallback` | `(RealtimeSession, str, str) -> Any` |
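The registration pattern behind these callback types can be sketched with a minimal stand-in class (illustrative, not a RoomKit class): the provider stores one callback per event kind and awaits it when that event arrives.

```python
import asyncio

# Minimal stand-in demonstrating the on_* registration pattern.
class CallbackDemo:
    def __init__(self) -> None:
        self._on_transcription = None

    def on_transcription(self, callback) -> None:
        self._on_transcription = callback

    async def _fire_transcription(self, session, text, role, is_final) -> None:
        # The real provider would call this when its service emits an event.
        if self._on_transcription is not None:
            await self._on_transcription(session, text, role, is_final)

received: list[tuple] = []

async def record(session, text, role, is_final):
    received.append((text, role, is_final))

demo = CallbackDemo()
demo.on_transcription(record)
asyncio.run(demo._fire_transcription(None, "hello", "user", True))
```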
Concrete Providers¶
Gemini Live¶
```python
from roomkit.providers.gemini.realtime import GeminiLiveProvider

provider = GeminiLiveProvider(
    api_key="...",
    model="gemini-2.5-flash-native-audio-preview-12-2025",
)
```

Install with: `pip install roomkit[realtime-gemini]`
Supports provider_config metadata keys: language, top_p, top_k, max_output_tokens, seed, enable_affective_dialog, thinking_budget, proactive_audio, start_of_speech_sensitivity, end_of_speech_sensitivity, silence_duration_ms, prefix_padding_ms, no_interruption.
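A `provider_config` override could be passed through session metadata like this. The keys come from the list above; the values shown are illustrative, not recommended defaults.

```python
# Hypothetical session metadata carrying Gemini-specific overrides.
metadata = {
    "provider_config": {
        "language": "en-US",
        "silence_duration_ms": 500,
        "enable_affective_dialog": True,
    }
}
# session = await channel.start_session(room_id, participant_id, ws, metadata=metadata)
```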
OpenAI Realtime¶
```python
from roomkit.providers.openai.realtime import OpenAIRealtimeProvider

provider = OpenAIRealtimeProvider(
    api_key="...",
    model="gpt-4o-realtime-preview",
)
```

Install with: `pip install roomkit[realtime-openai]`
xAI Grok Realtime¶
```python
from roomkit.providers.xai.config import XAIRealtimeConfig
from roomkit.providers.xai.realtime import XAIRealtimeProvider

provider = XAIRealtimeProvider(
    XAIRealtimeConfig(
        api_key="...",
        model="grok-3-fast",
        voice="eve",
        transcription_model="grok-2-audio",
    )
)
```

Install with: `pip install websockets`
Supports provider_config metadata keys: turn_detection_type, threshold, silence_duration_ms, prefix_padding_ms, transcription_model, input_audio_format, output_audio_format, modalities.
Native tools: web_search, x_search (passed as {"type": "web_search"}).
Available voices: eve, ara, rex, sal, leo.
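Per the note above, native tools are passed by type rather than as function definitions:

```python
# The two documented native xAI tools, passed by type.
tools = [
    {"type": "web_search"},
    {"type": "x_search"},
]
```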
Transports¶
WebSocket Transport¶
```python
from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

transport = WebSocketRealtimeTransport()
```

Handles bidirectional audio over WebSocket. The client sends `{"type": "audio", "data": "<base64 PCM>"}`; the server sends audio, transcriptions, and speaking indicators as JSON messages.
Install with: pip install roomkit[websocket]
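The client-side message shape described above can be built like this (a sketch of the documented wire format, not RoomKit code):

```python
import base64
import json

# Wrap raw PCM16 bytes in the documented JSON envelope.
def encode_audio_message(pcm: bytes) -> str:
    return json.dumps({"type": "audio", "data": base64.b64encode(pcm).decode("ascii")})

msg = encode_audio_message(b"\x00\x01\x02\x03")
```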
FastRTC Transport (WebRTC)¶
```python
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,   # Mic input sample rate
    output_sample_rate=24000,  # Provider output sample rate
)

# Mount WebRTC endpoints on FastAPI app
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")
```
WebRTC-based transport using FastRTC in passthrough mode (no VAD). Audio flows bidirectionally between the browser and the speech-to-speech AI provider, which handles its own server-side VAD.
Unlike the FastRTCVoiceBackend (which uses ReplyOnPause for VAD in the traditional VoiceChannel pipeline), this transport simply passes audio through — ideal for speech-to-speech providers like OpenAI Realtime or Gemini Live.
Install with: pip install roomkit[fastrtc]
Connection flow:

1. `mount_fastrtc_realtime(app, transport)` creates a FastRTC `Stream` with a passthrough handler
2. Browser connects via WebRTC → FastRTC calls `handler.copy()` → `start_up()`
3. `start_up()` reads `webrtc_id` from FastRTC context, registers with transport
4. Transport fires `on_client_connected` callback (if set) — use to auto-create sessions
5. App calls `channel.start_session(room_id, participant_id, connection=webrtc_id)`
6. `start_session()` → `transport.accept(session, webrtc_id)` maps session to handler
7. `receive()`/`emit()` now flow audio with session context
Audio format conversion:

FastRTC works with `(sample_rate, numpy.ndarray[int16])` tuples. The transport ABC uses raw bytes (PCM16 LE). Conversion is automatic:

- Inbound: `audio_array.tobytes()` → fire callbacks with bytes
- Outbound: `np.frombuffer(audio_bytes, dtype=np.int16)` → return as `(rate, array)`
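The two conversions written out as standalone helpers (function names are illustrative; the conversions themselves match the bullets above):

```python
import numpy as np

# Inbound: FastRTC frame (rate, int16 array) -> raw PCM16 LE bytes.
def inbound_to_bytes(frame) -> bytes:
    _rate, samples = frame
    return samples.astype("<i2").tobytes()

# Outbound: raw PCM16 LE bytes -> FastRTC frame (rate, int16 array).
def outbound_to_frame(audio_bytes: bytes, rate: int = 24000):
    return rate, np.frombuffer(audio_bytes, dtype="<i2")

rate, arr = outbound_to_frame(inbound_to_bytes((24000, np.array([1, -2, 3], dtype=np.int16))))
```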
DataChannel messages:
JSON messages (transcriptions, speaking indicators) are sent via the WebRTC DataChannel using send_message().
| Transport | Protocol | VAD | Dependency |
|---|---|---|---|
| `WebSocketRealtimeTransport` | WebSocket | Provider-side | `roomkit[websocket]` |
| `FastRTCRealtimeTransport` | WebRTC | Provider-side | `roomkit[fastrtc]` |
Lazy-loaded via roomkit.voice.get_fastrtc_realtime_transport() and roomkit.voice.get_mount_fastrtc_realtime() to avoid requiring fastrtc/numpy at import time.
Mock Classes¶
MockRealtimeProvider ¶
Bases: RealtimeVoiceProvider
Mock realtime voice provider for testing.
Tracks all method calls and provides helpers to simulate provider events (transcriptions, audio, tool calls, etc.).
Example

```python
provider = MockRealtimeProvider()

# Track calls
await provider.connect(session, system_prompt="Hello")
assert provider.calls[-1].method == "connect"

# Simulate events
await provider.simulate_transcription(session, "Hi there", "user", True)
await provider.simulate_audio(session, b"audio-data")
await provider.simulate_tool_call(session, "call-1", "get_weather", {"city": "NYC"})
```
MockRealtimeTransport ¶
Bases: VoiceBackend
Mock realtime audio transport for testing.
Tracks all method calls and provides helpers to simulate client events (audio, disconnection).
Example

```python
transport = MockRealtimeTransport()

await transport.accept(session, "fake-ws")
assert transport.calls[-1].method == "accept"

# Simulate client audio
await transport.simulate_client_audio(session, b"audio-data")
```
Usage Example¶
```python
from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

# Configure provider and transport
provider = GeminiLiveProvider(api_key="...")
transport = WebSocketRealtimeTransport()

# Create and register the channel
channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Create room and attach
room = await kit.create_room(room_id="voice-room")
await kit.attach_channel("voice-room", "realtime-voice")

# When a WebSocket connection arrives:
session = await channel.start_session(
    "voice-room", "participant-1", websocket
)

# Transcriptions are emitted as RoomEvents —
# other channels in the room see the conversation.
# Text from other channels is injected into the AI session:
# supervisor types "Offer 20% discount" via WebSocket →
# AI incorporates it into its next spoken response.
```
With Tool Calling¶
```python
import json
from roomkit import Tool

async def lookup_order(order_id: str) -> str:
    return json.dumps({"status": "shipped", "eta": "Tomorrow"})

order_tool = Tool(
    name="lookup_order",
    description="Look up an order by ID",
    parameters={
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
    handler=lookup_order,
)

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful assistant with access to tools.",
    tools=[order_tool],
)
```
With MCP (Advanced)¶
For MCP integration, use tool_handler to route calls to the MCP server:
```python
import json
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Connect to MCP server
read, write, _ = await streamablehttp_client("http://localhost:9998/mcp")
mcp = ClientSession(read, write)
await mcp.initialize()

# Discover tools
tools_result = await mcp.list_tools()
tools = [
    {
        "name": t.name,
        "description": t.description or t.name,
        "parameters": t.inputSchema or {"type": "object", "properties": {}},
    }
    for t in tools_result.tools
]

# Tool handler routes calls to MCP
async def handle_tool(name, arguments):
    result = await mcp.call_tool(name, arguments)
    text = "\n".join(b.text for b in result.content if hasattr(b, "text"))
    return json.dumps({"result": text})

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful assistant with access to tools.",
    tools=tools,
    tool_handler=handle_tool,
)
```
With Delegation¶
Wire the delegate_task tool into a voice agent:
```python
from roomkit.tasks import DelegateHandler, setup_realtime_delegation, build_delegate_tool

handler = DelegateHandler(kit, notify="realtime-voice")
tool = build_delegate_tool([("exec-agent", "Runs background tasks")])
setup_realtime_delegation(channel, handler, tool=tool)
```
See setup_realtime_delegation() for details.
With Vision Injection¶
Wire screen/camera vision into the voice session:
```python
from roomkit import setup_realtime_vision

setup_realtime_vision(kit, room_id="room-1", voice_channel_id="realtime-voice")
```
Vision descriptions are injected via inject_text(silent=True) — the agent sees the context on its next turn without interrupting the conversation. Unchanged descriptions are automatically deduplicated.
See setup_realtime_vision() for details.
With FastRTC (WebRTC)¶
```python
from fastapi import FastAPI
from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

app = FastAPI()

provider = GeminiLiveProvider(api_key="...")
transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,
    output_sample_rate=24000,
)

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Mount WebRTC endpoints
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")

# Auto-create sessions when WebRTC clients connect
async def on_connected(webrtc_id: str):
    room = await kit.create_room()
    await kit.attach_channel(room.id, "realtime-voice")
    await channel.start_session(room.id, "user-1", connection=webrtc_id)

transport.on_client_connected(on_connected)
```