
Realtime Voice

Speech-to-speech AI conversations using providers like Google Gemini Live and OpenAI Realtime. Audio flows directly between the client and the AI provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.

RealtimeVoiceChannel

RealtimeVoiceChannel

RealtimeVoiceChannel(channel_id, *, provider, transport, system_prompt=None, voice=None, tools=None, temperature=None, input_sample_rate=16000, output_sample_rate=24000, transport_sample_rate=None, emit_transcription_events=True, tool_handler=None, mute_on_tool_call=False, tool_result_max_length=16384)

Bases: Channel

Real-time voice channel using speech-to-speech AI providers.

Wraps APIs like OpenAI Realtime and Gemini Live as a first-class RoomKit channel. Audio flows directly between the user's browser and the provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.

Category is TRANSPORT so that:

- on_event() receives broadcasts (for text injection from supervisors)
- deliver() is called but returns empty (customer is on voice)

Example

from roomkit.voice.realtime.mock import MockRealtimeProvider, MockRealtimeTransport

provider = MockRealtimeProvider()
transport = MockRealtimeTransport()

channel = RealtimeVoiceChannel(
    "realtime-1",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful agent.",
)
kit.register_channel(channel)

Initialize realtime voice channel.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| channel_id | str | Unique channel identifier. | required |
| provider | RealtimeVoiceProvider | The realtime voice provider (OpenAI, Gemini, etc.). | required |
| transport | VoiceBackend | The audio transport (WebSocket, etc.). | required |
| system_prompt | str \| None | Default system prompt for the AI. | None |
| voice | str \| None | Default voice ID for audio output. | None |
| tools | list[dict[str, Any]] \| None | Default tool/function definitions. | None |
| temperature | float \| None | Default sampling temperature. | None |
| input_sample_rate | int | Default input audio sample rate (Hz). | 16000 |
| output_sample_rate | int | Default output audio sample rate (Hz). | 24000 |
| transport_sample_rate | int \| None | Sample rate of audio from the transport (Hz). When set and different from provider rates, enables automatic resampling. When None (default), no resampling is performed (backwards compatible with WebSocket transports). | None |
| emit_transcription_events | bool | If True, emit final transcriptions as RoomEvents so other channels see them. | True |
| tool_handler | ToolHandler \| None | Async callable to execute tool calls. Signature: async (session, name, arguments) -> result. Return a dict or JSON string. If not set, falls back to ON_REALTIME_TOOL_CALL hooks. | None |
| mute_on_tool_call | bool | If True, mute the transport microphone during tool execution to prevent barge-in that causes providers (e.g. Gemini) to silently drop the tool result. Use set_access() for fine-grained control. | False |
| tool_result_max_length | int | Maximum character length of tool results before truncation. Large results (e.g. SVG payloads) can overflow the provider's context window. | 16384 |

provider property

provider

The underlying realtime voice provider.

session_rooms property

session_rooms

Mapping of session_id to room_id.

tool_handler property writable

tool_handler

The current tool handler for realtime tool calls.

get_room_sessions

get_room_sessions(room_id)

Get all active sessions for a room.

set_framework

set_framework(framework)

Set the framework reference for event routing.

Called automatically when the channel is registered with RoomKit.

on_trace

on_trace(callback, *, protocols=None)

Register a trace observer and bridge to the transport.

resolve_trace_room

resolve_trace_room(session_id)

Resolve room_id from realtime session mappings.

inject_text async

inject_text(session, text, *, role='user')

Inject a text turn into the provider session.

Useful for nudging the provider when its server-side VAD stalls (e.g. Gemini ignoring valid speech after turn_complete).

start_session async

start_session(room_id, participant_id, connection, *, metadata=None)

Start a new realtime voice session.

Connects both the transport (client audio) and the provider (AI service), then fires a framework event.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| room_id | str | The room to join. | required |
| participant_id | str | The participant's ID. | required |
| connection | Any | Protocol-specific connection (e.g. WebSocket). | required |
| metadata | dict[str, Any] \| None | Optional session metadata. May include overrides for system_prompt, voice, tools, temperature. | None |

Returns:

| Type | Description |
|------|-------------|
| VoiceSession | The created VoiceSession. |

end_session async

end_session(session)

End a realtime voice session.

Disconnects both provider and transport, fires framework event.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The session to end. | required |

reconfigure_session async

reconfigure_session(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)

Reconfigure an active session with new agent parameters.

Used during agent handoff to switch the AI personality, voice, and tools. Providers with session resumption (e.g. Gemini Live) preserve conversation history across the reconfiguration.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The active session to reconfigure. | required |
| system_prompt | str \| None | New system instructions for the AI. | None |
| voice | str \| None | New voice ID for audio output. | None |
| tools | list[dict[str, Any]] \| None | New tool/function definitions. | None |
| temperature | float \| None | New sampling temperature. | None |
| provider_config | dict[str, Any] \| None | Provider-specific configuration overrides. | None |

connect_session async

connect_session(session, room_id, binding)

Accept a realtime voice session via process_inbound.

Delegates to start_session, which handles provider/transport connection, resampling, and framework events.

disconnect_session async

disconnect_session(session, room_id)

Clean up realtime sessions on remote disconnect.

update_binding

update_binding(room_id, binding)

Update cached bindings for all sessions in a room.

Called by the framework after mute/unmute/set_access so the audio gate in _forward_client_audio sees the new state.

handle_inbound async

handle_inbound(message, context)

Not used directly — audio flows via start_session.

on_event async

on_event(event, binding, context)

React to events from other channels — TEXT INJECTION.

When a supervisor or other channel sends a message, extract the text and inject it into the provider session so the AI incorporates it. Skips events from this channel (self-loop prevention).

deliver async

deliver(event, binding, context)

No-op delivery — customer is on voice, can't see text.

close async

close()

End all sessions and close provider + transport.

ToolHandler

ToolHandler = Callable[
    [RealtimeSession, str, dict[str, Any]],
    Awaitable[dict[str, Any] | str],
]

Async callback for executing tool/function calls from the AI provider. Receives (session, tool_name, arguments) and returns a result dict or JSON string.
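A handler matching this signature is a plain async callable. The sketch below is illustrative: the tool name, argument shape, and return payload are assumptions, and the session parameter is left untyped since no real session is needed to show the contract.

```python
import asyncio
from typing import Any

# Hypothetical handler: routes by tool name and returns a result dict.
async def handle_tool(session: Any, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    if name == "get_order_status":
        # In real code this would query a backend service.
        return {"order_id": arguments["order_id"], "status": "shipped"}
    return {"error": f"unknown tool: {name}"}

result = asyncio.run(handle_tool(None, "get_order_status", {"order_id": "A-42"}))
```

The same callable can be passed as the tool_handler argument to RealtimeVoiceChannel.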

Provider ABC

RealtimeVoiceProvider

Bases: ABC

Abstract base class for speech-to-speech AI providers.

Wraps APIs like OpenAI Realtime and Gemini Live that handle audio-in → audio-out with built-in AI, VAD, and transcription.

The provider manages a bidirectional audio/event stream with the AI service. Callbacks are registered for events the provider emits.

Example

provider = OpenAIRealtimeProvider(api_key="sk-...", model="gpt-4o-realtime")

provider.on_audio(handle_audio)
provider.on_transcription(handle_transcription)
provider.on_tool_call(handle_tool_call)

await provider.connect(session, system_prompt="You are a helpful agent.")
await provider.send_audio(session, audio_bytes)
await provider.disconnect(session)

name abstractmethod property

name

Provider name (e.g. 'openai_realtime', 'gemini_live').

connect abstractmethod async

connect(session, *, system_prompt=None, voice=None, tools=None, temperature=None, input_sample_rate=16000, output_sample_rate=24000, server_vad=True, provider_config=None)

Connect a session to the provider's AI service.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The realtime session to connect. | required |
| system_prompt | str \| None | System instructions for the AI. | None |
| voice | str \| None | Voice ID for audio output. | None |
| tools | list[dict[str, Any]] \| None | Tool/function definitions the AI can call. | None |
| temperature | float \| None | Sampling temperature. | None |
| input_sample_rate | int | Sample rate of input audio (Hz). | 16000 |
| output_sample_rate | int | Desired sample rate for output audio (Hz). | 24000 |
| server_vad | bool | Whether to use server-side voice activity detection. | True |
| provider_config | dict[str, Any] \| None | Provider-specific configuration options. Each provider documents which keys it accepts. | None |

send_audio abstractmethod async

send_audio(session, audio)

Send audio data to the provider for processing.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The active session. | required |
| audio | bytes | Raw PCM audio bytes. | required |

inject_text abstractmethod async

inject_text(session, text, *, role='user')

Inject text into the conversation (e.g. supervisor guidance).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The active session. | required |
| text | str | Text to inject. | required |
| role | str | Role for the injected text ('user' or 'system'). | 'user' |

submit_tool_result abstractmethod async

submit_tool_result(session, call_id, result)

Submit a tool call result back to the provider.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The active session. | required |
| call_id | str | The tool call ID from the on_tool_call callback. | required |
| result | str | JSON-serialized result string. | required |

interrupt abstractmethod async

interrupt(session)

Interrupt the current AI response.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The active session. | required |

disconnect abstractmethod async

disconnect(session)

Disconnect a session from the provider.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The session to disconnect. | required |

reconfigure async

reconfigure(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)

Reconfigure a session with new parameters.

Used during agent handoff to switch the AI personality, voice, and tools. The default implementation disconnects and reconnects; providers with session resumption (e.g. Gemini Live) should override to preserve conversation context.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The active session to reconfigure. | required |
| system_prompt | str \| None | New system instructions. | None |
| voice | str \| None | New voice ID. | None |
| tools | list[dict[str, Any]] \| None | New tool/function definitions. | None |
| temperature | float \| None | New sampling temperature. | None |
| provider_config | dict[str, Any] \| None | Provider-specific configuration overrides. | None |

send_event async

send_event(session, event)

Send a raw provider-specific event to the underlying service.

This is an escape hatch for sending protocol-level messages that are not covered by the standard provider API (e.g. OpenAI's session.update or input_audio_buffer.commit).

The default implementation raises NotImplementedError. Providers that support raw events should override this.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| session | VoiceSession | The active session. | required |
| event | dict[str, Any] | A JSON-serializable dict that will be sent verbatim to the provider's underlying connection. | required |

close async

close()

Release all provider resources.

on_audio

on_audio(callback)

Register callback for audio output from the provider.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeAudioCallback | Called with (session, audio_bytes) when the provider produces audio output. | required |

on_transcription

on_transcription(callback)

Register callback for transcription events.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeTranscriptionCallback | Called with (session, text, role, is_final). | required |

on_speech_start

on_speech_start(callback)

Register callback for speech start detection.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeSpeechStartCallback | Called with (session) when user speech is detected. | required |

on_speech_end

on_speech_end(callback)

Register callback for speech end detection.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeSpeechEndCallback | Called with (session) when user speech ends. | required |

on_tool_call

on_tool_call(callback)

Register callback for tool/function calls from the AI.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeToolCallCallback | Called with (session, call_id, name, arguments). | required |

on_response_start

on_response_start(callback)

Register callback for when the AI starts generating a response.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeResponseStartCallback | Called with (session). | required |

on_response_end

on_response_end(callback)

Register callback for when the AI finishes a response.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeResponseEndCallback | Called with (session). | required |

on_error

on_error(callback)

Register callback for provider errors.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| callback | RealtimeErrorCallback | Called with (session, code, message). | required |

Transport ABC

All realtime transports extend VoiceBackend.

Session & State

See VoiceSession and VoiceSessionState.

Events

RealtimeTranscriptionEvent dataclass

RealtimeTranscriptionEvent(session, text, role, is_final, item_id=None, timestamp=_utcnow())

Transcription produced by the realtime provider.

Fired through ON_TRANSCRIPTION hooks. For final transcriptions, the channel emits a RoomEvent so other channels see the conversation.

session instance-attribute

session

The realtime session that produced this transcription.

text instance-attribute

text

The transcribed text.

role instance-attribute

role

Who spoke: 'user' (input) or 'assistant' (AI output).

is_final instance-attribute

is_final

True if this is a final transcription (not interim).

item_id class-attribute instance-attribute

item_id = None

Provider-specific item ID for correlation.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the transcription was received.

RealtimeSpeechEvent dataclass

RealtimeSpeechEvent(session, type, timestamp=_utcnow())

Speech activity detected by the realtime provider's server-side VAD.

session instance-attribute

session

The realtime session where speech activity changed.

type instance-attribute

type

Whether speech started or ended.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the speech event was detected.

RealtimeToolCallEvent dataclass

RealtimeToolCallEvent(session, tool_call_id, name, arguments, timestamp=_utcnow())

The realtime provider is requesting a function call.

Fired through ON_REALTIME_TOOL_CALL hooks (sync). The hook must return a result that gets submitted back to the provider.

session instance-attribute

session

The realtime session requesting the tool call.

tool_call_id instance-attribute

tool_call_id

Provider-assigned ID for this tool call.

name instance-attribute

name

The function name being called.

arguments instance-attribute

arguments

Parsed arguments for the function call.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the tool call was received.

RealtimeErrorEvent dataclass

RealtimeErrorEvent(session, code, message, timestamp=_utcnow())

Error from the realtime provider.

session instance-attribute

session

The realtime session that encountered the error.

code instance-attribute

code

Error code from the provider.

message instance-attribute

message

Human-readable error description.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the error occurred.

Callback Types

| Callback | Signature |
|----------|-----------|
| RealtimeAudioCallback | (RealtimeSession, bytes) -> Any |
| RealtimeTranscriptionCallback | (RealtimeSession, str, str, bool) -> Any |
| RealtimeSpeechStartCallback | (RealtimeSession) -> Any |
| RealtimeSpeechEndCallback | (RealtimeSession) -> Any |
| RealtimeToolCallCallback | (RealtimeSession, str, str, dict) -> Any |
| RealtimeResponseStartCallback | (RealtimeSession) -> Any |
| RealtimeResponseEndCallback | (RealtimeSession) -> Any |
| RealtimeErrorCallback | (RealtimeSession, str, str) -> Any |
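Callbacks with these signatures are plain async callables. A minimal sketch, using a stand-in session value and illustrative payloads:

```python
import asyncio

transcripts = []
errors = []

# Matches RealtimeTranscriptionCallback: (session, text, role, is_final)
async def on_transcription(session, text, role, is_final):
    if is_final:
        transcripts.append((text, role))

# Matches RealtimeErrorCallback: (session, code, message)
async def on_error(session, code, message):
    errors.append((code, message))

asyncio.run(on_transcription(None, "hello", "user", True))
asyncio.run(on_error(None, "rate_limit", "too many requests"))
```

In real code these would be registered via provider.on_transcription(...) and provider.on_error(...), and the session argument would be a RealtimeSession.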

Concrete Providers

Gemini Live

from roomkit.providers.gemini.realtime import GeminiLiveProvider

provider = GeminiLiveProvider(
    api_key="...",
    model="gemini-2.5-flash-native-audio-preview-12-2025",
)

Install with: pip install roomkit[realtime-gemini]

Supports provider_config metadata keys: language, top_p, top_k, max_output_tokens, seed, enable_affective_dialog, thinking_budget, proactive_audio, start_of_speech_sensitivity, end_of_speech_sensitivity, silence_duration_ms, prefix_padding_ms, no_interruption.
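As a sketch, a provider_config using a few of these keys might look like the following; the values shown are assumptions for illustration, not documented defaults:

```python
# Illustrative provider_config for Gemini Live.
# Keys are from the supported list above; values are assumptions.
provider_config = {
    "language": "en-US",           # response language (BCP-47 code assumed)
    "top_p": 0.95,                 # nucleus sampling
    "max_output_tokens": 1024,     # cap spoken response length
    "silence_duration_ms": 800,    # server-side VAD silence threshold
}
```

The dict can be passed as the provider_config argument to connect() or reconfigure_session().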

OpenAI Realtime

from roomkit.providers.openai.realtime import OpenAIRealtimeProvider

provider = OpenAIRealtimeProvider(
    api_key="...",
    model="gpt-4o-realtime-preview",
)

Install with: pip install roomkit[realtime-openai]

Transports

WebSocket Transport

from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

transport = WebSocketRealtimeTransport()

Handles bidirectional audio over WebSocket. Client sends {"type": "audio", "data": "<base64 PCM>"}, server sends audio, transcriptions, and speaking indicators as JSON messages.
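The client-side framing described above can be sketched with the standard library; the PCM payload here is illustrative silence:

```python
import base64
import json

# Client side: wrap raw PCM16 in the {"type": "audio", "data": ...} envelope
pcm = b"\x00\x00" * 320  # 20 ms of silence at 16 kHz mono PCM16 (illustrative)
message = json.dumps({"type": "audio", "data": base64.b64encode(pcm).decode("ascii")})

# Server side: decode the envelope back to raw PCM bytes
decoded = json.loads(message)
audio = base64.b64decode(decoded["data"])
```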

Install with: pip install roomkit[websocket]

FastRTC Transport (WebRTC)

from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,   # Mic input sample rate
    output_sample_rate=24000,  # Provider output sample rate
)

# Mount WebRTC endpoints on FastAPI app
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")

WebRTC-based transport using FastRTC in passthrough mode (no VAD). Audio flows bidirectionally between the browser and the speech-to-speech AI provider, which handles its own server-side VAD.

Unlike the FastRTCVoiceBackend (which uses ReplyOnPause for VAD in the traditional VoiceChannel pipeline), this transport simply passes audio through — ideal for speech-to-speech providers like OpenAI Realtime or Gemini Live.

Install with: pip install roomkit[fastrtc]

Connection flow:

  1. mount_fastrtc_realtime(app, transport) creates a FastRTC Stream with a passthrough handler
  2. Browser connects via WebRTC → FastRTC calls handler.copy(), then start_up()
  3. start_up() reads webrtc_id from FastRTC context, registers with transport
  4. Transport fires on_client_connected callback (if set) — use to auto-create sessions
  5. App calls channel.start_session(room_id, participant_id, connection=webrtc_id)
  6. start_session() calls transport.accept(session, webrtc_id), which maps the session to the handler
  7. receive()/emit() now flow audio with session context

Audio format conversion:

FastRTC works with (sample_rate, numpy.ndarray[int16]) tuples. The transport ABC uses raw bytes (PCM16 LE). Conversion is automatic:

  • Inbound: audio_array.tobytes() → fire callbacks with bytes
  • Outbound: np.frombuffer(audio_bytes, dtype=np.int16) → return as (rate, array)
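The round trip can be sketched with numpy (assuming a little-endian platform, where int16 tobytes() yields PCM16 LE; the sample values are illustrative):

```python
import numpy as np

# Inbound: FastRTC (rate, int16 array) frame -> raw PCM16 bytes for the provider
rate = 16000
frame = np.array([0, 1000, -1000, 32767], dtype=np.int16)
pcm_bytes = frame.tobytes()  # 2 bytes per sample

# Outbound: raw bytes from the provider -> int16 array for FastRTC
restored = np.frombuffer(pcm_bytes, dtype=np.int16)
```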

DataChannel messages:

JSON messages (transcriptions, speaking indicators) are sent via the WebRTC DataChannel using send_message().

| Transport | Protocol | VAD | Dependency |
|-----------|----------|-----|------------|
| WebSocketRealtimeTransport | WebSocket | Provider-side | roomkit[websocket] |
| FastRTCRealtimeTransport | WebRTC | Provider-side | roomkit[fastrtc] |

Lazy-loaded via roomkit.voice.get_fastrtc_realtime_transport() and roomkit.voice.get_mount_fastrtc_realtime() to avoid requiring fastrtc/numpy at import time.

Mock Classes

MockRealtimeProvider

MockRealtimeProvider()

Bases: RealtimeVoiceProvider

Mock realtime voice provider for testing.

Tracks all method calls and provides helpers to simulate provider events (transcriptions, audio, tool calls, etc.).

Example

provider = MockRealtimeProvider()

# Track calls
await provider.connect(session, system_prompt="Hello")
assert provider.calls[-1].method == "connect"

# Simulate events
await provider.simulate_transcription(session, "Hi there", "user", True)
await provider.simulate_audio(session, b"audio-data")
await provider.simulate_tool_call(session, "call-1", "get_weather", {"city": "NYC"})

simulate_audio async

simulate_audio(session, audio)

Simulate audio output from the provider.

simulate_transcription async

simulate_transcription(session, text, role='assistant', is_final=True)

Simulate a transcription event from the provider.

simulate_speech_start async

simulate_speech_start(session)

Simulate speech start detection.

simulate_speech_end async

simulate_speech_end(session)

Simulate speech end detection.

simulate_tool_call async

simulate_tool_call(session, call_id, name, arguments=None)

Simulate a tool call from the provider.

simulate_response_start async

simulate_response_start(session)

Simulate response generation starting.

simulate_response_end async

simulate_response_end(session)

Simulate response generation ending.

simulate_error async

simulate_error(session, code, message)

Simulate a provider error.

MockRealtimeTransport

MockRealtimeTransport()

Bases: VoiceBackend

Mock realtime audio transport for testing.

Tracks all method calls and provides helpers to simulate client events (audio, disconnection).

Example

transport = MockRealtimeTransport()

await transport.accept(session, "fake-ws")
assert transport.calls[-1].method == "accept"

# Simulate client audio
await transport.simulate_client_audio(session, b"audio-data")

send_message async

send_message(session, message)

Send a JSON message (not part of VoiceBackend ABC, used by tests).

simulate_client_audio async

simulate_client_audio(session, audio)

Simulate audio received from the client browser.

simulate_client_disconnect async

simulate_client_disconnect(session)

Simulate the client disconnecting.

Usage Example

from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

# Configure provider and transport
provider = GeminiLiveProvider(api_key="...")
transport = WebSocketRealtimeTransport()

# Create and register the channel
channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Create room and attach
room = await kit.create_room(room_id="voice-room")
await kit.attach_channel("voice-room", "realtime-voice")

# When a WebSocket connection arrives:
session = await channel.start_session(
    "voice-room", "participant-1", websocket
)

# Transcriptions are emitted as RoomEvents —
# other channels in the room see the conversation.

# Text from other channels is injected into the AI session:
# supervisor types "Offer 20% discount" via WebSocket →
# AI incorporates it into its next spoken response.

With Tool Calling (MCP)

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Connect to MCP server (streamablehttp_client and ClientSession
# are async context managers, not plain awaitables)
async with streamablehttp_client("http://localhost:9998/mcp") as (read, write, _):
    async with ClientSession(read, write) as mcp:
        await mcp.initialize()

        # Discover tools and convert to provider tool definitions
        tools_result = await mcp.list_tools()
        tools = [
            {
                "name": t.name,
                "description": t.description or t.name,
                "parameters": t.inputSchema or {"type": "object", "properties": {}},
            }
            for t in tools_result.tools
        ]

        # Tool handler routes calls to MCP
        async def handle_tool(session, name, arguments):
            result = await mcp.call_tool(name, arguments)
            return {
                "result": "\n".join(
                    b.text for b in result.content if hasattr(b, "text")
                )
            }

        channel = RealtimeVoiceChannel(
            "realtime-voice",
            provider=provider,
            transport=transport,
            system_prompt="You are a helpful assistant with access to tools.",
            tools=tools,
            tool_handler=handle_tool,
        )

With FastRTC (WebRTC)

from fastapi import FastAPI
from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

app = FastAPI()

provider = GeminiLiveProvider(api_key="...")
transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,
    output_sample_rate=24000,
)

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Mount WebRTC endpoints
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")

# Auto-create sessions when WebRTC clients connect
async def on_connected(webrtc_id: str):
    room = await kit.create_room()
    await kit.attach_channel(room.id, "realtime-voice")
    await channel.start_session(room.id, "user-1", connection=webrtc_id)

transport.on_client_connected(on_connected)