Realtime Voice¶
Speech-to-speech AI conversations using providers like Google Gemini Live and OpenAI Realtime. Audio flows directly between the client and the AI provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.
RealtimeVoiceChannel¶
RealtimeVoiceChannel ¶
```python
RealtimeVoiceChannel(
    channel_id,
    *,
    provider,
    transport,
    system_prompt=None,
    voice=None,
    tools=None,
    temperature=None,
    input_sample_rate=16000,
    output_sample_rate=24000,
    transport_sample_rate=None,
    emit_transcription_events=True,
    tool_handler=None,
    mute_on_tool_call=False,
    tool_result_max_length=16384,
)
```
Bases: Channel
Real-time voice channel using speech-to-speech AI providers.
Wraps APIs like OpenAI Realtime and Gemini Live as a first-class RoomKit channel. Audio flows directly between the user's browser and the provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.
Category is TRANSPORT so that:
- on_event() receives broadcasts (for text injection from supervisors)
- deliver() is called but returns empty (customer is on voice)
Example

```python
from roomkit.voice.realtime.mock import MockRealtimeProvider, MockRealtimeTransport

provider = MockRealtimeProvider()
transport = MockRealtimeTransport()

channel = RealtimeVoiceChannel(
    "realtime-1",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful agent.",
)
kit.register_channel(channel)
```
Initialize realtime voice channel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `channel_id` | `str` | Unique channel identifier. | *required* |
| `provider` | `RealtimeVoiceProvider` | The realtime voice provider (OpenAI, Gemini, etc.). | *required* |
| `transport` | `VoiceBackend` | The audio transport (WebSocket, etc.). | *required* |
| `system_prompt` | `str \| None` | Default system prompt for the AI. | `None` |
| `voice` | `str \| None` | Default voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | Default tool/function definitions. | `None` |
| `temperature` | `float \| None` | Default sampling temperature. | `None` |
| `input_sample_rate` | `int` | Default input audio sample rate (Hz). | `16000` |
| `output_sample_rate` | `int` | Default output audio sample rate (Hz). | `24000` |
| `transport_sample_rate` | `int \| None` | Sample rate of audio from the transport (Hz). When set and different from the provider rates, enables automatic resampling. | `None` |
| `emit_transcription_events` | `bool` | If True, emit final transcriptions as RoomEvents so other channels see them. | `True` |
| `tool_handler` | `ToolHandler \| None` | Async callable to execute tool calls. Signature: `(session, tool_name, arguments)`. | `None` |
| `mute_on_tool_call` | `bool` | If True, mute the transport microphone during tool execution to prevent barge-in that causes providers (e.g. Gemini) to silently drop the tool result. | `False` |
| `tool_result_max_length` | `int` | Maximum character length of tool results before truncation. Large results (e.g. SVG payloads) can overflow the provider's context window. | `16384` |
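The `tool_result_max_length` guard can be pictured as a simple clipping step. The sketch below is illustrative only; the channel's actual truncation strategy may differ:

```python
TOOL_RESULT_MAX_LENGTH = 16384  # matches the channel default

def truncate_tool_result(result: str, limit: int = TOOL_RESULT_MAX_LENGTH) -> str:
    """Clip oversized tool results so they cannot overflow the provider's context window."""
    if len(result) <= limit:
        return result
    return result[:limit]

print(len(truncate_tool_result("x" * 100_000)))  # 16384
```

Raising the limit trades context-window headroom for completeness of large tool outputs.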
set_framework ¶
Set the framework reference for event routing.
Called automatically when the channel is registered with RoomKit.
on_trace ¶
Register a trace observer and bridge to the transport.
inject_text async ¶
Inject a text turn into the provider session.
Useful for nudging the provider when its server-side VAD stalls (e.g. Gemini ignoring valid speech after turn_complete).
start_session async ¶
Start a new realtime voice session.
Connects both the transport (client audio) and the provider (AI service), then fires a framework event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `room_id` | `str` | The room to join. | *required* |
| `participant_id` | `str` | The participant's ID. | *required* |
| `connection` | `Any` | Protocol-specific connection (e.g. WebSocket). | *required* |
| `metadata` | `dict[str, Any] \| None` | Optional session metadata. May include overrides for `system_prompt`, `voice`, `tools`, `temperature`. | `None` |

Returns:

| Type | Description |
|---|---|
| `VoiceSession` | The created VoiceSession. |
end_session async ¶
End a realtime voice session.
Disconnects both provider and transport, fires framework event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to end. | *required* |
reconfigure_session async ¶
```python
reconfigure_session(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)
```
Reconfigure an active session with new agent parameters.
Used during agent handoff to switch the AI personality, voice, and tools. Providers with session resumption (e.g. Gemini Live) preserve conversation history across the reconfiguration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session to reconfigure. | *required* |
| `system_prompt` | `str \| None` | New system instructions for the AI. | `None` |
| `voice` | `str \| None` | New voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | New tool/function definitions. | `None` |
| `temperature` | `float \| None` | New sampling temperature. | `None` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration overrides. | `None` |
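As a sketch, a handoff might bundle the new persona as keyword arguments. The agent name, prompt, and `lookup_invoice` tool below are hypothetical, for illustration only:

```python
# Hypothetical handoff target: a billing specialist persona.
billing_agent = {
    "system_prompt": "You are a billing specialist. Be concise.",
    "voice": "Aoede",
    "tools": [{
        "name": "lookup_invoice",  # hypothetical tool
        "description": "Fetch an invoice by ID",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
        },
    }],
    "temperature": 0.4,
}

# In an async context, with `channel` and an active `session`:
# await channel.reconfigure_session(session, **billing_agent)
```

Only the parameters you pass change; omitted ones keep their current values (`None` means "leave as is").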
connect_session async ¶
Accept a realtime voice session via process_inbound.
Delegates to `start_session`, which handles provider/transport connection, resampling, and framework events.
disconnect_session async ¶
Clean up realtime sessions on remote disconnect.
update_binding ¶
Update cached bindings for all sessions in a room.
Called by the framework after mute/unmute/set_access so the
audio gate in _forward_client_audio sees the new state.
handle_inbound async ¶
Not used directly — audio flows via start_session.
on_event async ¶
React to events from other channels — TEXT INJECTION.
When a supervisor or other channel sends a message, extract the text and inject it into the provider session so the AI incorporates it. Skips events from this channel (self-loop prevention).
deliver async ¶
No-op delivery — customer is on voice, can't see text.
ToolHandler¶
Async callback for executing tool/function calls from the AI provider. Receives (session, tool_name, arguments) and returns a result dict or JSON string.
Provider ABC¶
RealtimeVoiceProvider ¶
Bases: ABC
Abstract base class for speech-to-speech AI providers.
Wraps APIs like OpenAI Realtime and Gemini Live that handle audio-in → audio-out with built-in AI, VAD, and transcription.
The provider manages a bidirectional audio/event stream with the AI service. Callbacks are registered for events the provider emits.
Example

```python
provider = OpenAIRealtimeProvider(api_key="sk-...", model="gpt-4o-realtime")

provider.on_audio(handle_audio)
provider.on_transcription(handle_transcription)
provider.on_tool_call(handle_tool_call)

await provider.connect(session, system_prompt="You are a helpful agent.")
await provider.send_audio(session, audio_bytes)
await provider.disconnect(session)
```
connect abstractmethod async ¶

```python
connect(
    session,
    *,
    system_prompt=None,
    voice=None,
    tools=None,
    temperature=None,
    input_sample_rate=16000,
    output_sample_rate=24000,
    server_vad=True,
    provider_config=None,
)
```
Connect a session to the provider's AI service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The realtime session to connect. | *required* |
| `system_prompt` | `str \| None` | System instructions for the AI. | `None` |
| `voice` | `str \| None` | Voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | Tool/function definitions the AI can call. | `None` |
| `temperature` | `float \| None` | Sampling temperature. | `None` |
| `input_sample_rate` | `int` | Sample rate of input audio (Hz). | `16000` |
| `output_sample_rate` | `int` | Desired sample rate for output audio (Hz). | `24000` |
| `server_vad` | `bool` | Whether to use server-side voice activity detection. | `True` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration options. Each provider documents which keys it accepts. | `None` |
send_audio abstractmethod async ¶
Send audio data to the provider for processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `audio` | `bytes` | Raw PCM audio bytes. | *required* |
inject_text abstractmethod async ¶
Inject text into the conversation (e.g. supervisor guidance).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `text` | `str` | Text to inject. | *required* |
| `role` | `str` | Role for the injected text (`'user'` or `'system'`). | `'user'` |
submit_tool_result abstractmethod async ¶
Submit a tool call result back to the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `call_id` | `str` | The tool call ID from the `on_tool_call` callback. | *required* |
| `result` | `str` | JSON-serialized result string. | *required* |
interrupt abstractmethod async ¶
Interrupt the current AI response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
disconnect abstractmethod async ¶
Disconnect a session from the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to disconnect. | *required* |
reconfigure async ¶

```python
reconfigure(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)
```
Reconfigure a session with new parameters.
Used during agent handoff to switch the AI personality, voice, and tools. The default implementation disconnects and reconnects; providers with session resumption (e.g. Gemini Live) should override to preserve conversation context.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session to reconfigure. | *required* |
| `system_prompt` | `str \| None` | New system instructions. | `None` |
| `voice` | `str \| None` | New voice ID. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | New tool/function definitions. | `None` |
| `temperature` | `float \| None` | New sampling temperature. | `None` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration overrides. | `None` |
send_event async ¶
Send a raw provider-specific event to the underlying service.
This is an escape hatch for sending protocol-level messages that
are not covered by the standard provider API (e.g. OpenAI's
session.update or input_audio_buffer.commit).
The default implementation raises `NotImplementedError`. Providers that support raw events should override this.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `event` | `dict[str, Any]` | A JSON-serializable dict that will be sent verbatim to the provider's underlying connection. | *required* |
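For example, the OpenAI Realtime messages mentioned above are plain JSON-serializable dicts. The payload shapes below are a sketch; verify the exact field names against OpenAI's current protocol reference:

```python
# Raw protocol messages for OpenAI Realtime (shapes per OpenAI's docs;
# check field names against the current API reference before relying on them).
session_update = {
    "type": "session.update",
    "session": {"turn_detection": {"type": "server_vad"}},
}
commit = {"type": "input_audio_buffer.commit"}

# In an async context, with an active `session`:
# await provider.send_event(session, session_update)
# await provider.send_event(session, commit)
```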
on_audio ¶
Register callback for audio output from the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeAudioCallback` | Called with `(session, audio_bytes)` when the provider produces audio output. | *required* |
on_transcription ¶
Register callback for transcription events.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeTranscriptionCallback` | Called with `(session, text, role, is_final)`. | *required* |
on_speech_start ¶
Register callback for speech start detection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeSpeechStartCallback` | Called with `(session)` when user speech is detected. | *required* |
on_speech_end ¶
Register callback for speech end detection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeSpeechEndCallback` | Called with `(session)` when user speech ends. | *required* |
on_tool_call ¶
Register callback for tool/function calls from the AI.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeToolCallCallback` | Called with `(session, call_id, name, arguments)`. | *required* |
on_response_start ¶
Register callback for when the AI starts generating a response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeResponseStartCallback` | Called with `(session)`. | *required* |
on_response_end ¶
Register callback for when the AI finishes a response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeResponseEndCallback` | Called with `(session)`. | *required* |
on_error ¶
Register callback for provider errors.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeErrorCallback` | Called with `(session, code, message)`. | *required* |
Transport ABC¶
All realtime transports extend VoiceBackend.
Session & State¶
See VoiceSession and VoiceSessionState.
Events¶
RealtimeTranscriptionEvent dataclass ¶
Transcription produced by the realtime provider.
Fired through ON_TRANSCRIPTION hooks. For final transcriptions, the channel emits a RoomEvent so other channels see the conversation.
RealtimeSpeechEvent dataclass ¶
Speech activity detected by the realtime provider's server-side VAD.
RealtimeToolCallEvent dataclass ¶
The realtime provider is requesting a function call.
Fired through ON_REALTIME_TOOL_CALL hooks (sync). The hook must return a result that gets submitted back to the provider.
timestamp class-attribute instance-attribute ¶
When the tool call was received.
RealtimeErrorEvent dataclass ¶
Error from the realtime provider.
timestamp class-attribute instance-attribute ¶
When the error occurred.
Callback Types¶
| Callback | Signature |
|---|---|
| `RealtimeAudioCallback` | `(RealtimeSession, bytes) -> Any` |
| `RealtimeTranscriptionCallback` | `(RealtimeSession, str, str, bool) -> Any` |
| `RealtimeSpeechStartCallback` | `(RealtimeSession) -> Any` |
| `RealtimeSpeechEndCallback` | `(RealtimeSession) -> Any` |
| `RealtimeToolCallCallback` | `(RealtimeSession, str, str, dict) -> Any` |
| `RealtimeResponseStartCallback` | `(RealtimeSession) -> Any` |
| `RealtimeResponseEndCallback` | `(RealtimeSession) -> Any` |
| `RealtimeErrorCallback` | `(RealtimeSession, str, str) -> Any` |
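Any callable matching the documented signature works. A sketch of a transcription observer:

```python
def log_transcription(session, text, role, is_final):
    # RealtimeTranscriptionCallback shape: (session, text, role, is_final)
    marker = "final" if is_final else "partial"
    return f"[{marker}] {role}: {text}"

print(log_transcription(None, "hello", "user", True))  # [final] user: hello
```

Register it with `provider.on_transcription(log_transcription)`.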
Concrete Providers¶
Gemini Live¶
```python
from roomkit.providers.gemini.realtime import GeminiLiveProvider

provider = GeminiLiveProvider(
    api_key="...",
    model="gemini-2.5-flash-native-audio-preview-12-2025",
)
```
Install with: `pip install roomkit[realtime-gemini]`
Supports `provider_config` metadata keys: `language`, `top_p`, `top_k`, `max_output_tokens`, `seed`, `enable_affective_dialog`, `thinking_budget`, `proactive_audio`, `start_of_speech_sensitivity`, `end_of_speech_sensitivity`, `silence_duration_ms`, `prefix_padding_ms`, `no_interruption`.
OpenAI Realtime¶
```python
from roomkit.providers.openai.realtime import OpenAIRealtimeProvider

provider = OpenAIRealtimeProvider(
    api_key="...",
    model="gpt-4o-realtime-preview",
)
```
Install with: `pip install roomkit[realtime-openai]`
Transports¶
WebSocket Transport¶
```python
from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

transport = WebSocketRealtimeTransport()
```
Handles bidirectional audio over WebSocket. The client sends `{"type": "audio", "data": "<base64 PCM>"}`; the server sends audio, transcriptions, and speaking indicators as JSON messages.
Install with: `pip install roomkit[websocket]`
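Producing the client-side audio message is plain base64 over JSON. A sketch of encoding one PCM16 frame (the frame size here is illustrative):

```python
import base64
import json

pcm = b"\x00\x01" * 160  # 160 samples (10 ms) of 16 kHz mono PCM16, illustrative
message = json.dumps({
    "type": "audio",
    "data": base64.b64encode(pcm).decode("ascii"),
})

# The server side decodes it back to raw bytes:
decoded = base64.b64decode(json.loads(message)["data"])
assert decoded == pcm
```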
FastRTC Transport (WebRTC)¶
```python
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,   # Mic input sample rate
    output_sample_rate=24000,  # Provider output sample rate
)

# Mount WebRTC endpoints on FastAPI app
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")
```
WebRTC-based transport using FastRTC in passthrough mode (no VAD). Audio flows bidirectionally between the browser and the speech-to-speech AI provider, which handles its own server-side VAD.
Unlike the FastRTCVoiceBackend (which uses ReplyOnPause for VAD in the traditional VoiceChannel pipeline), this transport simply passes audio through — ideal for speech-to-speech providers like OpenAI Realtime or Gemini Live.
Install with: `pip install roomkit[fastrtc]`
Connection flow:

1. `mount_fastrtc_realtime(app, transport)` creates a FastRTC `Stream` with a passthrough handler.
2. Browser connects via WebRTC → FastRTC calls `handler.copy()` → `start_up()`.
3. `start_up()` reads `webrtc_id` from the FastRTC context and registers with the transport.
4. Transport fires the `on_client_connected` callback (if set); use it to auto-create sessions.
5. App calls `channel.start_session(room_id, participant_id, connection=webrtc_id)`.
6. `start_session()` → `transport.accept(session, webrtc_id)` maps the session to its handler; `receive()`/`emit()` now flow audio with session context.
Audio format conversion:

FastRTC works with `(sample_rate, numpy.ndarray[int16])` tuples. The transport ABC uses raw bytes (PCM16 LE). Conversion is automatic:

- Inbound: `audio_array.tobytes()` → fire callbacks with bytes
- Outbound: `np.frombuffer(audio_bytes, dtype=np.int16)` → return as `(rate, array)`
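The conversion is a lossless roundtrip for PCM16. A minimal sketch (requires numpy; assumes a little-endian host, where `tobytes()` yields PCM16 LE):

```python
import numpy as np

rate = 16000
audio_array = np.array([0, 1000, -1000, 32767], dtype=np.int16)

# Inbound: FastRTC ndarray -> raw PCM16 bytes for the transport ABC
audio_bytes = audio_array.tobytes()

# Outbound: raw bytes -> (rate, ndarray) tuple for FastRTC
frame = (rate, np.frombuffer(audio_bytes, dtype=np.int16))

assert (frame[1] == audio_array).all()
```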
DataChannel messages:
JSON messages (transcriptions, speaking indicators) are sent via the WebRTC DataChannel using `send_message()`.
| Transport | Protocol | VAD | Dependency |
|---|---|---|---|
| `WebSocketRealtimeTransport` | WebSocket | Provider-side | `roomkit[websocket]` |
| `FastRTCRealtimeTransport` | WebRTC | Provider-side | `roomkit[fastrtc]` |
Lazy-loaded via `roomkit.voice.get_fastrtc_realtime_transport()` and `roomkit.voice.get_mount_fastrtc_realtime()` to avoid requiring fastrtc/numpy at import time.
Mock Classes¶
MockRealtimeProvider ¶
Bases: RealtimeVoiceProvider
Mock realtime voice provider for testing.
Tracks all method calls and provides helpers to simulate provider events (transcriptions, audio, tool calls, etc.).
Example

```python
provider = MockRealtimeProvider()

# Track calls
await provider.connect(session, system_prompt="Hello")
assert provider.calls[-1].method == "connect"

# Simulate events
await provider.simulate_transcription(session, "Hi there", "user", True)
await provider.simulate_audio(session, b"audio-data")
await provider.simulate_tool_call(session, "call-1", "get_weather", {"city": "NYC"})
```
MockRealtimeTransport ¶
Bases: VoiceBackend
Mock realtime audio transport for testing.
Tracks all method calls and provides helpers to simulate client events (audio, disconnection).
Example

```python
transport = MockRealtimeTransport()

await transport.accept(session, "fake-ws")
assert transport.calls[-1].method == "accept"

# Simulate client audio
await transport.simulate_client_audio(session, b"audio-data")
```
Usage Example¶
```python
from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

# Configure provider and transport
provider = GeminiLiveProvider(api_key="...")
transport = WebSocketRealtimeTransport()

# Create and register the channel
channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Create room and attach
room = await kit.create_room(room_id="voice-room")
await kit.attach_channel("voice-room", "realtime-voice")

# When a WebSocket connection arrives:
session = await channel.start_session(
    "voice-room", "participant-1", websocket
)

# Transcriptions are emitted as RoomEvents, so other channels
# in the room see the conversation.
# Text from other channels is injected into the AI session:
# a supervisor types "Offer 20% discount" via WebSocket and the
# AI incorporates it into its next spoken response.
```
With Tool Calling (MCP)¶
```python
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Connect to MCP server (both the HTTP client and the session are
# async context managers in the MCP Python SDK)
async with streamablehttp_client("http://localhost:9998/mcp") as (read, write, _):
    async with ClientSession(read, write) as mcp:
        await mcp.initialize()

        # Discover tools
        tools_result = await mcp.list_tools()
        tools = [
            {
                "name": t.name,
                "description": t.description or t.name,
                "parameters": t.inputSchema or {"type": "object", "properties": {}},
            }
            for t in tools_result.tools
        ]

        # Tool handler routes calls to MCP
        async def handle_tool(session, name, arguments):
            result = await mcp.call_tool(name, arguments)
            return {
                "result": "\n".join(
                    b.text for b in result.content if hasattr(b, "text")
                )
            }

        channel = RealtimeVoiceChannel(
            "realtime-voice",
            provider=provider,
            transport=transport,
            system_prompt="You are a helpful assistant with access to tools.",
            tools=tools,
            tool_handler=handle_tool,
        )
```
With FastRTC (WebRTC)¶
```python
from fastapi import FastAPI

from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

app = FastAPI()

provider = GeminiLiveProvider(api_key="...")
transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,
    output_sample_rate=24000,
)

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Mount WebRTC endpoints
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")

# Auto-create sessions when WebRTC clients connect
async def on_connected(webrtc_id: str):
    room = await kit.create_room()
    await kit.attach_channel(room.id, "realtime-voice")
    await channel.start_session(room.id, "user-1", connection=webrtc_id)

transport.on_client_connected(on_connected)
```