Realtime Voice¶
Speech-to-speech AI conversations using providers like Google Gemini Live, OpenAI Realtime, and xAI Grok Realtime. Audio flows directly between the client and the AI provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.
RealtimeVoiceChannel¶

```python
RealtimeVoiceChannel(channel_id, *, provider, transport, system_prompt=None, voice=None, tools=None, temperature=None, input_sample_rate=16000, output_sample_rate=24000, transport_sample_rate=None, emit_transcription_events=True, tool_handler=None, mute_on_tool_call=False, tool_result_max_length=16384, pipeline=None, recording=None, skills=None, script_executor=None)
```
Bases: RealtimeToolsMixin, RealtimeTranscriptionMixin, RealtimeSpeechMixin, RealtimeAudioMixin, RealtimeResponseMixin, VoicePipelineMixin, Channel
Real-time voice channel using speech-to-speech AI providers.
Wraps APIs like OpenAI Realtime and Gemini Live as a first-class RoomKit channel. Audio flows directly between the user's browser and the provider; transcriptions are emitted into the Room so other channels (supervisor dashboards, logging) see the conversation.
Category is TRANSPORT so that:
- on_event() receives broadcasts (for text injection from supervisors)
- deliver() is called but returns empty (customer is on voice)
Example

```python
from roomkit.voice.realtime.mock import MockRealtimeProvider, MockRealtimeTransport

provider = MockRealtimeProvider()
transport = MockRealtimeTransport()

channel = RealtimeVoiceChannel(
    "realtime-1",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful agent.",
)
kit.register_channel(channel)
```
Initialize realtime voice channel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `channel_id` | `str` | Unique channel identifier. | *required* |
| `provider` | `RealtimeVoiceProvider` | The realtime voice provider (OpenAI, Gemini, etc.). | *required* |
| `transport` | `VoiceBackend` | The audio transport (WebSocket, etc.). | *required* |
| `system_prompt` | `str \| None` | Default system prompt for the AI. | `None` |
| `voice` | `str \| None` | Default voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any] \| Any] \| None` | Tool definitions as dicts, or `Tool` objects with attached handlers. | `None` |
| `temperature` | `float \| None` | Default sampling temperature. | `None` |
| `input_sample_rate` | `int` | Default input audio sample rate (Hz). | `16000` |
| `output_sample_rate` | `int` | Default output audio sample rate (Hz). | `24000` |
| `transport_sample_rate` | `int \| None` | Sample rate of audio from the transport (Hz). When set and different from the provider rates, enables automatic resampling. | `None` |
| `emit_transcription_events` | `bool` | If True, emit final transcriptions as RoomEvents so other channels see them. | `True` |
| `tool_handler` | `ToolHandler \| None` | Async callable to execute tool calls. Signature: `(tool_name, arguments) -> str` (a JSON string). | `None` |
| `mute_on_tool_call` | `bool` | If True, mute the transport microphone during tool execution to prevent barge-in that causes providers (e.g. Gemini) to silently drop the tool result. | `False` |
| `tool_result_max_length` | `int` | Maximum character length of tool results before truncation. Large results (e.g. SVG payloads) can overflow the provider's context window. | `16384` |
| `pipeline` | `AudioPipelineConfig \| None` | Optional audio pipeline configuration. | `None` |
| `recording` | `Any \| None` | Optional recording configuration. | `None` |
| `skills` | `SkillRegistry \| None` | Optional skill registry. | `None` |
| `script_executor` | `ScriptExecutor \| None` | Optional script executor. | `None` |
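The `tool_result_max_length` clamp can be sketched as follows. This is a hedged illustration of the documented behavior, not RoomKit's actual code; the function name and truncation marker are made up.

```python
# Illustrative sketch: oversized tool results are clamped before being
# submitted to the provider, so they cannot overflow its context window.
TOOL_RESULT_MAX_LENGTH = 16384  # the documented default

def truncate_tool_result(result: str, max_length: int = TOOL_RESULT_MAX_LENGTH) -> str:
    """Clamp a tool result string to at most max_length characters."""
    if len(result) <= max_length:
        return result
    return result[:max_length] + "...[truncated]"
```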
wait_idle async ¶
Wait until all sessions in the room are idle (not speaking).
An idle session has finished its last response and all audio has been forwarded to the transport.
set_framework ¶
Set the framework reference for event routing.
Called automatically when the channel is registered with RoomKit.
on_trace ¶
Register a trace observer and bridge to the transport.
configure ¶
Update channel defaults for future sessions.
Active sessions are not affected — use reconfigure_session
for those.
inject_text async ¶
Inject a text turn into the provider session.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active voice session. | *required* |
| `text` | `str` | Text to inject. | *required* |
| `role` | `str` | Role for the text (`'user'` or `'system'`). | `'user'` |
| `silent` | `bool` | If True, add to conversation context without requesting a response. The agent sees the text on its next turn but does not react immediately. | `False` |
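The `silent` flag's semantics can be mimicked with a small self-contained sketch (illustrative only, not RoomKit internals): injected text always joins the conversation context, but a response is requested only when `silent` is False.

```python
# Sketch of silent injection: context always grows; responses are only
# requested for non-silent injections.
context: list[tuple[str, str]] = []
responses_requested = 0

def inject_text_sketch(text: str, role: str = "user", silent: bool = False) -> None:
    global responses_requested
    context.append((role, text))
    if not silent:
        responses_requested += 1

inject_text_sketch("Offer 20% discount", role="system", silent=True)  # context only
inject_text_sketch("Hello?")  # also requests an immediate response
```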
start_session async ¶
Start a new realtime voice session.
Connects both the transport (client audio) and the provider (AI service), then fires a framework event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `room_id` | `str` | The room to join. | *required* |
| `participant_id` | `str` | The participant's ID. | *required* |
| `connection` | `Any` | Protocol-specific connection (e.g. WebSocket). | *required* |
| `metadata` | `dict[str, Any] \| None` | Optional session metadata. May include overrides for system_prompt, voice, tools, temperature. | `None` |

Returns:

| Type | Description |
|---|---|
| `VoiceSession` | The created VoiceSession. |
end_session async ¶
End a realtime voice session.
Disconnects both provider and transport, fires framework event.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to end. | *required* |
reconfigure_session async ¶

```python
reconfigure_session(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)
```
Reconfigure an active session with new agent parameters.
Used during agent handoff to switch the AI personality, voice, and tools. Providers with session resumption (e.g. Gemini Live) preserve conversation history across the reconfiguration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session to reconfigure. | *required* |
| `system_prompt` | `str \| None` | New system instructions for the AI. | `None` |
| `voice` | `str \| None` | New voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | New tool/function definitions. | `None` |
| `temperature` | `float \| None` | New sampling temperature. | `None` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration overrides. | `None` |
connect_session async ¶
Accept a realtime voice session via process_inbound.
Delegates to `start_session`, which handles provider/transport connection, resampling, and framework events.
disconnect_session async ¶
Clean up realtime sessions on remote disconnect.
update_binding ¶
Update cached bindings for all sessions in a room.
Called by the framework after mute/unmute/set_access so the
audio gate in _pipeline_on_audio_received (pipeline path)
or _forward_client_audio (direct path) sees the new state.
handle_inbound async ¶
Not used directly — audio flows via start_session.
on_event async ¶
React to events from other channels — TEXT INJECTION.
When a supervisor or other channel sends a message, extract the text and inject it into the provider session so the AI incorporates it. Skips events from this channel (self-loop prevention).
deliver async ¶
No-op delivery — customer is on voice, can't see text.
ToolHandler¶
Async callback for executing tool/function calls from the AI provider. Receives (tool_name, arguments) and returns a JSON string.
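A ToolHandler-shaped callable can be sketched as below. The `get_weather` tool and its return values are illustrative only; the shape that matters is async, `(tool_name, arguments)` in, JSON string out.

```python
import asyncio
import json

# Illustrative ToolHandler: receives (tool_name, arguments), returns a
# JSON-serialized result string.
async def handle_tool(tool_name: str, arguments: dict) -> str:
    if tool_name == "get_weather":
        return json.dumps({"city": arguments["city"], "temp_c": 21})
    return json.dumps({"error": f"unknown tool: {tool_name}"})

result = asyncio.run(handle_tool("get_weather", {"city": "NYC"}))
```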
Provider ABC¶
RealtimeVoiceProvider ¶
Bases: ABC
Abstract base class for speech-to-speech AI providers.
Wraps APIs like OpenAI Realtime and Gemini Live that handle audio-in → audio-out with built-in AI, VAD, and transcription.
The provider manages a bidirectional audio/event stream with the AI service. Callbacks are registered for events the provider emits.
Example
```python
provider = OpenAIRealtimeProvider(api_key="sk-...", model="gpt-realtime-1.5")

provider.on_audio(handle_audio)
provider.on_transcription(handle_transcription)
provider.on_tool_call(handle_tool_call)

await provider.connect(session, system_prompt="You are a helpful agent.")
await provider.send_audio(session, audio_bytes)
await provider.disconnect(session)
```
connect abstractmethod async ¶

```python
connect(session, *, system_prompt=None, voice=None, tools=None, temperature=None, input_sample_rate=16000, output_sample_rate=24000, server_vad=True, provider_config=None)
```
Connect a session to the provider's AI service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The realtime session to connect. | *required* |
| `system_prompt` | `str \| None` | System instructions for the AI. | `None` |
| `voice` | `str \| None` | Voice ID for audio output. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | Tool/function definitions the AI can call. | `None` |
| `temperature` | `float \| None` | Sampling temperature. | `None` |
| `input_sample_rate` | `int` | Sample rate of input audio (Hz). | `16000` |
| `output_sample_rate` | `int` | Desired sample rate for output audio (Hz). | `24000` |
| `server_vad` | `bool` | Whether to use server-side voice activity detection. | `True` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration options. Each provider documents which keys it accepts. | `None` |
send_audio abstractmethod async ¶
Send audio data to the provider for processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `audio` | `bytes` | Raw PCM audio bytes. | *required* |
inject_text abstractmethod async ¶
Inject text into the conversation (e.g. supervisor guidance).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `text` | `str` | Text to inject. | *required* |
| `role` | `str` | Role for the injected text (`'user'` or `'system'`). | `'user'` |
| `silent` | `bool` | If True, add to conversation context without requesting a response. The agent sees the text on its next turn but does not react immediately. | `False` |
submit_tool_result abstractmethod async ¶
Submit a tool call result back to the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `call_id` | `str` | The tool call ID from the on_tool_call callback. | *required* |
| `result` | `str` | JSON-serialized result string. | *required* |
interrupt abstractmethod async ¶
Interrupt the current AI response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
disconnect abstractmethod async ¶
Disconnect a session from the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to disconnect. | *required* |
reconfigure async ¶

```python
reconfigure(session, *, system_prompt=None, voice=None, tools=None, temperature=None, provider_config=None)
```
Reconfigure a session with new parameters.
Used during agent handoff to switch the AI personality, voice, and tools. The default implementation disconnects and reconnects; providers with session resumption (e.g. Gemini Live) should override to preserve conversation context.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session to reconfigure. | *required* |
| `system_prompt` | `str \| None` | New system instructions. | `None` |
| `voice` | `str \| None` | New voice ID. | `None` |
| `tools` | `list[dict[str, Any]] \| None` | New tool/function definitions. | `None` |
| `temperature` | `float \| None` | New sampling temperature. | `None` |
| `provider_config` | `dict[str, Any] \| None` | Provider-specific configuration overrides. | `None` |
send_event async ¶
Send a raw provider-specific event to the underlying service.
This is an escape hatch for sending protocol-level messages that
are not covered by the standard provider API (e.g. OpenAI's
session.update or input_audio_buffer.commit).
The default implementation raises `NotImplementedError`. Providers that support raw events should override this.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The active session. | *required* |
| `event` | `dict[str, Any]` | A JSON-serializable dict that will be sent verbatim to the provider's underlying connection. | *required* |
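A raw event might look like the following. The `session.update` message type is named in the text above; the `"instructions"` field is illustrative, and the exact payload shape depends on OpenAI's Realtime protocol version.

```python
# Hypothetical raw event for OpenAI's Realtime protocol, sent verbatim
# through the escape hatch described above.
event = {"type": "session.update", "session": {"instructions": "Answer briefly."}}
# await provider.send_event(session, event)  # hypothetical usage
```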
is_responding ¶
Check if the provider is actively generating a response.
Returns True between `response.created` and `response.done`.
on_audio ¶
Register callback for audio output from the provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeAudioCallback` | Called with (session, audio_bytes) when the provider produces audio output. | *required* |
on_transcription ¶
Register callback for transcription events.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeTranscriptionCallback` | Called with (session, text, role, is_final). | *required* |
on_speech_start ¶
Register callback for speech start detection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeSpeechStartCallback` | Called with (session) when user speech is detected. | *required* |
on_speech_end ¶
Register callback for speech end detection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeSpeechEndCallback` | Called with (session) when user speech ends. | *required* |
on_tool_call ¶
Register callback for tool/function calls from the AI.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeToolCallCallback` | Called with (session, call_id, name, arguments). | *required* |
on_response_start ¶
Register callback for when the AI starts generating a response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeResponseStartCallback` | Called with (session). | *required* |
on_response_end ¶
Register callback for when the AI finishes a response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeResponseEndCallback` | Called with (session). | *required* |
on_error ¶
Register callback for provider errors.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `RealtimeErrorCallback` | Called with (session, code, message). | *required* |
send_activity_start async ¶
Signal that user speech activity has started.
Used in manual VAD mode: local VAD detects speech and the channel
calls this to inform the provider. The provider translates this
into its protocol's activity signal (e.g. Gemini ActivityStart).
Default: no-op. Override when the provider supports manual mode.
send_activity_end async ¶
Signal that user speech activity has ended.
Used in manual VAD mode: local VAD detects silence and the channel
calls this to inform the provider. The provider translates this
into its protocol's activity signal (e.g. Gemini ActivityEnd).
Default: no-op. Override when the provider supports manual mode.
Transport ABC¶
All realtime transports extend VoiceBackend.
Session & State¶
See VoiceSession and VoiceSessionState.
Events¶
RealtimeTranscriptionEvent dataclass ¶

```python
RealtimeTranscriptionEvent(session, text, role, is_final, was_barge_in=False, item_id=None, timestamp=_utcnow())
```
Transcription produced by the realtime provider.
Fired through ON_TRANSCRIPTION hooks. For final transcriptions, the channel emits a RoomEvent so other channels see the conversation.
was_barge_in class-attribute instance-attribute ¶
True if this transcription resulted from a barge-in (user interrupted the AI while it was speaking).
item_id class-attribute instance-attribute ¶
Provider-specific item ID for correlation.
timestamp class-attribute instance-attribute ¶
When the transcription was received.
RealtimeSpeechEvent dataclass ¶
Speech activity detected by the realtime provider's server-side VAD.
RealtimeToolCallEvent dataclass ¶
Deprecated — use `roomkit.models.tool_call.ToolCallEvent` instead.
Kept for backward compatibility. The unified ON_TOOL_CALL hook fires `ToolCallEvent` from both AIChannel and RealtimeVoiceChannel.
timestamp class-attribute instance-attribute ¶
When the tool call was received.
RealtimeErrorEvent dataclass ¶
Error from the realtime provider.
timestamp class-attribute instance-attribute ¶
When the error occurred.
Callback Types¶
| Callback | Signature |
|---|---|
| `RealtimeAudioCallback` | `(RealtimeSession, bytes) -> Any` |
| `RealtimeTranscriptionCallback` | `(RealtimeSession, str, str, bool) -> Any` |
| `RealtimeSpeechStartCallback` | `(RealtimeSession) -> Any` |
| `RealtimeSpeechEndCallback` | `(RealtimeSession) -> Any` |
| `RealtimeToolCallCallback` | `(RealtimeSession, str, str, dict) -> Any` |
| `RealtimeResponseStartCallback` | `(RealtimeSession) -> Any` |
| `RealtimeResponseEndCallback` | `(RealtimeSession) -> Any` |
| `RealtimeErrorCallback` | `(RealtimeSession, str, str) -> Any` |
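The registration pattern behind these callback types can be sketched with a minimal stand-in class (illustrative, not a RoomKit class): the provider stores one callback per event kind and awaits it when that event arrives.

```python
import asyncio

# Minimal stand-in demonstrating the on_* registration pattern.
class CallbackDemo:
    def __init__(self) -> None:
        self._on_transcription = None

    def on_transcription(self, callback) -> None:
        self._on_transcription = callback

    async def _fire_transcription(self, session, text, role, is_final) -> None:
        # The real provider would call this when its service emits an event.
        if self._on_transcription is not None:
            await self._on_transcription(session, text, role, is_final)

received: list[tuple] = []

async def record(session, text, role, is_final):
    received.append((text, role, is_final))

demo = CallbackDemo()
demo.on_transcription(record)
asyncio.run(demo._fire_transcription(None, "hello", "user", True))
```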
Concrete Providers¶
Gemini Live¶
```python
from roomkit.providers.gemini.realtime import GeminiLiveProvider

provider = GeminiLiveProvider(
    api_key="...",
    model="gemini-2.5-flash-native-audio-preview-12-2025",
)
```

Install with: `pip install roomkit[realtime-gemini]`
Supports provider_config metadata keys: language, top_p, top_k, max_output_tokens, seed, enable_affective_dialog, thinking_budget, proactive_audio, start_of_speech_sensitivity, end_of_speech_sensitivity, silence_duration_ms, prefix_padding_ms, no_interruption.
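A `provider_config` override could be passed through session metadata like this. The keys come from the list above; the values shown are illustrative, not recommended defaults.

```python
# Hypothetical session metadata carrying Gemini-specific overrides.
metadata = {
    "provider_config": {
        "language": "en-US",
        "silence_duration_ms": 500,
        "enable_affective_dialog": True,
    }
}
# session = await channel.start_session(room_id, participant_id, ws, metadata=metadata)
```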
OpenAI Realtime¶
```python
from roomkit.providers.openai.realtime import OpenAIRealtimeProvider

provider = OpenAIRealtimeProvider(
    api_key="...",
    model="gpt-4o-realtime-preview",
)
```

Install with: `pip install roomkit[realtime-openai]`
xAI Grok Realtime¶
```python
from roomkit.providers.xai.config import XAIRealtimeConfig
from roomkit.providers.xai.realtime import XAIRealtimeProvider

provider = XAIRealtimeProvider(
    XAIRealtimeConfig(
        api_key="...",
        model="grok-3-fast",
        voice="eve",
        transcription_model="grok-2-audio",
    )
)
```

Install with: `pip install websockets`
Supports provider_config metadata keys: turn_detection_type, threshold, silence_duration_ms, prefix_padding_ms, transcription_model, input_audio_format, output_audio_format, modalities.
Native tools: web_search, x_search (passed as {"type": "web_search"}).
Available voices: eve, ara, rex, sal, leo.
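Per the note above, native tools are passed by type rather than as function definitions:

```python
# The two documented native xAI tools, passed by type.
tools = [
    {"type": "web_search"},
    {"type": "x_search"},
]
```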
Transports¶
WebSocket Transport¶
```python
from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

transport = WebSocketRealtimeTransport()
```

Handles bidirectional audio over WebSocket. The client sends `{"type": "audio", "data": "<base64 PCM>"}`; the server sends audio, transcriptions, and speaking indicators as JSON messages.
Install with: pip install roomkit[websocket]
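The client-side message shape described above can be built like this (a sketch of the documented wire format, not RoomKit code):

```python
import base64
import json

# Wrap raw PCM16 bytes in the documented JSON envelope.
def encode_audio_message(pcm: bytes) -> str:
    return json.dumps({"type": "audio", "data": base64.b64encode(pcm).decode("ascii")})

msg = encode_audio_message(b"\x00\x01\x02\x03")
```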
FastRTC Transport (WebRTC)¶
```python
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,   # Mic input sample rate
    output_sample_rate=24000,  # Provider output sample rate
)

# Mount WebRTC endpoints on FastAPI app
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")
```
WebRTC-based transport using FastRTC in passthrough mode (no VAD). Audio flows bidirectionally between the browser and the speech-to-speech AI provider, which handles its own server-side VAD.
Unlike the FastRTCVoiceBackend (which uses ReplyOnPause for VAD in the traditional VoiceChannel pipeline), this transport simply passes audio through — ideal for speech-to-speech providers like OpenAI Realtime or Gemini Live.
Install with: pip install roomkit[fastrtc]
Connection flow:

1. `mount_fastrtc_realtime(app, transport)` creates a FastRTC `Stream` with a passthrough handler
2. Browser connects via WebRTC → FastRTC calls `handler.copy()` → `start_up()`
3. `start_up()` reads `webrtc_id` from FastRTC context, registers with transport
4. Transport fires `on_client_connected` callback (if set) — use to auto-create sessions
5. App calls `channel.start_session(room_id, participant_id, connection=webrtc_id)`
6. `start_session()` → `transport.accept(session, webrtc_id)` maps session to handler
7. `receive()`/`emit()` now flow audio with session context
Audio format conversion:

FastRTC works with `(sample_rate, numpy.ndarray[int16])` tuples. The transport ABC uses raw bytes (PCM16 LE). Conversion is automatic:

- Inbound: `audio_array.tobytes()` → fire callbacks with bytes
- Outbound: `np.frombuffer(audio_bytes, dtype=np.int16)` → return as `(rate, array)`
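The two conversions written out as standalone helpers (function names are illustrative; the conversions themselves match the bullets above):

```python
import numpy as np

# Inbound: FastRTC frame (rate, int16 array) -> raw PCM16 LE bytes.
def inbound_to_bytes(frame) -> bytes:
    _rate, samples = frame
    return samples.astype("<i2").tobytes()

# Outbound: raw PCM16 LE bytes -> FastRTC frame (rate, int16 array).
def outbound_to_frame(audio_bytes: bytes, rate: int = 24000):
    return rate, np.frombuffer(audio_bytes, dtype="<i2")

rate, arr = outbound_to_frame(inbound_to_bytes((24000, np.array([1, -2, 3], dtype=np.int16))))
```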
DataChannel messages:
JSON messages (transcriptions, speaking indicators) are sent via the WebRTC DataChannel using send_message().
| Transport | Protocol | VAD | Dependency |
|---|---|---|---|
| `WebSocketRealtimeTransport` | WebSocket | Provider-side | `roomkit[websocket]` |
| `FastRTCRealtimeTransport` | WebRTC | Provider-side | `roomkit[fastrtc]` |
Lazy-loaded via roomkit.voice.get_fastrtc_realtime_transport() and roomkit.voice.get_mount_fastrtc_realtime() to avoid requiring fastrtc/numpy at import time.
Mock Classes¶
MockRealtimeProvider ¶
Bases: RealtimeVoiceProvider
Mock realtime voice provider for testing.
Tracks all method calls and provides helpers to simulate provider events (transcriptions, audio, tool calls, etc.).
Example

```python
provider = MockRealtimeProvider()

# Track calls
await provider.connect(session, system_prompt="Hello")
assert provider.calls[-1].method == "connect"

# Simulate events
await provider.simulate_transcription(session, "Hi there", "user", True)
await provider.simulate_audio(session, b"audio-data")
await provider.simulate_tool_call(session, "call-1", "get_weather", {"city": "NYC"})
```
MockRealtimeTransport ¶
Bases: VoiceBackend
Mock realtime audio transport for testing.
Tracks all method calls and provides helpers to simulate client events (audio, disconnection).
Example

```python
transport = MockRealtimeTransport()

await transport.accept(session, "fake-ws")
assert transport.calls[-1].method == "accept"

# Simulate client audio
await transport.simulate_client_audio(session, b"audio-data")
```
Usage Example¶
```python
from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.ws_transport import WebSocketRealtimeTransport

# Configure provider and transport
provider = GeminiLiveProvider(api_key="...")
transport = WebSocketRealtimeTransport()

# Create and register the channel
channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Create room and attach
room = await kit.create_room(room_id="voice-room")
await kit.attach_channel("voice-room", "realtime-voice")

# When a WebSocket connection arrives:
session = await channel.start_session(
    "voice-room", "participant-1", websocket
)

# Transcriptions are emitted as RoomEvents —
# other channels in the room see the conversation.
# Text from other channels is injected into the AI session:
# supervisor types "Offer 20% discount" via WebSocket →
# AI incorporates it into its next spoken response.
```
With Tool Calling¶
```python
import json
from roomkit import Tool

async def lookup_order(order_id: str) -> str:
    return json.dumps({"status": "shipped", "eta": "Tomorrow"})

order_tool = Tool(
    name="lookup_order",
    description="Look up an order by ID",
    parameters={
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
    handler=lookup_order,
)

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful assistant with access to tools.",
    tools=[order_tool],
)
```
With MCP (Advanced)¶
For MCP integration, use tool_handler to route calls to the MCP server:
```python
import json
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Connect to MCP server
read, write, _ = await streamablehttp_client("http://localhost:9998/mcp")
mcp = ClientSession(read, write)
await mcp.initialize()

# Discover tools
tools_result = await mcp.list_tools()
tools = [
    {
        "name": t.name,
        "description": t.description or t.name,
        "parameters": t.inputSchema or {"type": "object", "properties": {}},
    }
    for t in tools_result.tools
]

# Tool handler routes calls to MCP
async def handle_tool(name, arguments):
    result = await mcp.call_tool(name, arguments)
    text = "\n".join(b.text for b in result.content if hasattr(b, "text"))
    return json.dumps({"result": text})

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful assistant with access to tools.",
    tools=tools,
    tool_handler=handle_tool,
)
```
With Delegation¶
Wire the delegate_task tool into a voice agent:
```python
from roomkit.tasks import DelegateHandler, setup_realtime_delegation, build_delegate_tool

handler = DelegateHandler(kit, notify="realtime-voice")
tool = build_delegate_tool([("exec-agent", "Runs background tasks")])
setup_realtime_delegation(channel, handler, tool=tool)
```
See setup_realtime_delegation() for details.
With Vision Injection¶
Wire screen/camera vision into the voice session:
```python
from roomkit import setup_realtime_vision

setup_realtime_vision(kit, room_id="room-1", voice_channel_id="realtime-voice")
```
Vision descriptions are injected via inject_text(silent=True) — the agent sees the context on its next turn without interrupting the conversation. Unchanged descriptions are automatically deduplicated.
See setup_realtime_vision() for details.
With FastRTC (WebRTC)¶
```python
from fastapi import FastAPI
from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

app = FastAPI()

provider = GeminiLiveProvider(api_key="...")
transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,
    output_sample_rate=24000,
)

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful voice assistant.",
    voice="Aoede",
)

kit = RoomKit()
kit.register_channel(channel)

# Mount WebRTC endpoints
mount_fastrtc_realtime(app, transport, path="/rtc-realtime")

# Auto-create sessions when WebRTC clients connect
async def on_connected(webrtc_id: str):
    room = await kit.create_room()
    await kit.attach_channel(room.id, "realtime-voice")
    await channel.start_session(room.id, "user-1", connection=webrtc_id)

transport.on_client_connected(on_connected)
```