FastRTC Voice Backend¶
RoomKit provides two FastRTC-based voice backends for browser-to-server real-time audio. Both use the FastRTC library (by Gradio) as the underlying transport.
| Backend | Transport | Use case | VAD |
|---|---|---|---|
| `FastRTCVoiceBackend` | WebSocket | Traditional STT/TTS pipeline | Client-side (pipeline) |
| `FastRTCRealtimeTransport` | WebRTC | Speech-to-speech AI (Gemini Live, OpenAI Realtime) | Server-side (provider) |
Installation¶
Install the FastRTC extra, e.g. `pip install "roomkit[fastrtc]"`. This installs fastrtc and numpy as dependencies.
FastRTCVoiceBackend (WebSocket)¶
The traditional voice pipeline path. Audio flows through VAD, STT, AI, and TTS stages on the server.
Architecture¶
```
Browser mic → WebSocket → FastRTCVoiceBackend
    → AudioPipeline: [Resampler] → [AEC] → [Denoiser] → VAD
    → STT (Deepgram, sherpa-onnx, etc.)
    → AI (Claude, GPT, etc.)
    → TTS (ElevenLabs, etc.)
    → mu-law encode → WebSocket → Browser speaker
```
Quick start¶
```python
from fastapi import FastAPI
from contextlib import asynccontextmanager

from roomkit import RoomKit, VoiceChannel, AIChannel, ChannelCategory
from roomkit.providers.anthropic.ai import AnthropicAIProvider
from roomkit.providers.anthropic.config import AnthropicConfig
from roomkit.voice.backends.fastrtc import FastRTCVoiceBackend, mount_fastrtc_voice
from roomkit.voice.pipeline import AudioPipelineConfig
from roomkit.voice.pipeline.vad.energy import EnergyVADProvider
from roomkit.voice.stt.deepgram import DeepgramConfig, DeepgramSTTProvider
from roomkit.voice.tts.elevenlabs import ElevenLabsConfig, ElevenLabsTTSProvider

kit = RoomKit()

backend = FastRTCVoiceBackend(
    input_sample_rate=48000,   # Browser mic rate
    output_sample_rate=24000,  # TTS output rate
)

vad = EnergyVADProvider(energy_threshold=300.0, silence_threshold_ms=600)
pipeline = AudioPipelineConfig(vad=vad)

stt = DeepgramSTTProvider(config=DeepgramConfig(api_key="...", model="nova-3"))
tts = ElevenLabsTTSProvider(config=ElevenLabsConfig(api_key="..."))

voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)
kit.register_channel(voice)

ai = AIChannel("ai", provider=AnthropicAIProvider(AnthropicConfig(api_key="...")))
kit.register_channel(ai)

async def session_factory(websocket_id: str):
    """Auto-create room + session when a browser connects."""
    room = await kit.create_room()
    await kit.attach_channel(room.id, "voice")
    await kit.attach_channel(room.id, "ai", category=ChannelCategory.INTELLIGENCE)
    # Pull model: kit.join() creates the session, binds it, and wires recording
    session = await kit.join(room.id, "voice", participant_id="browser-user")
    session.metadata["websocket_id"] = websocket_id
    return session

@asynccontextmanager
async def lifespan(app: FastAPI):
    mount_fastrtc_voice(app, backend, path="/voice", session_factory=session_factory)
    yield
    await kit.close()

app = FastAPI(lifespan=lifespan)
```
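To serve the app (the browser snippets later on this page assume `localhost:8000`), use any ASGI server; `uvicorn` is the usual choice and not a RoomKit-specific requirement:

```python
import uvicorn

if __name__ == "__main__":
    # Serve on the port the browser examples below connect to
    uvicorn.run(app, host="0.0.0.0", port=8000)
```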
Constructor parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_sample_rate` | `int` | `48000` | Browser microphone sample rate. FastRTC defaults to 48 kHz. |
| `output_sample_rate` | `int` | `24000` | TTS output sample rate. Must match your TTS provider's native rate. |
| `audio_queue_maxsize` | `int` | `1000` | Per-session outbound audio queue depth. |
mount_fastrtc_voice parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `app` | `FastAPI` | required | FastAPI application instance. |
| `backend` | `FastRTCVoiceBackend` | required | The backend instance. |
| `path` | `str` | `"/fastrtc"` | Base path for the FastRTC endpoints. |
| `session_factory` | `async (str) → VoiceSession` | `None` | Called with websocket_id when a client connects. If not provided, sessions must be created manually before clients connect. |
| `auth` | `async (WebSocket) → dict \| None` | `None` | Authentication callback. Return a metadata dict to accept, `None` to reject. |
Endpoints created¶
When mounted at /voice, FastRTC creates:
- `/voice/websocket/offer` — WebSocket endpoint for audio streaming
- `/voice/webrtc/offer` — WebRTC offer endpoint (POST)
- `/voice/ui` — built-in Gradio UI (useful for quick testing)
Session lifecycle¶
1. Browser connects to `/voice/websocket/offer`
2. Browser sends: `{"event": "start", "websocket_id": "abc123"}`
3. `session_factory("abc123")` → creates Room + VoiceSession
4. Backend registers the WebSocket → fires `on_session_ready` callbacks
5. Audio frames flow: browser → backend → pipeline → STT → AI → TTS → browser
6. Browser disconnects → session cleanup
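Steps 1–2 can be exercised without a browser. A minimal handshake smoke test, assuming the third-party `websockets` package and the `/voice` mount from the quick start:

```python
import asyncio
import json
import uuid

import websockets  # third-party: pip install websockets


async def main() -> None:
    ws_id = str(uuid.uuid4())
    async with websockets.connect("ws://localhost:8000/voice/websocket/offer") as ws:
        # Step 2: the start event makes the server call session_factory(ws_id)
        await ws.send(json.dumps({"event": "start", "websocket_id": ws_id}))
        # Any reply (media or transcription JSON) confirms the session is wired up
        print(await ws.recv())


asyncio.run(main())
```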
FastRTCRealtimeTransport (WebRTC)¶
The speech-to-speech path. Audio passes through to the AI provider (Gemini Live, OpenAI Realtime) which handles VAD and response generation server-side.
Architecture¶
```
Browser mic → WebRTC → FastRTCRealtimeTransport
    → Raw PCM bytes → Provider (Gemini Live / OpenAI Realtime)
    → Provider generates audio + transcriptions
    → mu-law encode → DataChannel → Browser speaker
```
Quick start¶
```python
from fastapi import FastAPI
from contextlib import asynccontextmanager

from roomkit import RoomKit, RealtimeVoiceChannel
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice.realtime.fastrtc_transport import (
    FastRTCRealtimeTransport,
    mount_fastrtc_realtime,
)

kit = RoomKit()

provider = GeminiLiveProvider(
    api_key="...",
    model="gemini-2.5-flash-native-audio-preview-12-2025",
)

transport = FastRTCRealtimeTransport(
    input_sample_rate=16000,
    output_sample_rate=24000,
)

channel = RealtimeVoiceChannel(
    "realtime-voice",
    provider=provider,
    transport=transport,
    system_prompt="You are a helpful assistant.",
    voice="Aoede",
)
kit.register_channel(channel)

async def on_client_connected(webrtc_id: str) -> None:
    room = await kit.create_room()
    await kit.attach_channel(room.id, "realtime-voice")
    await channel.start_session(room.id, "user-1", connection=webrtc_id)

transport.on_client_connected(on_client_connected)

@asynccontextmanager
async def lifespan(app: FastAPI):
    mount_fastrtc_realtime(app, transport, path="/rtc-realtime")
    yield
    await kit.close()

app = FastAPI(lifespan=lifespan)
```
Constructor parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_sample_rate` | `int` | `16000` | Browser microphone sample rate. |
| `output_sample_rate` | `int` | `24000` | Provider output sample rate. |
mount_fastrtc_realtime parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `app` | `FastAPI` | required | FastAPI application instance. |
| `transport` | `FastRTCRealtimeTransport` | required | The transport instance. |
| `path` | `str` | `"/rtc-realtime"` | Base path for WebRTC endpoints. |
| `auth` | `async (context) → dict \| None` | `None` | Authentication callback. Receives the FastRTC context (with webrtc_id, connection info). Return a metadata dict to accept, `None` to reject. |
Endpoints created¶
When mounted at /rtc-realtime:
- `/rtc-realtime/webrtc/offer` — WebRTC SDP offer endpoint (POST)
- `/rtc-realtime/ui` — built-in Gradio UI
Connection flow¶
1. Browser creates an `RTCPeerConnection`
2. Browser creates a DataChannel named `"text"` (required)
3. Browser sends an SDP offer to `/rtc-realtime/webrtc/offer`
4. FastRTC negotiates ICE, DTLS, SRTP
5. `handler.start_up()` → transport registers the handler → fires `on_client_connected`
6. App calls `channel.start_session(room, participant, connection=webrtc_id)`
7. Audio flows bidirectionally via WebRTC media tracks
8. Transcriptions are sent via the DataChannel as JSON
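Step 8's messages are plain JSON over the DataChannel. Application code can push additional messages through the same channel with the transport's `send_message` (see the API reference below); the payload shape here mirrors what the browser clients in the next section parse:

```python
# Assumes an active `session` and the `transport` from the quick start above.
await transport.send_message(
    session,
    {"type": "transcription", "data": {"role": "assistant", "text": "Hello!"}},
)
```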
Audio format¶
Both backends use mu-law (G.711) encoding for outbound audio sent to the browser:
- PCM-16 LE → mu-law (2:1 compression)
- Encoded as base64 and wrapped in JSON: `{"event": "media", "media": {"payload": "..."}}`
- Pure-Python encoder (no `audioop` dependency; compatible with Python 3.13+)
- Pre-computed 16384-entry lookup table for O(1) per-sample encoding
Inbound audio from the browser arrives as:
- WebSocket mode: mu-law encoded, same JSON format
- WebRTC mode: PCM via WebRTC media tracks (decoded by FastRTC to numpy arrays)
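For illustration, here is a self-contained sketch of the outbound encode path (per-sample G.711 mu-law math plus the JSON envelope). This is not RoomKit's implementation — the built-in encoder uses the precomputed 16384-entry table mentioned above — but it produces the same wire format:

```python
import base64
import json

BIAS = 0x84   # standard G.711 bias (132)
CLIP = 32635  # clip magnitude so magnitude + BIAS fits in 15 bits


def mulaw_encode_sample(pcm: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if pcm < 0 else 0x00
    magnitude = min(abs(pcm), CLIP) + BIAS
    # Exponent = position of the highest set bit from bit 7 up to bit 14
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        mask >>= 1
        exponent -= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def wrap_media(pcm16: bytes) -> str:
    """PCM-16 LE bytes → mu-law → base64 → the WebSocket JSON envelope."""
    samples = (int.from_bytes(pcm16[i:i + 2], "little", signed=True)
               for i in range(0, len(pcm16), 2))
    mulaw = bytes(mulaw_encode_sample(s) for s in samples)
    payload = base64.b64encode(mulaw).decode()
    return json.dumps({"event": "media", "media": {"payload": payload}})
```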
Browser client¶
WebSocket connection (JavaScript)¶
```javascript
const ws = new WebSocket('ws://localhost:8000/voice/websocket/offer');
const wsId = crypto.randomUUID();

ws.onopen = () => {
  ws.send(JSON.stringify({ event: 'start', websocket_id: wsId }));

  // Capture mic and send mu-law audio
  navigator.mediaDevices.getUserMedia({ audio: true }).then(stream => {
    const ctx = new AudioContext({ sampleRate: 48000 });
    const source = ctx.createMediaStreamSource(stream);
    const processor = ctx.createScriptProcessor(4096, 1, 1);
    processor.onaudioprocess = (e) => {
      const float32 = e.inputBuffer.getChannelData(0);
      const mulaw = encodeMulaw(float32); // PCM float → mu-law bytes
      const b64 = btoa(String.fromCharCode(...mulaw));
      ws.send(JSON.stringify({ event: 'media', media: { payload: b64 } }));
    };
    source.connect(processor);
    processor.connect(ctx.destination);
  });
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // Play received audio
  if (data.event === 'media') {
    const mulaw = Uint8Array.from(atob(data.media.payload), c => c.charCodeAt(0));
    // Decode mu-law → PCM → play via AudioContext
  }
  // Display transcriptions
  if (data.type === 'transcription') {
    console.log(`${data.data.role}: ${data.data.text}`);
  }
};
```
WebRTC connection (JavaScript)¶
```javascript
const pc = new RTCPeerConnection();
const webrtcId = crypto.randomUUID();

// Mic → peer connection
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach(track => pc.addTrack(track, stream));

// Data channel for transcriptions (must be created before the offer)
const dc = pc.createDataChannel('text');
dc.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcription') {
    console.log(`${data.data.role}: ${data.data.text}`);
  }
  if (data.event === 'media') {
    // Decode mu-law audio and play
  }
};

// Remote audio
pc.ontrack = (event) => {
  const audio = new Audio();
  audio.srcObject = event.streams[0];
  audio.play();
};

// ICE candidates
pc.onicecandidate = ({ candidate }) => {
  if (candidate) {
    fetch('/rtc-realtime/webrtc/offer', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        candidate: candidate.toJSON(),
        webrtc_id: webrtcId,
        type: 'ice-candidate',
      }),
    });
  }
};

// Create and send the offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
const resp = await fetch('/rtc-realtime/webrtc/offer', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ sdp: offer.sdp, type: offer.type, webrtc_id: webrtcId }),
});
const answer = await resp.json();
await pc.setRemoteDescription(answer);
```
Ready-made client
See examples/fastrtc_client.html for a complete browser client that supports both WebSocket and WebRTC modes with mu-law encoding/decoding, VU meter, and transcription display.
Authentication¶
Both backends support pluggable authentication via an auth callback.
WebSocket auth¶
```python
async def authenticate(websocket) -> dict[str, object] | None:
    """Validate a token from the query string or headers."""
    token = websocket.query_params.get("token")
    if not token or not await verify_token(token):  # verify_token: your own check
        return None  # Reject the connection
    return {"user_id": "...", "role": "agent"}  # Accept

mount_fastrtc_voice(app, backend, path="/voice", auth=authenticate,
                    session_factory=session_factory)
```
Auth metadata is available inside session_factory via the auth_context context variable:
```python
from roomkit.voice.auth import auth_context

async def session_factory(websocket_id: str):
    meta = auth_context.get()  # {"user_id": "...", "role": "agent"}
    room = await kit.create_room()
    # ... use meta to customize the session
```
WebRTC auth¶
```python
async def authenticate(ctx) -> dict[str, object] | None:
    """Validate from the FastRTC connection context."""
    # ctx has webrtc_id and connection metadata
    return {"authenticated": True}

mount_fastrtc_realtime(app, transport, path="/rtc-realtime", auth=authenticate)
```
Comparing the two backends¶
| Aspect | FastRTCVoiceBackend | FastRTCRealtimeTransport |
|---|---|---|
| Transport | WebSocket | WebRTC (ICE, DTLS, SRTP) |
| Audio codec | mu-law (both directions) | PCM inbound, mu-law outbound |
| VAD | Pipeline-side (EnergyVAD, SherpaOnnx, etc.) | Provider-side (Gemini/OpenAI) |
| STT/TTS | Separate providers (Deepgram + ElevenLabs) | Built into the provider |
| Channel type | VoiceChannel | RealtimeVoiceChannel |
| Pipeline stages | Full (AEC, AGC, denoiser, VAD, diarization) | None (passthrough) |
| Latency | ~100-200ms (STT + AI + TTS) | ~50-150ms (speech-to-speech) |
| NAT traversal | N/A (WebSocket) | Full ICE with STUN/TURN |
| Use case | Custom STT/TTS stack, pipeline control | Low-latency speech-to-speech AI |
Capabilities¶
FastRTCVoiceBackend declares VoiceCapability.NONE — all intelligence is delegated to the AudioPipeline.
FastRTCRealtimeTransport is a passthrough — the provider handles speech detection, so no pipeline capabilities are needed.
Pipeline integration¶
The FastRTCVoiceBackend integrates with the full AudioPipeline. You can add any combination of pipeline stages:
```python
from roomkit.voice.pipeline import AudioPipelineConfig, WavFileRecorder, RecordingConfig
from roomkit.voice.pipeline.aec.webrtc import WebRTCAECProvider
from roomkit.voice.pipeline.denoiser.rnnoise import RNNoiseDenoiserProvider
from roomkit.voice.pipeline.vad.sherpa_onnx import SherpaOnnxVADProvider, SherpaOnnxVADConfig

pipeline = AudioPipelineConfig(
    vad=SherpaOnnxVADProvider(SherpaOnnxVADConfig(model="path/to/model.onnx")),
    aec=WebRTCAECProvider(sample_rate=16000),
    denoiser=RNNoiseDenoiserProvider(sample_rate=16000),
    recorder=WavFileRecorder(),
    recording_config=RecordingConfig(storage="./recordings"),
)

voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)
```
The pipeline processes inbound audio in the order shown in the architecture diagram above: Resampler → AEC → Denoiser → VAD.
Examples¶
- `examples/voice_fastrtc.py` — FastRTCVoiceBackend with Deepgram STT + Claude + ElevenLabs TTS (includes inline browser client)
- `examples/realtime_voice_fastrtc.py` — FastRTCRealtimeTransport with Gemini Live
- `examples/fastrtc_client.html` — standalone browser client supporting both modes
API Reference¶
FastRTCVoiceBackend ¶

`FastRTCVoiceBackend(*, input_sample_rate=48000, output_sample_rate=24000, audio_format='mulaw', audio_queue_maxsize=DEFAULT_QUEUE_MAXSIZE)`

Bases: `VoiceBackend`

VoiceBackend implementation using FastRTC for WebRTC and WebSocket transport.

Supports all three FastRTC transport modes (WebRTC, WebSocket, Telephone). Delivers raw audio frames via on_audio_received callback. All audio intelligence (VAD, denoising, diarization) is handled by the AudioPipeline.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_sample_rate` | `int` | Expected inbound sample rate. | `48000` |
| `output_sample_rate` | `int` | Target outbound sample rate. | `24000` |
| `audio_format` | `str` | WebSocket audio encoding — `"mulaw"` (mu-law in JSON, Twilio-compatible) or `"pcm"` (raw PCM-16 LE binary frames). Has no effect on WebRTC transport (always PCM via RTP). | `'mulaw'` |
| `audio_queue_maxsize` | `int` | Max pending frames per session emit queue. | `DEFAULT_QUEUE_MAXSIZE` |

on_client_disconnected ¶
Register callback for client disconnection.
send_audio_sync ¶

Synchronously send a single audio chunk.

Thread-safe — can be called from any thread (e.g. SIP audio callback threads). Uses `call_soon_threadsafe` to schedule operations on the event loop.

- WebRTC: puts a numpy frame into the session's emit queue (consumed by the handler's `emit()` → FastRTC RTP track).
- WebSocket (mulaw): mu-law + base64 in the calling thread, then schedules `send_json` on the event loop.
- WebSocket (pcm): schedules `send_bytes` on the event loop.
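A hypothetical call site, e.g. bridging frames in from a SIP stack's audio thread. The exact signature is not reproduced on this page; `(session, audio_bytes)` is assumed by analogy with the transport's `send_audio` documented below:

```python
def on_sip_frame(frame: bytes) -> None:
    # Runs on the SIP library's audio thread, not the event loop;
    # send_audio_sync marshals the send via call_soon_threadsafe.
    backend.send_audio_sync(session, frame)
```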
FastRTCRealtimeTransport ¶

Bases: `VoiceBackend`

WebRTC-based realtime audio transport using FastRTC.

Uses FastRTC in passthrough mode (no VAD) for speech-to-speech AI providers. Audio flows bidirectionally between the browser (via WebRTC) and the provider, which handles its own server-side VAD.

Connection flow:

1. `mount_fastrtc_realtime(app, transport)` creates a Stream
2. Browser connects via WebRTC → FastRTC calls `handler.copy()` → `start_up()`
3. `start_up()` reads webrtc_id, registers the handler with the transport
4. Transport fires the `on_client_connected` callback (if set)
5. App calls `channel.start_session(room_id, participant_id, connection=webrtc_id)`
6. `start_session()` → `transport.accept(session, webrtc_id)` maps the session to the handler
7. `receive()`/`emit()` flow audio with session context
accept async ¶

Accept a WebRTC connection for the given session.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The realtime session to bind. | required |
| `connection` | `Any` | The webrtc_id string identifying the WebRTC connection. | required |

send_audio async ¶

Send audio data to the connected WebRTC client.

Sends directly on the WebSocket, bypassing FastRTC's emit queue for minimal latency.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to send audio to. | required |
| `audio` | `bytes \| AsyncIterator[AudioChunk]` | Raw PCM16 LE audio bytes or an async iterator of AudioChunks. | required |

send_message async ¶

Send a JSON message via the WebRTC DataChannel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to send the message to. | required |
| `message` | `dict[str, Any]` | JSON-serializable message dict. | required |

disconnect async ¶

Disconnect the client for the given session.

Removes all mappings and sends a None sentinel to the handler's audio queue to signal the end of the stream.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to disconnect. | required |

on_audio_received ¶

Register callback for audio received from the client.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `AudioReceivedCallback` | Called with (session, audio_bytes). | required |

on_client_disconnected ¶

Register callback for client disconnection.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `TransportDisconnectCallback` | Called with (session) when the client disconnects. | required |

on_client_connected ¶

Register callback fired when a new WebRTC client connects. Called with (webrtc_id: str). Use to auto-create sessions.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `Callable[[str], Any]` | Called with the webrtc_id when a client connects. | required |
See Voice Channel API and Realtime Voice API for channel-level documentation.