Voice Providers¶
Voice Backend¶
VoiceBackend ¶
Bases: ABC
Abstract base class for voice transport backends.
VoiceBackend handles the transport layer for real-time audio:

- Managing voice session connections
- Streaming audio to/from clients
- Delivering raw inbound audio frames via `on_audio_received`
The backend is framework-agnostic and a pure transport — all audio intelligence (VAD, denoising, diarization) is handled by the AudioPipeline.
Example usage:

```python
backend = WebRTCVoiceBackend()

# Register raw audio callback
backend.on_audio_received(handle_audio_frame)

# Connect a participant
session = await backend.connect("room-1", "user-1", "voice-channel")

# Stream audio to the client
await backend.send_audio(session, audio_chunks)

# Disconnect
await backend.disconnect(session)
```
capabilities
property
¶
Declare supported capabilities.
Override to enable features like interruption, barge-in, etc. By default, no optional capabilities are supported.
Returns:

| Type | Description |
|---|---|
| `VoiceCapability` | Flags indicating supported capabilities. |
feeds_aec_reference
property
¶
Whether this backend feeds AEC reference at the transport level.
When True, the pipeline skips aec.feed_reference() in the
outbound path to avoid double-feeding. Transport-level feeding
(from the speaker callback) is preferred because it is
time-aligned with actual speaker output.
supports_playback_callback
property
¶
Whether this backend fires `on_audio_played` callbacks.
When True, the pipeline can rely on playback-time AEC reference
instead of generation-time feeding from process_outbound.
connect
async
¶
Create a new voice session for a participant.
Backends that initiate connections (VoiceChannel path) override this. Backends that receive external connections (realtime transport path) override `accept` instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `room_id` | `str` | The room to join. | *required* |
| `participant_id` | `str` | The participant's ID. | *required* |
| `channel_id` | `str` | The voice channel ID. | *required* |
| `metadata` | `dict[str, Any] \| None` | Optional session metadata. | `None` |

Returns:

| Type | Description |
|---|---|
| `VoiceSession` | A VoiceSession representing the connection. |
disconnect
abstractmethod
async
¶
End a voice session.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to disconnect. | *required* |
send_audio
abstractmethod
async
¶
Send audio to a voice session.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The target session. | *required* |
| `audio` | `bytes \| AsyncIterator[AudioChunk]` | Raw audio bytes or an async iterator of AudioChunks for streaming. | *required* |
get_session ¶
Get a session by ID.
Override for backends that track sessions internally.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session_id` | `str` | The session ID to look up. | *required* |

Returns:

| Type | Description |
|---|---|
| `VoiceSession \| None` | The VoiceSession if found, None otherwise. |
list_sessions ¶
List all active sessions in a room.
Override for backends that track sessions internally.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `room_id` | `str` | The room to list sessions for. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[VoiceSession]` | List of active VoiceSessions in the room. |
on_audio_received ¶
Register a callback for raw inbound audio frames.
The pipeline or channel calls this to receive every audio frame produced by the transport.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `AudioReceivedCallback` | Function called with (session, audio_frame). | *required* |
on_session_ready ¶
Register callback for when a session's audio path becomes live.
Fired when the transport is ready to send/receive audio for a
session (e.g. WebSocket connected, RTP socket active, mic stream
started). The VoiceChannel uses this together with
bind_session to implement the dual-signal
ON_SESSION_STARTED hook.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `SessionReadyCallback` | Function called with | *required* |
on_barge_in ¶
Register callback for barge-in detection.
Only called if capabilities includes BARGE_IN. Backends should call this when the user starts speaking while audio is being played (TTS interruption).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `BargeInCallback` | Function called with (session). | *required* |
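A barge-in handler typically checks `is_playing` before cancelling. The sketch below wires the documented `on_barge_in` / `is_playing` / `cancel_audio` surface together using a minimal in-memory stand-in for a backend — `FakeBackend` is illustrative, not part of roomkit.

```python
import asyncio

class FakeBackend:
    """Minimal stand-in exposing the documented barge-in surface."""
    def __init__(self):
        self._playing = True
        self._barge_in_cb = None

    def on_barge_in(self, callback):
        self._barge_in_cb = callback

    def is_playing(self, session) -> bool:
        return self._playing

    async def cancel_audio(self, session) -> bool:
        was_playing = self._playing
        self._playing = False
        return was_playing

backend = FakeBackend()
cancelled = []

async def handle_barge_in(session):
    # Only interrupt if TTS audio is actually going out.
    if backend.is_playing(session):
        cancelled.append(await backend.cancel_audio(session))

backend.on_barge_in(handle_barge_in)
# Simulate the backend detecting speech during playback.
asyncio.run(backend._barge_in_cb("session-1"))
print(cancelled)  # [True]
```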
cancel_audio
async
¶
Cancel ongoing audio playback for a session.
Delegates to `interrupt` and returns True. Subclasses may override for more nuanced behaviour.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to cancel audio for. | *required* |

Returns:

| Type | Description |
|---|---|
| `bool` | True if audio was cancelled, False if nothing was playing. |
accept
async
¶
Bind an external connection to a session.
Backends that receive connections from external sources (e.g. WebSocket, WebRTC, SIP) override this. Backends that create their own connections override `connect` instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The voice session to bind. | *required* |
| `connection` | `Any` | Protocol-specific connection object. | *required* |
set_input_muted ¶
Mute or unmute the input (microphone) for a session.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to mute/unmute. | *required* |
| `muted` | `bool` | True to mute, False to unmute. | *required* |
set_input_gated ¶
Gate or un-gate audio input for primary speaker mode.
When gated, audio is not forwarded to provider callbacks but may still be fed to a pipeline for diarization analysis.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to gate/un-gate. | *required* |
| `gated` | `bool` | True to gate, False to un-gate. | *required* |
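The gating semantics described above — gated frames still reach the pipeline for analysis but are not forwarded to provider callbacks — can be sketched as a simple routing switch. The class below is an illustrative stand-in, not roomkit's implementation.

```python
class GatedInput:
    """Sketch of the documented gating behaviour: gated audio is still
    fed to the pipeline (e.g. for diarization) but withheld from the
    provider callbacks."""
    def __init__(self):
        self.gated = False
        self.to_provider: list[bytes] = []
        self.to_pipeline: list[bytes] = []

    def set_input_gated(self, gated: bool) -> None:
        self.gated = gated

    def feed(self, frame: bytes) -> None:
        self.to_pipeline.append(frame)      # always analysed
        if not self.gated:
            self.to_provider.append(frame)  # forwarded only when open

gate = GatedInput()
gate.feed(b"a")
gate.set_input_gated(True)
gate.feed(b"b")
print(gate.to_provider, gate.to_pipeline)  # [b'a'] [b'a', b'b']
```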
on_client_disconnected ¶
Register callback for client disconnection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `TransportDisconnectCallback` | Called with (session) when the client disconnects. | *required* |
on_speaker_change ¶
Register callback for speaker change events.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `SpeakerChangeCallback` | Called with (session, diarization_result). | *required* |
is_playing ¶
Check if audio is currently being sent to the session.
Used for barge-in detection to know if interruption is possible.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The session to check. | *required* |

Returns:

| Type | Description |
|---|---|
| `bool` | True if audio is currently playing, False otherwise. |
on_audio_played ¶
Register a callback for audio frames as they are played.
Called with each audio frame at the moment it is output by the speaker, providing time-aligned reference for echo cancellation. The pipeline uses this to feed AEC reference at the correct time.
Note
Callbacks may be invoked from the audio I/O thread — implementations must be thread-safe.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `callback` | `AudioPlayedCallback` | Function called with (session, audio_frame). | *required* |
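Because these callbacks may fire on the audio I/O thread, a common pattern is to hand frames off through a thread-safe queue rather than touch shared state directly. A minimal sketch (the callback name and frame handling are illustrative):

```python
import queue
import threading

# Frames played by the speaker, handed over for AEC reference feeding.
aec_reference: "queue.Queue[bytes]" = queue.Queue()

def on_played(session, frame: bytes) -> None:
    # queue.Queue.put is thread-safe, so this is safe to call from
    # the audio I/O thread.
    aec_reference.put(frame)

# Simulate the audio thread delivering two frames.
t = threading.Thread(target=lambda: (on_played("s", b"f1"), on_played("s", b"f2")))
t.start()
t.join()

frames = [aec_reference.get_nowait() for _ in range(2)]
print(frames)  # [b'f1', b'f2']
```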
set_trace_emitter ¶
Set a callback for emitting protocol traces.
Called by the owning channel when trace observers are registered. Implementations should store the emitter and call it at key protocol points (e.g. INVITE, BYE for SIP).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `emitter` | `Callable[..., Any] \| None` | The channel's trace emitter callable. | *required* |
send_transcription
async
¶
Send transcription text to the client for UI display.
Optional method for backends that support sending text updates.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `session` | `VoiceSession` | The voice session to send to. | *required* |
| `text` | `str` | The transcribed or response text. | *required* |
| `role` | `str` | Either "user" (transcription) or "assistant" (AI response). | `'user'` |
VoiceCapability ¶
Bases: Flag
Capabilities a VoiceBackend can support.
Backends declare their capabilities via the capabilities property.
This allows RoomKit to know which features are available and
enables integrators to choose backends based on their needs.
Example
```python
class MyBackend(VoiceBackend):
    @property
    def capabilities(self) -> VoiceCapability:
        return (
            VoiceCapability.INTERRUPTION
            | VoiceCapability.BARGE_IN
        )
```
INTERRUPTION
class-attribute
instance-attribute
¶
Backend can cancel ongoing audio playback (cancel_audio).
BARGE_IN
class-attribute
instance-attribute
¶
Backend detects and handles barge-in (user interrupts TTS).
NATIVE_AEC
class-attribute
instance-attribute
¶
Backend provides its own Acoustic Echo Cancellation.
NATIVE_AGC
class-attribute
instance-attribute
¶
Backend provides its own Automatic Gain Control.
DTMF_INBAND
class-attribute
instance-attribute
¶
Backend can detect DTMF tones from the audio stream.
DTMF_SIGNALING
class-attribute
instance-attribute
¶
Backend receives DTMF via out-of-band signaling (e.g. SIP INFO).
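Since VoiceCapability is a `Flag`, combined capabilities can be tested with `in`. The enum below is a local stand-in mirroring the documented members (import the real one from roomkit in actual code):

```python
from enum import Flag, auto

# Stand-in mirroring the documented VoiceCapability members.
class VoiceCapability(Flag):
    NONE = 0
    INTERRUPTION = auto()
    BARGE_IN = auto()
    NATIVE_AEC = auto()
    NATIVE_AGC = auto()
    DTMF_INBAND = auto()
    DTMF_SIGNALING = auto()

caps = VoiceCapability.INTERRUPTION | VoiceCapability.BARGE_IN

# Flag membership reads naturally with `in`.
print(VoiceCapability.BARGE_IN in caps)    # True
print(VoiceCapability.NATIVE_AEC in caps)  # False
```

This is how an integrator might decide, for example, whether barge-in handling can be enabled for a given backend.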
VoiceSession
dataclass
¶
VoiceSession(id, room_id, participant_id, channel_id, state=CONNECTING, provider_session_id=None, created_at=_utcnow(), metadata=dict())
Active voice connection for a participant.
VoiceSessionState ¶
Bases: StrEnum
State of a voice session.
AudioChunk
dataclass
¶
AudioChunk(data, sample_rate=16000, channels=1, format='pcm_s16le', timestamp_ms=None, is_final=False)
A chunk of audio data for streaming (used for outbound TTS).
TranscriptionResult
dataclass
¶
TranscriptionResult(text, is_final=True, confidence=None, language=None, words=list(), is_speech_start=False)
Result from speech-to-text transcription.
is_speech_start
class-attribute
instance-attribute
¶
Set by providers with server-side VAD to signal speech detected.
STT (Speech-to-Text)¶
STTProvider ¶
Bases: ABC
Speech-to-text provider.
supports_streaming
property
¶
Whether this provider supports streaming transcription.
transcribe
abstractmethod
async
¶
Transcribe complete audio to text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `audio` | `AudioContent \| AudioChunk \| AudioFrame` | Audio content (URL), raw audio chunk, or audio frame. | *required* |

Returns:

| Type | Description |
|---|---|
| `TranscriptionResult` | TranscriptionResult with text and metadata. |
transcribe_stream
async
¶
Stream transcription with partial results.
Override for providers that support streaming. Default: buffers all audio and returns a single result.
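The buffering default can be sketched as follows; the function names here are illustrative stand-ins for the provider's methods, not roomkit's actual implementation:

```python
import asyncio

async def transcribe_stream(audio_stream, transcribe):
    """Fallback streaming behaviour: buffer everything, transcribe once,
    yield a single final result."""
    buffered = bytearray()
    async for chunk in audio_stream:
        buffered.extend(chunk)
    yield await transcribe(bytes(buffered))

# Illustrative batch transcriber standing in for STTProvider.transcribe.
async def fake_transcribe(audio: bytes) -> str:
    return f"{len(audio)} bytes transcribed"

async def chunks():
    for piece in (b"\x00" * 320, b"\x00" * 320):
        yield piece

async def main():
    return [r async for r in transcribe_stream(chunks(), fake_transcribe)]

print(asyncio.run(main()))  # ['640 bytes transcribed']
```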
MockSTTProvider ¶
Sherpa-ONNX STT¶
SherpaOnnxSTTProvider ¶
Bases: STTProvider
Speech-to-text provider using sherpa-onnx.
Supports transducer models for both streaming and batch recognition, and Whisper models for batch recognition only.
transcribe
async
¶
Transcribe complete audio.
For transducer mode, uses the OnlineRecognizer (feeds all audio then reads the result). For whisper mode, uses the OfflineRecognizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `audio` | `AudioContent \| AudioChunk \| AudioFrame` | Audio content or raw audio chunk (PCM S16LE expected). | *required* |

Returns:

| Type | Description |
|---|---|
| `TranscriptionResult` | TranscriptionResult with text. |
transcribe_stream
async
¶
Stream transcription with partial results using OnlineRecognizer.
Only supported for transducer mode. Whisper mode raises ValueError.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `audio_stream` | `AsyncIterator[AudioChunk]` | Async iterator of audio chunks. | *required* |

Yields:

| Type | Description |
|---|---|
| `AsyncIterator[TranscriptionResult]` | TranscriptionResult with partial and final transcripts. |
SherpaOnnxSTTConfig
dataclass
¶
SherpaOnnxSTTConfig(mode='transducer', tokens='', encoder='', decoder='', joiner='', model_type='', language='en', task='transcribe', sample_rate=16000, num_threads=2, provider='cpu', enable_endpoint_detection=True, rule1_min_trailing_silence=2.4, rule2_min_trailing_silence=1.2, rule3_min_utterance_length=20.0)
Configuration for the sherpa-onnx STT provider.
Attributes:

| Name | Type | Description |
|---|---|---|
| `mode` | `str` | Recognition mode — `transducer` or `whisper`. |
| `tokens` | `str` | Path to `tokens.txt`. |
| `encoder` | `str` | Path to the encoder ONNX model. |
| `decoder` | `str` | Path to the decoder ONNX model. |
| `joiner` | `str` | Path to the joiner ONNX model (transducer mode). |
| `model_type` | `str` | Model type hint for sherpa-onnx. |
| `language` | `str` | Language code (Whisper only). |
| `task` | `str` | Whisper task (default `transcribe`). |
| `sample_rate` | `int` | Expected audio sample rate. |
| `num_threads` | `int` | Number of CPU threads for inference. |
| `provider` | `str` | ONNX execution provider (default `cpu`). |
| `enable_endpoint_detection` | `bool` | Enable sherpa-onnx endpoint detection. Enabled by default. When VAD drives the stream lifecycle the VAD fires first (its silence threshold is shorter), so this is harmless in a pipeline and useful for standalone use. |
| `rule1_min_trailing_silence` | `float` | Endpoint rule 1 — minimum trailing silence (seconds) after speech to trigger an endpoint. |
| `rule2_min_trailing_silence` | `float` | Endpoint rule 2 — minimum trailing silence (seconds) after speech with decoded text. |
| `rule3_min_utterance_length` | `float` | Endpoint rule 3 — minimum utterance length (seconds) to trigger an endpoint regardless of silence. |
Usage¶
```python
from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTProvider, SherpaOnnxSTTConfig

# Transducer model (streaming + batch)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="transducer",
    tokens="path/to/tokens.txt",
    encoder="path/to/encoder.onnx",
    decoder="path/to/decoder.onnx",
    joiner="path/to/joiner.onnx",
))

# Whisper model (batch only)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="whisper",
    tokens="path/to/tokens.txt",
    encoder="path/to/encoder.onnx",
    decoder="path/to/decoder.onnx",
    language="en",
))
```

Install with: `pip install roomkit[sherpa-onnx]`
TTS (Text-to-Speech)¶
TTSProvider ¶
Bases: ABC
Text-to-speech provider.
supports_streaming_input
property
¶
Whether this TTS accepts streaming text input.
synthesize
abstractmethod
async
¶
Synthesize text to audio.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Text to synthesize. | *required* |
| `voice` | `str \| None` | Voice ID (uses default_voice if not specified). | `None` |

Returns:

| Type | Description |
|---|---|
| `AudioContent` | AudioContent with URL to generated audio. |
synthesize_stream_input
async
¶
Stream audio from streaming text chunks.
Override for providers that accept an async text stream as input.
synthesize_stream
async
¶
Stream audio chunks as they're generated.
Override for providers that support streaming. Default: synthesizes the full audio and yields a single chunk.
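The fallback can be sketched as a one-chunk async generator; `synthesize` below is an illustrative coroutine standing in for the provider's batch method, not roomkit's actual code:

```python
import asyncio

# Illustrative batch synthesis standing in for TTSProvider.synthesize.
async def synthesize(text: str) -> bytes:
    return text.encode()  # pretend this is PCM audio

async def synthesize_stream(text: str):
    """Fallback streaming behaviour for non-streaming providers:
    synthesize the full utterance, then yield it as one chunk."""
    audio = await synthesize(text)
    yield audio

async def main():
    return [c async for c in synthesize_stream("hello")]

print(asyncio.run(main()))  # [b'hello']
```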
MockTTSProvider ¶
Sherpa-ONNX TTS¶
SherpaOnnxTTSProvider ¶
Bases: TTSProvider
Text-to-speech provider using sherpa-onnx with VITS/Piper models.
synthesize
async
¶
Synthesize text to audio.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Text to synthesize. | *required* |
| `voice` | `str \| None` | Speaker ID as string (uses default if not specified). | `None` |

Returns:

| Type | Description |
|---|---|
| `AudioContent` | AudioContent with the generated audio. |
synthesize_stream
async
¶
Stream audio chunks using a callback bridge.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Text to synthesize. | *required* |
| `voice` | `str \| None` | Speaker ID as string (uses default if not specified). | `None` |

Yields:

| Type | Description |
|---|---|
| `AsyncIterator[AudioChunk]` | AudioChunk with PCM S16LE audio data. |
SherpaOnnxTTSConfig
dataclass
¶
SherpaOnnxTTSConfig(model='', tokens='', data_dir='', lexicon='', speaker_id=0, speed=1.0, sample_rate=22050, num_threads=2, provider='cpu')
Configuration for the sherpa-onnx TTS provider.
Attributes:

| Name | Type | Description |
|---|---|---|
| `model` | `str` | Path to the VITS/Piper ONNX model. |
| `tokens` | `str` | Path to `tokens.txt`. |
| `data_dir` | `str` | Path to the espeak-ng data directory (Piper models). |
| `lexicon` | `str` | Path to an optional lexicon file. |
| `speaker_id` | `int` | Speaker ID for multi-speaker models. |
| `speed` | `float` | Speech speed multiplier (1.0 = normal). |
| `sample_rate` | `int` | Output sample rate (usually determined by the model). |
| `num_threads` | `int` | Number of CPU threads for inference. |
| `provider` | `str` | ONNX execution provider (default `cpu`). |
Usage¶
```python
from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSProvider, SherpaOnnxTTSConfig

# VITS/Piper model with multi-speaker support
tts = SherpaOnnxTTSProvider(SherpaOnnxTTSConfig(
    model="path/to/vits-model.onnx",
    tokens="path/to/tokens.txt",
    data_dir="path/to/espeak-ng-data",  # for Piper models
    speaker_id=0,
    speed=1.0,
))
```

Install with: `pip install roomkit[sherpa-onnx]`
RTP Backend¶
RTPVoiceBackend ¶
RTPVoiceBackend(*, local_addr=('0.0.0.0', 0), remote_addr=None, payload_type=0, clock_rate=8000, dtmf_payload_type=101, rtcp_interval=5.0)
Bases: VoiceBackend
VoiceBackend that sends and receives audio over RTP.
Each `connect` call creates a new `aiortp.RTPSession` bound to the configured local address and sending to the remote address. Inbound audio is decoded by aiortp (G.711, L16, Opus) and delivered as AudioFrame objects via the `on_audio_received` callback. DTMF digits are received out-of-band via RFC 4733 and delivered via `on_dtmf_received` callbacks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `local_addr` | `tuple[str, int]` | Local (host, port) to bind. | `('0.0.0.0', 0)` |
| `remote_addr` | `tuple[str, int] \| None` | Remote (host, port) to send to. | `None` |
| `payload_type` | `int` | RTP payload type number. | `0` |
| `clock_rate` | `int` | Clock rate in Hz. | `8000` |
| `dtmf_payload_type` | `int` | RTP payload type for RFC 4733 DTMF events. | `101` |
| `rtcp_interval` | `float` | Seconds between RTCP sender reports. | `5.0` |
Install with: `pip install roomkit[rtp]`
Mock Voice Backend¶
MockVoiceBackend ¶
MockVoiceBackend(*, capabilities=NONE)
Bases: VoiceBackend
Mock voice backend for testing.
Tracks all method calls and provides helpers to simulate events. The backend is a pure transport — no VAD or audio intelligence.
Example:

```python
backend = MockVoiceBackend()

# Track calls
session = await backend.connect("room-1", "user-1", "voice-1")
assert backend.calls[-1].method == "connect"

# Simulate raw audio received
frame = AudioFrame(data=b"audio-data")
await backend.simulate_audio_received(session, frame)

# Simulate barge-in
await backend.simulate_barge_in(session)
```
simulate_audio_received
async
¶
Simulate the backend receiving a raw audio frame.
Fires all registered on_audio_received callbacks.
simulate_barge_in
async
¶
Simulate user speaking while TTS is playing (barge-in).
Fires all registered on_barge_in callbacks.
simulate_session_ready
async
¶
Simulate the backend signalling that a session's audio path is live.
Fires all registered on_session_ready callbacks.
simulate_client_disconnected
async
¶
Simulate the backend signalling that a client has disconnected.
Fires all registered on_client_disconnected callbacks.
MockVoiceCall
dataclass
¶
Record of a call made to MockVoiceBackend.
Voice Events¶
BargeInEvent
dataclass
¶
User started speaking while TTS was playing.
This event is fired when the VAD detects speech starting while audio is being sent to the user. This allows the system to:

- Cancel the current TTS playback
- Adjust response strategy (e.g., acknowledge interruption)
- Track conversation dynamics
TTSCancelledEvent
dataclass
¶
TTS playback was cancelled.
This event is fired when TTS synthesis or playback is stopped before completion. Reasons include:

- `barge_in`: User started speaking
- `explicit`: Application called interrupt()
- `disconnect`: Session ended
- `error`: TTS or playback error
timestamp
class-attribute
instance-attribute
¶
When the cancellation occurred.
PartialTranscriptionEvent
dataclass
¶
Interim transcription result during speech.
This event is fired by backends that support streaming STT, providing real-time transcription updates before the final result. Use cases include:

- Live captions/subtitles
- Early intent detection
- Visual feedback during speech
timestamp
class-attribute
instance-attribute
¶
When this transcription was received.
VADSilenceEvent
dataclass
¶
Silence detected after speech.
This event is fired when the VAD detects a period of silence following speech. It can be used for:

- Early end-of-utterance detection (before full speech_end)
- Adaptive silence thresholds
- Turn-taking management
timestamp
class-attribute
instance-attribute
¶
When the silence was detected.
VADAudioLevelEvent
dataclass
¶
Periodic audio level update for UI feedback.
This event is fired periodically (typically 10Hz) to provide audio level information for UI visualization. Use cases include:

- Audio level meters
- Speaking indicators
- Noise detection
timestamp
class-attribute
instance-attribute
¶
When this measurement was taken.
Callback Types¶
| Callback | Signature |
|---|---|
| `SpeechStartCallback` | `(VoiceSession) -> Any` |
| `SpeechEndCallback` | `(VoiceSession, bytes) -> Any` |
| `PartialTranscriptionCallback` | `(VoiceSession, str, float, bool) -> Any` |
| `VADSilenceCallback` | `(VoiceSession, int) -> Any` |
| `VADAudioLevelCallback` | `(VoiceSession, float, bool) -> Any` |
| `BargeInCallback` | `(VoiceSession) -> Any` |
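Plain functions (or coroutines) matching these signatures can be registered directly. A minimal sketch, with the VoiceSession parameter stubbed as a plain string for illustration:

```python
from typing import Any

transcripts: list[tuple[str, float, bool]] = []

def on_partial(session: Any, text: str, confidence: float, is_final: bool) -> Any:
    # PartialTranscriptionCallback: (VoiceSession, str, float, bool) -> Any
    transcripts.append((text, confidence, is_final))

def on_speech_end(session: Any, audio: bytes) -> Any:
    # SpeechEndCallback: (VoiceSession, bytes) -> Any
    return len(audio)

on_partial("session-1", "hello", 0.9, False)
on_partial("session-1", "hello world", 0.95, True)
print(transcripts[-1])  # ('hello world', 0.95, True)
print(on_speech_end("session-1", b"\x00" * 160))  # 160
```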