
Voice Providers

Voice Backend

VoiceBackend

Bases: ABC

Abstract base class for voice transport backends.

VoiceBackend handles the transport layer for real-time audio:

- Managing voice session connections
- Streaming audio to/from clients
- Delivering raw inbound audio frames via on_audio_received

The backend is framework-agnostic and a pure transport — all audio intelligence (VAD, denoising, diarization) is handled by the AudioPipeline.

Example usage

backend = WebRTCVoiceBackend()

# Register raw audio callback
backend.on_audio_received(handle_audio_frame)

# Connect a participant
session = await backend.connect("room-1", "user-1", "voice-channel")

# Stream audio to the client
await backend.send_audio(session, audio_chunks)

# Disconnect
await backend.disconnect(session)

name abstractmethod property

name

Backend name (e.g., 'webrtc', 'websocket', 'livekit').

capabilities property

capabilities

Declare supported capabilities.

Override to enable features like interruption, barge-in, etc. By default, no optional capabilities are supported.

Returns:

Type Description
VoiceCapability

Flags indicating supported capabilities.

feeds_aec_reference property

feeds_aec_reference

Whether this backend feeds AEC reference at the transport level.

When True, the pipeline skips aec.feed_reference() in the outbound path to avoid double-feeding. Transport-level feeding (from the speaker callback) is preferred because it is time-aligned with actual speaker output.
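The skip logic this describes can be sketched with stubs (StubAEC, StubBackend, and send_outbound are illustrative names, not roomkit API):

```python
import asyncio


class StubAEC:
    """Records fed reference audio (illustrative)."""

    def __init__(self) -> None:
        self.fed: list[bytes] = []

    def feed_reference(self, pcm: bytes) -> None:
        self.fed.append(pcm)


class StubBackend:
    feeds_aec_reference = True  # transport feeds the reference itself

    async def send_audio(self, session, audio: bytes) -> None:
        pass


class LegacyBackend(StubBackend):
    feeds_aec_reference = False  # pipeline must feed at generation time


# Hypothetical outbound step: only feed a generation-time AEC reference
# when the transport does not feed it at playback time.
async def send_outbound(backend, aec, session, pcm: bytes) -> None:
    if not backend.feeds_aec_reference:
        aec.feed_reference(pcm)
    await backend.send_audio(session, pcm)


aec = StubAEC()
asyncio.run(send_outbound(StubBackend(), aec, None, b"\x00\x00"))
assert aec.fed == []  # skipped: transport feeds the AEC reference
asyncio.run(send_outbound(LegacyBackend(), aec, None, b"\x00\x00"))
assert aec.fed == [b"\x00\x00"]  # fed at generation time
```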

supports_playback_callback property

supports_playback_callback

Whether this backend fires on_audio_played callbacks.

When True, the pipeline can rely on playback-time AEC reference instead of generation-time feeding from process_outbound.

connect async

connect(room_id, participant_id, channel_id, *, metadata=None)

Create a new voice session for a participant.

Backends that initiate connections (VoiceChannel path) override this. Backends that receive external connections (realtime transport path) override accept instead.

Parameters:

Name Type Description Default
room_id str

The room to join.

required
participant_id str

The participant's ID.

required
channel_id str

The voice channel ID.

required
metadata dict[str, Any] | None

Optional session metadata.

None

Returns:

Type Description
VoiceSession

A VoiceSession representing the connection.

disconnect abstractmethod async

disconnect(session)

End a voice session.

Parameters:

Name Type Description Default
session VoiceSession

The session to disconnect.

required

send_audio abstractmethod async

send_audio(session, audio)

Send audio to a voice session.

Parameters:

Name Type Description Default
session VoiceSession

The target session.

required
audio bytes | AsyncIterator[AudioChunk]

Raw audio bytes or an async iterator of AudioChunks for streaming.

required
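The streaming form accepts any async iterator of chunks. A self-contained sketch with stand-in classes (CollectingBackend and tts_chunks are illustrative, not roomkit API):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioChunk:  # minimal stand-in for roomkit's AudioChunk
    data: bytes
    sample_rate: int = 16000
    is_final: bool = False


class CollectingBackend:
    """Stub transport that records streamed chunks (illustrative)."""

    def __init__(self) -> None:
        self.sent: list[bytes] = []

    async def send_audio(self, session, audio) -> None:
        async for chunk in audio:
            self.sent.append(chunk.data)


async def tts_chunks():
    # An async generator satisfies AsyncIterator[AudioChunk].
    yield AudioChunk(data=b"\x00\x01")
    yield AudioChunk(data=b"\x02\x03", is_final=True)


backend = CollectingBackend()
asyncio.run(backend.send_audio(None, tts_chunks()))
assert backend.sent == [b"\x00\x01", b"\x02\x03"]
```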

get_session

get_session(session_id)

Get a session by ID.

Override for backends that track sessions internally.

Parameters:

Name Type Description Default
session_id str

The session ID to look up.

required

Returns:

Type Description
VoiceSession | None

The VoiceSession if found, None otherwise.

list_sessions

list_sessions(room_id)

List all active sessions in a room.

Override for backends that track sessions internally.

Parameters:

Name Type Description Default
room_id str

The room to list sessions for.

required

Returns:

Type Description
list[VoiceSession]

List of active VoiceSessions in the room.

close async

close()

Release backend resources.

Override in subclasses that need cleanup.

on_audio_received

on_audio_received(callback)

Register a callback for raw inbound audio frames.

The pipeline or channel calls this to receive every audio frame produced by the transport.

Parameters:

Name Type Description Default
callback AudioReceivedCallback

Function called with (session, audio_frame).

required
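The on_* registration methods share a simple registry pattern: store every callback and fire them all when the transport event occurs. A self-contained sketch (not roomkit's actual implementation):

```python
import asyncio


class CallbackRegistry:
    """Minimal registry behind an on_* registration method (illustrative)."""

    def __init__(self) -> None:
        self._callbacks = []

    def register(self, callback) -> None:
        self._callbacks.append(callback)

    async def fire(self, *args) -> None:
        # Every registered callback sees every event.
        for callback in self._callbacks:
            await callback(*args)


received = []


async def handle_frame(session, frame):
    received.append((session, frame))


registry = CallbackRegistry()
registry.register(handle_frame)
asyncio.run(registry.fire("sess-1", b"frame"))
assert received == [("sess-1", b"frame")]
```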

on_session_ready

on_session_ready(callback)

Register callback for when a session's audio path becomes live.

Fired when the transport is ready to send/receive audio for a session (e.g. WebSocket connected, RTP socket active, mic stream started). The VoiceChannel uses this together with bind_session to implement the dual-signal ON_SESSION_STARTED hook.

Parameters:

Name Type Description Default
callback SessionReadyCallback

Function called with (session).

required

on_barge_in

on_barge_in(callback)

Register callback for barge-in detection.

Only called if capabilities includes BARGE_IN. Backends should invoke the callback when the user starts speaking while audio is being played (TTS interruption).

Parameters:

Name Type Description Default
callback BargeInCallback

Function called with (session).

required

cancel_audio async

cancel_audio(session)

Cancel ongoing audio playback for a session.

Delegates to interrupt and returns True. Subclasses may override for more nuanced behaviour.

Parameters:

Name Type Description Default
session VoiceSession

The session to cancel audio for.

required

Returns:

Type Description
bool

True if audio was cancelled, False if nothing was playing.

accept async

accept(session, connection)

Bind an external connection to a session.

Backends that receive connections from external sources (e.g. WebSocket, WebRTC, SIP) override this. Backends that create their own connections override connect instead.

Parameters:

Name Type Description Default
session VoiceSession

The voice session to bind.

required
connection Any

Protocol-specific connection object.

required

interrupt

interrupt(session)

Signal interruption — flush outbound queue, stop playback.

set_input_muted

set_input_muted(session, muted)

Mute or unmute the input (microphone) for a session.

Parameters:

Name Type Description Default
session VoiceSession

The session to mute/unmute.

required
muted bool

True to mute, False to unmute.

required

set_input_gated

set_input_gated(session, gated)

Gate or un-gate audio input for primary speaker mode.

When gated, audio is not forwarded to provider callbacks but may still be fed to a pipeline for diarization analysis.

Parameters:

Name Type Description Default
session VoiceSession

The session to gate/un-gate.

required
gated bool

True to gate, False to un-gate.

required

on_client_disconnected

on_client_disconnected(callback)

Register callback for client disconnection.

Parameters:

Name Type Description Default
callback TransportDisconnectCallback

Called with (session) when the client disconnects.

required

on_speaker_change

on_speaker_change(callback)

Register callback for speaker change events.

Parameters:

Name Type Description Default
callback SpeakerChangeCallback

Called with (session, diarization_result).

required

end_of_response

end_of_response(session)

Signal end of an AI response for outbound pacing.

is_playing

is_playing(session)

Check if audio is currently being sent to the session.

Used for barge-in detection to know if interruption is possible.

Parameters:

Name Type Description Default
session VoiceSession

The session to check.

required

Returns:

Type Description
bool

True if audio is currently playing, False otherwise.
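A barge-in handler typically checks is_playing before interrupting. A self-contained sketch with a stub backend (session IDs stand in for VoiceSession objects):

```python
import asyncio


class StubBackend:
    """Stub with is_playing / cancel_audio semantics (illustrative)."""

    def __init__(self) -> None:
        self._playing: set[str] = set()

    def start_playing(self, session_id: str) -> None:
        self._playing.add(session_id)

    def is_playing(self, session_id: str) -> bool:
        return session_id in self._playing

    async def cancel_audio(self, session_id: str) -> bool:
        if session_id in self._playing:
            self._playing.discard(session_id)
            return True
        return False  # nothing was playing


async def handle_barge_in(backend: StubBackend, session_id: str) -> bool:
    # Only interrupt if audio is actually being sent.
    if backend.is_playing(session_id):
        return await backend.cancel_audio(session_id)
    return False


backend = StubBackend()
backend.start_playing("sess-1")
assert asyncio.run(handle_barge_in(backend, "sess-1")) is True
assert asyncio.run(handle_barge_in(backend, "sess-1")) is False
```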

on_audio_played

on_audio_played(callback)

Register a callback for audio frames as they are played.

Called with each audio frame at the moment it is output by the speaker, providing time-aligned reference for echo cancellation. The pipeline uses this to feed AEC reference at the correct time.

Note

Callbacks may be invoked from the audio I/O thread — implementations must be thread-safe.

Parameters:

Name Type Description Default
callback AudioPlayedCallback

Function called with (session, audio_frame).

required
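One way to keep the callback thread-safe is to only enqueue from the I/O thread and feed AEC on a worker. A hypothetical sketch (the queue-based split is an assumption, not roomkit's implementation):

```python
import queue
import threading

# Frames handed off from the audio I/O thread to the AEC worker.
ref_frames: "queue.Queue[bytes]" = queue.Queue()


def on_played(session, frame: bytes) -> None:
    # Safe to call from the audio I/O thread: queue.Queue is thread-safe
    # and put() does no heavy work.
    ref_frames.put(frame)


fed: list[bytes] = []


def aec_worker() -> None:
    while True:
        frame = ref_frames.get()
        if frame is None:  # sentinel to stop the worker
            break
        fed.append(frame)  # stand-in for aec.feed_reference(frame)


worker = threading.Thread(target=aec_worker)
worker.start()
on_played("sess-1", b"\x00\x01")  # as if invoked by the I/O thread
ref_frames.put(None)
worker.join()
assert fed == [b"\x00\x01"]
```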

set_trace_emitter

set_trace_emitter(emitter)

Set a callback for emitting protocol traces.

Called by the owning channel when trace observers are registered. Implementations should store the emitter and call it at key protocol points (e.g. INVITE, BYE for SIP).

Parameters:

Name Type Description Default
emitter Callable[..., Any] | None

The channel's emit_trace method, or None to disable.

required

send_transcription async

send_transcription(session, text, role='user')

Send transcription text to the client for UI display.

Optional method for backends that support sending text updates.

Parameters:

Name Type Description Default
session VoiceSession

The voice session to send to.

required
text str

The transcribed or response text.

required
role str

Either "user" (transcription) or "assistant" (AI response).

'user'

VoiceCapability

Bases: Flag

Capabilities a VoiceBackend can support.

Backends declare their capabilities via the capabilities property. This allows RoomKit to know which features are available and enables integrators to choose backends based on their needs.

Example

class MyBackend(VoiceBackend):
    @property
    def capabilities(self) -> VoiceCapability:
        return (
            VoiceCapability.INTERRUPTION
            | VoiceCapability.BARGE_IN
        )

NONE class-attribute instance-attribute

NONE = 0

No optional capabilities (default).

INTERRUPTION class-attribute instance-attribute

INTERRUPTION = auto()

Backend can cancel ongoing audio playback (cancel_audio).

BARGE_IN class-attribute instance-attribute

BARGE_IN = auto()

Backend detects and handles barge-in (user interrupts TTS).

NATIVE_AEC class-attribute instance-attribute

NATIVE_AEC = auto()

Backend provides its own Acoustic Echo Cancellation.

NATIVE_AGC class-attribute instance-attribute

NATIVE_AGC = auto()

Backend provides its own Automatic Gain Control.

DTMF_INBAND class-attribute instance-attribute

DTMF_INBAND = auto()

Backend can detect DTMF tones from the audio stream.

DTMF_SIGNALING class-attribute instance-attribute

DTMF_SIGNALING = auto()

Backend receives DTMF via out-of-band signaling (e.g. SIP INFO).
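Because VoiceCapability is a Flag, capabilities combine with | and are tested with in. A self-contained sketch mirroring the members above:

```python
from enum import Flag, auto


# Stand-in mirroring the VoiceCapability flags above (illustrative).
class VoiceCapability(Flag):
    NONE = 0
    INTERRUPTION = auto()
    BARGE_IN = auto()
    NATIVE_AEC = auto()
    NATIVE_AGC = auto()
    DTMF_INBAND = auto()
    DTMF_SIGNALING = auto()


caps = VoiceCapability.INTERRUPTION | VoiceCapability.BARGE_IN
assert VoiceCapability.BARGE_IN in caps        # feature supported
assert VoiceCapability.NATIVE_AEC not in caps  # feature absent
```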

VoiceSession dataclass

VoiceSession(id, room_id, participant_id, channel_id, state=CONNECTING, provider_session_id=None, created_at=_utcnow(), metadata=dict())

Active voice connection for a participant.

VoiceSessionState

Bases: StrEnum

State of a voice session.

AudioChunk dataclass

AudioChunk(data, sample_rate=16000, channels=1, format='pcm_s16le', timestamp_ms=None, is_final=False)

A chunk of audio data for streaming (used for outbound TTS).

TranscriptionResult dataclass

TranscriptionResult(text, is_final=True, confidence=None, language=None, words=list(), is_speech_start=False)

Result from speech-to-text transcription.

is_speech_start class-attribute instance-attribute

is_speech_start = False

Set by providers with server-side VAD to signal speech detected.

STT (Speech-to-Text)

STTProvider

Bases: ABC

Speech-to-text provider.

name property

name

Provider name (e.g. 'whisper', 'deepgram').

supports_streaming property

supports_streaming

Whether this provider supports streaming transcription.

transcribe abstractmethod async

transcribe(audio)

Transcribe complete audio to text.

Parameters:

Name Type Description Default
audio AudioContent | AudioChunk | AudioFrame

Audio content (URL), raw audio chunk, or audio frame.

required

Returns:

Type Description
TranscriptionResult

TranscriptionResult with text and metadata.

transcribe_stream async

transcribe_stream(audio_stream)

Stream transcription with partial results.

Override for providers that support streaming. Default: buffers all audio and returns single result.
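Consumers iterate the stream and act on is_final. A self-contained sketch with a stub provider (StubSTT is illustrative, not roomkit API):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TranscriptionResult:  # minimal stand-in
    text: str
    is_final: bool = True


class StubSTT:
    """Stub provider: partial results per chunk, then a final (illustrative)."""

    async def transcribe_stream(self, audio_stream):
        words = []
        async for chunk in audio_stream:
            words.append(chunk.decode())
            yield TranscriptionResult(" ".join(words), is_final=False)
        yield TranscriptionResult(" ".join(words), is_final=True)


async def chunks():
    for word in (b"hello", b"world"):
        yield word


async def main() -> str:
    final = ""
    async for result in StubSTT().transcribe_stream(chunks()):
        if result.is_final:
            final = result.text
    return final


assert asyncio.run(main()) == "hello world"
```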

warmup async

warmup()

Pre-load models so the first call is fast. Override in subclasses.

close async

close()

Release resources. Override in subclasses if needed.

MockSTTProvider

MockSTTProvider(transcripts=None, *, streaming=False)

Bases: STTProvider

Mock speech-to-text for testing.

Sherpa-ONNX STT

SherpaOnnxSTTProvider

SherpaOnnxSTTProvider(config)

Bases: STTProvider

Speech-to-text provider using sherpa-onnx.

Supports transducer models for both streaming and batch recognition, and Whisper models for batch recognition only.

warmup async

warmup()

Pre-load the recognizer model (CUDA init can be slow).

transcribe async

transcribe(audio)

Transcribe complete audio.

For transducer mode, uses the OnlineRecognizer (feeds all audio then reads the result). For whisper mode, uses the OfflineRecognizer.

Parameters:

Name Type Description Default
audio AudioContent | AudioChunk | AudioFrame

Audio content or raw audio chunk (PCM S16LE expected).

required

Returns:

Type Description
TranscriptionResult

TranscriptionResult with text.

transcribe_stream async

transcribe_stream(audio_stream)

Stream transcription with partial results using OnlineRecognizer.

Only supported for transducer mode. Whisper mode raises ValueError.

Parameters:

Name Type Description Default
audio_stream AsyncIterator[AudioChunk]

Async iterator of audio chunks.

required

Yields:

Type Description
AsyncIterator[TranscriptionResult]

TranscriptionResult with partial and final transcripts.

SherpaOnnxSTTConfig dataclass

SherpaOnnxSTTConfig(mode='transducer', tokens='', encoder='', decoder='', joiner='', model_type='', language='en', task='transcribe', sample_rate=16000, num_threads=2, provider='cpu', enable_endpoint_detection=True, rule1_min_trailing_silence=2.4, rule2_min_trailing_silence=1.2, rule3_min_utterance_length=20.0)

Configuration for the sherpa-onnx STT provider.

Attributes:

Name Type Description
mode str

Recognition mode — "transducer" or "whisper".

tokens str

Path to tokens.txt.

encoder str

Path to encoder .onnx model.

decoder str

Path to decoder .onnx model.

joiner str

Path to joiner .onnx model (transducer only).

model_type str

Model type hint for sherpa-onnx (e.g. "nemo_transducer" for NeMo TDT/transducer models). When set, the model is treated as offline-only (no streaming).

language str

Language code (Whisper only).

task str

Whisper task — "transcribe" (default) or "translate" (translates to English).

sample_rate int

Expected audio sample rate.

num_threads int

Number of CPU threads for inference.

provider str

ONNX execution provider ("cpu" or "cuda").

enable_endpoint_detection bool

Enable sherpa-onnx endpoint detection. Enabled by default. When VAD drives the stream lifecycle, the VAD fires first (its silence threshold is shorter), so this is harmless in a pipeline and useful for standalone use.

rule1_min_trailing_silence float

Endpoint rule 1 — minimum trailing silence (seconds) after speech to trigger endpoint.

rule2_min_trailing_silence float

Endpoint rule 2 — minimum trailing silence (seconds) after speech with decoded text.

rule3_min_utterance_length float

Endpoint rule 3 — minimum utterance length (seconds) to trigger endpoint regardless of silence.

Usage

from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTProvider, SherpaOnnxSTTConfig

# Transducer model (streaming + batch)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="transducer",
    tokens="path/to/tokens.txt",
    encoder="path/to/encoder.onnx",
    decoder="path/to/decoder.onnx",
    joiner="path/to/joiner.onnx",
))

# Whisper model (batch only)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="whisper",
    tokens="path/to/tokens.txt",
    encoder="path/to/encoder.onnx",
    decoder="path/to/decoder.onnx",
    language="en",
))

Install with: pip install roomkit[sherpa-onnx]

TTS (Text-to-Speech)

TTSProvider

Bases: ABC

Text-to-speech provider.

name property

name

Provider name (e.g. 'elevenlabs', 'openai').

default_voice property

default_voice

Default voice ID. Override in subclasses.

supports_streaming_input property

supports_streaming_input

Whether this TTS accepts streaming text input.

synthesize abstractmethod async

synthesize(text, *, voice=None)

Synthesize text to audio.

Parameters:

Name Type Description Default
text str

Text to synthesize.

required
voice str | None

Voice ID (uses default_voice if not specified).

None

Returns:

Type Description
AudioContent

AudioContent with URL to generated audio.

synthesize_stream_input async

synthesize_stream_input(text_stream, *, voice=None)

Stream audio from streaming text chunks.

Override for providers that accept an async text stream as input.

synthesize_stream async

synthesize_stream(text, *, voice=None)

Stream audio chunks as they're generated.

Override for providers that support streaming. Default: synthesizes full audio and yields single chunk.
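Consumers typically accumulate or forward chunks as they arrive. A self-contained sketch with a stub provider (StubTTS is illustrative, not roomkit API):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioChunk:  # minimal stand-in
    data: bytes
    is_final: bool = False


class StubTTS:
    """Stub provider streaming one chunk per word (illustrative)."""

    async def synthesize_stream(self, text: str, *, voice=None):
        words = text.split()
        for i, word in enumerate(words):
            yield AudioChunk(data=word.encode(),
                             is_final=i == len(words) - 1)


async def main() -> bytes:
    pcm = b""
    async for chunk in StubTTS().synthesize_stream("hi there"):
        pcm += chunk.data
    return pcm


assert asyncio.run(main()) == b"hithere"
```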

warmup async

warmup()

Pre-load models so the first call is fast. Override in subclasses.

close async

close()

Release resources. Override in subclasses if needed.

MockTTSProvider

MockTTSProvider(voice='mock-voice')

Bases: TTSProvider

Mock text-to-speech for testing.

Sherpa-ONNX TTS

SherpaOnnxTTSProvider

SherpaOnnxTTSProvider(config)

Bases: TTSProvider

Text-to-speech provider using sherpa-onnx with VITS/Piper models.

warmup async

warmup()

Pre-load the TTS model (CUDA init can be slow).

synthesize async

synthesize(text, *, voice=None)

Synthesize text to audio.

Parameters:

Name Type Description Default
text str

Text to synthesize.

required
voice str | None

Speaker ID as string (uses default if not specified).

None

Returns:

Type Description
AudioContent

AudioContent with a data: URL containing WAV audio.

synthesize_stream async

synthesize_stream(text, *, voice=None)

Stream audio chunks using a callback bridge.

Parameters:

Name Type Description Default
text str

Text to synthesize.

required
voice str | None

Speaker ID as string (uses default if not specified).

None

Yields:

Type Description
AsyncIterator[AudioChunk]

AudioChunk with PCM S16LE audio data.

SherpaOnnxTTSConfig dataclass

SherpaOnnxTTSConfig(model='', tokens='', data_dir='', lexicon='', speaker_id=0, speed=1.0, sample_rate=22050, num_threads=2, provider='cpu')

Configuration for the sherpa-onnx TTS provider.

Attributes:

Name Type Description
model str

Path to VITS/Piper .onnx model.

tokens str

Path to tokens.txt.

data_dir str

Path to espeak-ng data directory (Piper models).

lexicon str

Path to optional lexicon file.

speaker_id int

Speaker ID for multi-speaker models.

speed float

Speech speed multiplier (1.0 = normal).

sample_rate int

Output sample rate (usually determined by the model).

num_threads int

Number of CPU threads for inference.

provider str

ONNX execution provider ("cpu" or "cuda").

Usage

from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSProvider, SherpaOnnxTTSConfig

# VITS/Piper model with multi-speaker support
tts = SherpaOnnxTTSProvider(SherpaOnnxTTSConfig(
    model="path/to/vits-model.onnx",
    tokens="path/to/tokens.txt",
    data_dir="path/to/espeak-ng-data",  # for Piper models
    speaker_id=0,
    speed=1.0,
))

Install with: pip install roomkit[sherpa-onnx]

RTP Backend

RTPVoiceBackend

RTPVoiceBackend(*, local_addr=('0.0.0.0', 0), remote_addr=None, payload_type=0, clock_rate=8000, dtmf_payload_type=101, rtcp_interval=5.0)

Bases: VoiceBackend

VoiceBackend that sends and receives audio over RTP.

Each connect call creates a new aiortp.RTPSession bound to the configured local address and sending to the remote address.

Inbound audio is decoded by aiortp (G.711, L16, Opus) and delivered as AudioFrame objects via the on_audio_received callback. DTMF digits are received out-of-band via RFC 4733 and delivered via on_dtmf_received callbacks.

Parameters:

Name Type Description Default
local_addr tuple[str, int]

(host, port) to bind RTP. Use port 0 for OS-assigned.

('0.0.0.0', 0)
remote_addr tuple[str, int] | None

(host, port) to send RTP to. May be None if supplied per-session via metadata["remote_addr"] in connect.

None
payload_type int

RTP payload type number (default 0 = PCMU).

0
clock_rate int

Clock rate in Hz (default 8000 for G.711).

8000
dtmf_payload_type int

RTP payload type for RFC 4733 DTMF events (default 101).

101
rtcp_interval float

Seconds between RTCP sender reports.

5.0

send_transcription async

send_transcription(session, text, role='user')

Log transcription text (no UI channel in RTP mode).

on_dtmf_received

on_dtmf_received(callback)

Register a callback for inbound DTMF digits (RFC 4733).

Parameters:

Name Type Description Default
callback DTMFReceivedCallback

Function called with (session, dtmf_event).

required

Install with: pip install roomkit[rtp]

Mock Voice Backend

MockVoiceBackend

MockVoiceBackend(*, capabilities=NONE)

Bases: VoiceBackend

Mock voice backend for testing.

Tracks all method calls and provides helpers to simulate events. The backend is a pure transport — no VAD or audio intelligence.

Example

backend = MockVoiceBackend()

# Track calls
session = await backend.connect("room-1", "user-1", "voice-1")
assert backend.calls[-1].method == "connect"

# Simulate raw audio received
frame = AudioFrame(data=b"audio-data")
await backend.simulate_audio_received(session, frame)

# Simulate barge-in
await backend.simulate_barge_in(session)

simulate_audio_received async

simulate_audio_received(session, frame)

Simulate the backend receiving a raw audio frame.

Fires all registered on_audio_received callbacks.

simulate_barge_in async

simulate_barge_in(session)

Simulate user speaking while TTS is playing (barge-in).

Fires all registered on_barge_in callbacks.

start_playing

start_playing(session)

Mark a session as playing audio (for barge-in testing).

stop_playing

stop_playing(session)

Mark a session as no longer playing audio.

simulate_session_ready async

simulate_session_ready(session)

Simulate the backend signalling that a session's audio path is live.

Fires all registered on_session_ready callbacks.

simulate_client_disconnected async

simulate_client_disconnected(session)

Simulate the backend signalling that a client has disconnected.

Fires all registered on_client_disconnected callbacks.

MockVoiceCall dataclass

MockVoiceCall(method, args=dict())

Record of a call made to MockVoiceBackend.

Voice Events

BargeInEvent dataclass

BargeInEvent(session, interrupted_text, audio_position_ms, timestamp=_utcnow())

User started speaking while TTS was playing.

This event is fired when the VAD detects speech starting while audio is being sent to the user. This allows the system to:

- Cancel the current TTS playback
- Adjust response strategy (e.g., acknowledge interruption)
- Track conversation dynamics

session instance-attribute

session

The voice session where barge-in occurred.

interrupted_text instance-attribute

interrupted_text

The text that was being spoken when interrupted.

audio_position_ms instance-attribute

audio_position_ms

How far into the TTS audio playback (in milliseconds).

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the barge-in was detected.

TTSCancelledEvent dataclass

TTSCancelledEvent(session, reason, text, audio_position_ms, timestamp=_utcnow())

TTS playback was cancelled.

This event is fired when TTS synthesis or playback is stopped before completion. Reasons include:

- barge_in: User started speaking
- explicit: Application called interrupt()
- disconnect: Session ended
- error: TTS or playback error

session instance-attribute

session

The voice session where TTS was cancelled.

reason instance-attribute

reason

Why the TTS was cancelled.

text instance-attribute

text

The text that was being synthesized.

audio_position_ms instance-attribute

audio_position_ms

How far into playback (0 if not started).

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the cancellation occurred.

PartialTranscriptionEvent dataclass

PartialTranscriptionEvent(session, text, confidence, is_stable, timestamp=_utcnow())

Interim transcription result during speech.

This event is fired by backends that support streaming STT, providing real-time transcription updates before the final result. Use cases include:

- Live captions/subtitles
- Early intent detection
- Visual feedback during speech

session instance-attribute

session

The voice session being transcribed.

text instance-attribute

text

The current transcription (may change in subsequent events).

confidence instance-attribute

confidence

Confidence score (0.0 to 1.0).

is_stable instance-attribute

is_stable

True if this portion is unlikely to change significantly.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When this transcription was received.

VADSilenceEvent dataclass

VADSilenceEvent(session, silence_duration_ms, timestamp=_utcnow())

Silence detected after speech.

This event is fired when the VAD detects a period of silence following speech. It can be used for:

- Early end-of-utterance detection (before full speech_end)
- Adaptive silence thresholds
- Turn-taking management

session instance-attribute

session

The voice session where silence was detected.

silence_duration_ms instance-attribute

silence_duration_ms

Duration of silence in milliseconds.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the silence was detected.

VADAudioLevelEvent dataclass

VADAudioLevelEvent(session, level_db, is_speech, timestamp=_utcnow())

Periodic audio level update for UI feedback.

This event is fired periodically (typically 10Hz) to provide audio level information for UI visualization. Use cases include:

- Audio level meters
- Speaking indicators
- Noise detection

session instance-attribute

session

The voice session.

level_db instance-attribute

level_db

Audio level in dB (typically -60 to 0, where 0 is max).

is_speech instance-attribute

is_speech

VAD's determination if this audio contains speech.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When this measurement was taken.

Callback Types

Callback Signature
SpeechStartCallback (VoiceSession) -> Any
SpeechEndCallback (VoiceSession, bytes) -> Any
PartialTranscriptionCallback (VoiceSession, str, float, bool) -> Any
VADSilenceCallback (VoiceSession, int) -> Any
VADAudioLevelCallback (VoiceSession, float, bool) -> Any
BargeInCallback (VoiceSession) -> Any
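Any callable matching a signature can be registered; for example, a function conforming to VADAudioLevelCallback (the dB-to-bars mapping is illustrative):

```python
# Conforms to VADAudioLevelCallback: (VoiceSession, float, bool) -> Any.
# The session parameter is left untyped here.
def on_level(session, level_db: float, is_speech: bool) -> str:
    bars = max(0, int((level_db + 60) / 6))  # map -60..0 dB to 0..10 bars
    return "#" * bars + (" speech" if is_speech else " silence")


assert on_level(None, -30.0, True) == "##### speech"
assert on_level(None, -60.0, False) == " silence"
```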