
Voice Providers

Voice Backend

VoiceBackend

Bases: ABC

Abstract base class for voice transport backends.

VoiceBackend handles the transport layer for real-time audio:

- Managing voice session connections
- Streaming audio to/from clients
- Delivering raw inbound audio frames via on_audio_received

The backend is framework-agnostic and a pure transport — all audio intelligence (VAD, denoising, diarization) is handled by the AudioPipeline.

Example usage

backend = WebRTCVoiceBackend()

# Register raw audio callback
backend.on_audio_received(handle_audio_frame)

# Connect a participant
session = await backend.connect("room-1", "user-1", "voice-channel")

# Stream audio to the client
await backend.send_audio(session, audio_chunks)

# Disconnect
await backend.disconnect(session)

name abstractmethod property

name

Backend name (e.g., 'webrtc', 'websocket', 'livekit').

capabilities property

capabilities

Declare supported capabilities.

Override to enable features like interruption, barge-in, etc. By default, no optional capabilities are supported.

Returns:

Type Description
VoiceCapability

Flags indicating supported capabilities.

feeds_aec_reference property

feeds_aec_reference

Whether this backend feeds AEC reference at the transport level.

When True, the pipeline skips aec.feed_reference() in the outbound path to avoid double-feeding. Transport-level feeding (from the speaker callback) is preferred because it is time-aligned with actual speaker output.
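The skip logic this describes can be sketched with stubs (StubAEC, StubBackend, and send_outbound are illustrative names, not roomkit API):

```python
import asyncio


class StubAEC:
    """Records fed reference audio (illustrative)."""

    def __init__(self) -> None:
        self.fed: list[bytes] = []

    def feed_reference(self, pcm: bytes) -> None:
        self.fed.append(pcm)


class StubBackend:
    feeds_aec_reference = True  # transport feeds the reference itself

    async def send_audio(self, session, audio: bytes) -> None:
        pass


class LegacyBackend(StubBackend):
    feeds_aec_reference = False  # pipeline must feed at generation time


# Hypothetical outbound step: only feed a generation-time AEC reference
# when the transport does not feed it at playback time.
async def send_outbound(backend, aec, session, pcm: bytes) -> None:
    if not backend.feeds_aec_reference:
        aec.feed_reference(pcm)
    await backend.send_audio(session, pcm)


aec = StubAEC()
asyncio.run(send_outbound(StubBackend(), aec, None, b"\x00\x00"))
assert aec.fed == []  # skipped: transport feeds the AEC reference
asyncio.run(send_outbound(LegacyBackend(), aec, None, b"\x00\x00"))
assert aec.fed == [b"\x00\x00"]  # fed at generation time
```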

supports_playback_callback property

supports_playback_callback

Whether this backend fires on_audio_played callbacks.

When True, the pipeline can rely on playback-time AEC reference instead of generation-time feeding from process_outbound.

connect async

connect(room_id, participant_id, channel_id, *, metadata=None)

Create a new voice session for a participant.

Backends that initiate connections (VoiceChannel path) override this. Backends that receive external connections (realtime transport path) override accept instead.

Parameters:

Name Type Description Default
room_id str

The room to join.

required
participant_id str

The participant's ID.

required
channel_id str

The voice channel ID.

required
metadata dict[str, Any] | None

Optional session metadata.

None

Returns:

Type Description
VoiceSession

A VoiceSession representing the connection.

disconnect abstractmethod async

disconnect(session)

End a voice session.

Parameters:

Name Type Description Default
session VoiceSession

The session to disconnect.

required

send_audio abstractmethod async

send_audio(session, audio)

Send audio to a voice session.

Parameters:

Name Type Description Default
session VoiceSession

The target session.

required
audio bytes | AsyncIterator[AudioChunk]

Raw audio bytes or an async iterator of AudioChunks for streaming.

required
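The streaming form accepts any async iterator of chunks. A self-contained sketch with stand-in classes (CollectingBackend and tts_chunks are illustrative, not roomkit API):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioChunk:  # minimal stand-in for roomkit's AudioChunk
    data: bytes
    sample_rate: int = 16000
    is_final: bool = False


class CollectingBackend:
    """Stub transport that records streamed chunks (illustrative)."""

    def __init__(self) -> None:
        self.sent: list[bytes] = []

    async def send_audio(self, session, audio) -> None:
        async for chunk in audio:
            self.sent.append(chunk.data)


async def tts_chunks():
    # An async generator satisfies AsyncIterator[AudioChunk].
    yield AudioChunk(data=b"\x00\x01")
    yield AudioChunk(data=b"\x02\x03", is_final=True)


backend = CollectingBackend()
asyncio.run(backend.send_audio(None, tts_chunks()))
assert backend.sent == [b"\x00\x01", b"\x02\x03"]
```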

get_session

get_session(session_id)

Get a session by ID.

Override for backends that track sessions internally.

Parameters:

Name Type Description Default
session_id str

The session ID to look up.

required

Returns:

Type Description
VoiceSession | None

The VoiceSession if found, None otherwise.

list_sessions

list_sessions(room_id)

List all active sessions in a room.

Override for backends that track sessions internally.

Parameters:

Name Type Description Default
room_id str

The room to list sessions for.

required

Returns:

Type Description
list[VoiceSession]

List of active VoiceSessions in the room.

close async

close()

Release backend resources.

Override in subclasses that need cleanup.

on_audio_received

on_audio_received(callback)

Register a callback for raw inbound audio frames.

The pipeline or channel calls this to receive every audio frame produced by the transport.

Parameters:

Name Type Description Default
callback AudioReceivedCallback

Function called with (session, audio_frame).

required
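The on_* registration methods share a simple registry pattern: store every callback and fire them all when the transport event occurs. A self-contained sketch (not roomkit's actual implementation):

```python
import asyncio


class CallbackRegistry:
    """Minimal registry behind an on_* registration method (illustrative)."""

    def __init__(self) -> None:
        self._callbacks = []

    def register(self, callback) -> None:
        self._callbacks.append(callback)

    async def fire(self, *args) -> None:
        # Every registered callback sees every event.
        for callback in self._callbacks:
            await callback(*args)


received = []


async def handle_frame(session, frame):
    received.append((session, frame))


registry = CallbackRegistry()
registry.register(handle_frame)
asyncio.run(registry.fire("sess-1", b"frame"))
assert received == [("sess-1", b"frame")]
```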

on_session_ready

on_session_ready(callback)

Register callback for when a session's audio path becomes live.

Fired when the transport is ready to send/receive audio for a session (e.g. WebSocket connected, RTP socket active, mic stream started). The VoiceChannel uses this together with bind_session to implement the dual-signal ON_SESSION_STARTED hook.

Parameters:

Name Type Description Default
callback SessionReadyCallback

Function called with (session).

required

on_barge_in

on_barge_in(callback)

Register callback for barge-in detection.

Only called if capabilities includes BARGE_IN. Backends should invoke the callback when the user starts speaking while audio is being played (TTS interruption).

Parameters:

Name Type Description Default
callback BargeInCallback

Function called with (session).

required

cancel_audio async

cancel_audio(session)

Cancel ongoing audio playback for a session.

Delegates to interrupt and returns True. Subclasses may override for more nuanced behaviour.

Parameters:

Name Type Description Default
session VoiceSession

The session to cancel audio for.

required

Returns:

Type Description
bool

True if audio was cancelled, False if nothing was playing.

accept async

accept(session, connection)

Bind an external connection to a session.

Backends that receive connections from external sources (e.g. WebSocket, WebRTC, SIP) override this. Backends that create their own connections override connect instead.

Parameters:

Name Type Description Default
session VoiceSession

The voice session to bind.

required
connection Any

Protocol-specific connection object.

required

interrupt

interrupt(session)

Signal interruption — flush outbound queue, stop playback.

set_input_muted

set_input_muted(session, muted)

Mute or unmute the input (microphone) for a session.

Parameters:

Name Type Description Default
session VoiceSession

The session to mute/unmute.

required
muted bool

True to mute, False to unmute.

required

set_input_gated

set_input_gated(session, gated)

Gate or un-gate audio input for primary speaker mode.

When gated, audio is not forwarded to provider callbacks but may still be fed to a pipeline for diarization analysis.

Parameters:

Name Type Description Default
session VoiceSession

The session to gate/un-gate.

required
gated bool

True to gate, False to un-gate.

required

on_client_disconnected

on_client_disconnected(callback)

Register callback for client disconnection.

Parameters:

Name Type Description Default
callback TransportDisconnectCallback

Called with (session) when the client disconnects.

required

on_speaker_change

on_speaker_change(callback)

Register callback for speaker change events.

Parameters:

Name Type Description Default
callback SpeakerChangeCallback

Called with (session, diarization_result).

required

end_of_response

end_of_response(session)

Signal end of an AI response for outbound pacing.

is_playing

is_playing(session)

Check if audio is currently being sent to the session.

Used for barge-in detection to know if interruption is possible.

Parameters:

Name Type Description Default
session VoiceSession

The session to check.

required

Returns:

Type Description
bool

True if audio is currently playing, False otherwise.
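A barge-in handler typically checks is_playing before interrupting. A self-contained sketch with a stub backend (session IDs stand in for VoiceSession objects):

```python
import asyncio


class StubBackend:
    """Stub with is_playing / cancel_audio semantics (illustrative)."""

    def __init__(self) -> None:
        self._playing: set[str] = set()

    def start_playing(self, session_id: str) -> None:
        self._playing.add(session_id)

    def is_playing(self, session_id: str) -> bool:
        return session_id in self._playing

    async def cancel_audio(self, session_id: str) -> bool:
        if session_id in self._playing:
            self._playing.discard(session_id)
            return True
        return False  # nothing was playing


async def handle_barge_in(backend: StubBackend, session_id: str) -> bool:
    # Only interrupt if audio is actually being sent.
    if backend.is_playing(session_id):
        return await backend.cancel_audio(session_id)
    return False


backend = StubBackend()
backend.start_playing("sess-1")
assert asyncio.run(handle_barge_in(backend, "sess-1")) is True
assert asyncio.run(handle_barge_in(backend, "sess-1")) is False
```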

on_audio_played

on_audio_played(callback)

Register a callback for audio frames as they are played.

Called with each audio frame at the moment it is output by the speaker, providing time-aligned reference for echo cancellation. The pipeline uses this to feed AEC reference at the correct time.

Note

Callbacks may be invoked from the audio I/O thread — implementations must be thread-safe.

Parameters:

Name Type Description Default
callback AudioPlayedCallback

Function called with (session, audio_frame).

required
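One way to keep the callback thread-safe is to only enqueue from the I/O thread and feed AEC on a worker. A hypothetical sketch (the queue-based split is an assumption, not roomkit's implementation):

```python
import queue
import threading

# Frames handed off from the audio I/O thread to the AEC worker.
ref_frames: "queue.Queue[bytes]" = queue.Queue()


def on_played(session, frame: bytes) -> None:
    # Safe to call from the audio I/O thread: queue.Queue is thread-safe
    # and put() does no heavy work.
    ref_frames.put(frame)


fed: list[bytes] = []


def aec_worker() -> None:
    while True:
        frame = ref_frames.get()
        if frame is None:  # sentinel to stop the worker
            break
        fed.append(frame)  # stand-in for aec.feed_reference(frame)


worker = threading.Thread(target=aec_worker)
worker.start()
on_played("sess-1", b"\x00\x01")  # as if invoked by the I/O thread
ref_frames.put(None)
worker.join()
assert fed == [b"\x00\x01"]
```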

set_trace_emitter

set_trace_emitter(emitter)

Set a callback for emitting protocol traces.

Called by the owning channel when trace observers are registered. Implementations should store the emitter and call it at key protocol points (e.g. INVITE, BYE for SIP).

Parameters:

Name Type Description Default
emitter Callable[..., Any] | None

The channel's emit_trace method, or None to disable.

required

send_transcription async

send_transcription(session, text, role='user')

Send transcription text to the client for UI display.

Optional method for backends that support sending text updates.

Parameters:

Name Type Description Default
session VoiceSession

The voice session to send to.

required
text str

The transcribed or response text.

required
role str

Either "user" (transcription) or "assistant" (AI response).

'user'

VoiceCapability

Bases: Flag

Capabilities a VoiceBackend can support.

Backends declare their capabilities via the capabilities property. This allows RoomKit to know which features are available and enables integrators to choose backends based on their needs.

Example

class MyBackend(VoiceBackend):
    @property
    def capabilities(self) -> VoiceCapability:
        return (
            VoiceCapability.INTERRUPTION
            | VoiceCapability.BARGE_IN
        )

NONE class-attribute instance-attribute

NONE = 0

No optional capabilities (default).

INTERRUPTION class-attribute instance-attribute

INTERRUPTION = auto()

Backend can cancel ongoing audio playback (cancel_audio).

BARGE_IN class-attribute instance-attribute

BARGE_IN = auto()

Backend detects and handles barge-in (user interrupts TTS).

NATIVE_AEC class-attribute instance-attribute

NATIVE_AEC = auto()

Backend provides its own Acoustic Echo Cancellation.

NATIVE_AGC class-attribute instance-attribute

NATIVE_AGC = auto()

Backend provides its own Automatic Gain Control.

DTMF_INBAND class-attribute instance-attribute

DTMF_INBAND = auto()

Backend can detect DTMF tones from the audio stream.

DTMF_SIGNALING class-attribute instance-attribute

DTMF_SIGNALING = auto()

Backend receives DTMF via out-of-band signaling (e.g. SIP INFO).
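Because VoiceCapability is a Flag, capabilities combine with | and are tested with in. A self-contained sketch mirroring the members above:

```python
from enum import Flag, auto


# Stand-in mirroring the VoiceCapability flags above (illustrative).
class VoiceCapability(Flag):
    NONE = 0
    INTERRUPTION = auto()
    BARGE_IN = auto()
    NATIVE_AEC = auto()
    NATIVE_AGC = auto()
    DTMF_INBAND = auto()
    DTMF_SIGNALING = auto()


caps = VoiceCapability.INTERRUPTION | VoiceCapability.BARGE_IN
assert VoiceCapability.BARGE_IN in caps        # feature supported
assert VoiceCapability.NATIVE_AEC not in caps  # feature absent
```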

VoiceSession dataclass

VoiceSession(id, room_id, participant_id, channel_id, state=CONNECTING, provider_session_id=None, created_at=_utcnow(), metadata=dict())

Active voice connection for a participant.

VoiceSessionState

Bases: StrEnum

State of a voice session.

AudioChunk dataclass

AudioChunk(data, sample_rate=16000, channels=1, format='pcm_s16le', timestamp_ms=None, is_final=False)

A chunk of audio data for streaming (used for outbound TTS).

TranscriptionResult dataclass

TranscriptionResult(text, is_final=True, confidence=None, language=None, words=list(), is_speech_start=False)

Result from speech-to-text transcription.

is_speech_start class-attribute instance-attribute

is_speech_start = False

Set by providers with server-side VAD to signal speech detected.

STT (Speech-to-Text)

STTProvider

Bases: ABC

Speech-to-text provider.

name property

name

Provider name (e.g. 'whisper', 'deepgram').

supports_streaming property

supports_streaming

Whether this provider supports streaming transcription.

transcribe abstractmethod async

transcribe(audio)

Transcribe complete audio to text.

Parameters:

Name Type Description Default
audio AudioContent | AudioChunk | AudioFrame

Audio content (URL), raw audio chunk, or audio frame.

required

Returns:

Type Description
TranscriptionResult

TranscriptionResult with text and metadata.

transcribe_stream async

transcribe_stream(audio_stream)

Stream transcription with partial results.

Override for providers that support streaming. Default: buffers all audio and returns single result.
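Consumers iterate the stream and act on is_final. A self-contained sketch with a stub provider (StubSTT is illustrative, not roomkit API):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TranscriptionResult:  # minimal stand-in
    text: str
    is_final: bool = True


class StubSTT:
    """Stub provider: partial results per chunk, then a final (illustrative)."""

    async def transcribe_stream(self, audio_stream):
        words = []
        async for chunk in audio_stream:
            words.append(chunk.decode())
            yield TranscriptionResult(" ".join(words), is_final=False)
        yield TranscriptionResult(" ".join(words), is_final=True)


async def chunks():
    for word in (b"hello", b"world"):
        yield word


async def main() -> str:
    final = ""
    async for result in StubSTT().transcribe_stream(chunks()):
        if result.is_final:
            final = result.text
    return final


assert asyncio.run(main()) == "hello world"
```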

warmup async

warmup()

Pre-load models so the first call is fast. Override in subclasses.

close async

close()

Release resources. Override in subclasses if needed.

MockSTTProvider

MockSTTProvider(transcripts=None, *, streaming=False)

Bases: STTProvider

Mock speech-to-text for testing.

Sherpa-ONNX STT

SherpaOnnxSTTProvider

SherpaOnnxSTTProvider(config)

Bases: STTProvider

Speech-to-text provider using sherpa-onnx.

Supports transducer models for both streaming and batch recognition, and Whisper models for batch recognition only.

warmup async

warmup()

Pre-load the recognizer model (CUDA init can be slow).

transcribe async

transcribe(audio)

Transcribe complete audio.

For transducer mode, uses the OnlineRecognizer (feeds all audio then reads the result). For whisper mode, uses the OfflineRecognizer.

Parameters:

Name Type Description Default
audio AudioContent | AudioChunk | AudioFrame

Audio content or raw audio chunk (PCM S16LE expected).

required

Returns:

Type Description
TranscriptionResult

TranscriptionResult with text.

transcribe_stream async

transcribe_stream(audio_stream)

Stream transcription with partial results using OnlineRecognizer.

Only supported for transducer mode. Whisper mode raises ValueError.

Parameters:

Name Type Description Default
audio_stream AsyncIterator[AudioChunk]

Async iterator of audio chunks.

required

Yields:

Type Description
AsyncIterator[TranscriptionResult]

TranscriptionResult with partial and final transcripts.

SherpaOnnxSTTConfig dataclass

SherpaOnnxSTTConfig(mode='transducer', tokens='', encoder='', decoder='', joiner='', model_type='', language='en', task='transcribe', sample_rate=16000, num_threads=2, provider='cpu', enable_endpoint_detection=True, rule1_min_trailing_silence=2.4, rule2_min_trailing_silence=1.2, rule3_min_utterance_length=20.0)

Configuration for the sherpa-onnx STT provider.

Attributes:

Name Type Description
mode str

Recognition mode — "transducer" or "whisper".

tokens str

Path to tokens.txt.

encoder str

Path to encoder .onnx model.

decoder str

Path to decoder .onnx model.

joiner str

Path to joiner .onnx model (transducer only).

model_type str

Model type hint for sherpa-onnx (e.g. "nemo_transducer" for NeMo TDT/transducer models). When set, the model is treated as offline-only (no streaming).

language str

Language code (Whisper only).

task str

Whisper task — "transcribe" (default) or "translate" (translates to English).

sample_rate int

Expected audio sample rate.

num_threads int

Number of CPU threads for inference.

provider str

ONNX execution provider ("cpu" or "cuda").

enable_endpoint_detection bool

Enable sherpa-onnx endpoint detection. Enabled by default. When VAD drives the stream lifecycle, the VAD fires first (its silence threshold is shorter), so this is harmless in a pipeline and useful for standalone use.

rule1_min_trailing_silence float

Endpoint rule 1 — minimum trailing silence (seconds) after speech to trigger endpoint.

rule2_min_trailing_silence float

Endpoint rule 2 — minimum trailing silence (seconds) after speech with decoded text.

rule3_min_utterance_length float

Endpoint rule 3 — minimum utterance length (seconds) to trigger endpoint regardless of silence.

Usage

from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTProvider, SherpaOnnxSTTConfig

# Transducer model (streaming + batch)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="transducer",
    tokens="path/to/tokens.txt",
    encoder="path/to/encoder.onnx",
    decoder="path/to/decoder.onnx",
    joiner="path/to/joiner.onnx",
))

# Whisper model (batch only)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="whisper",
    tokens="path/to/tokens.txt",
    encoder="path/to/encoder.onnx",
    decoder="path/to/decoder.onnx",
    language="en",
))

Install with: pip install roomkit[sherpa-onnx]

TTS (Text-to-Speech)

TTSProvider

Bases: ABC

Text-to-speech provider.

name property

name

Provider name (e.g. 'elevenlabs', 'openai').

default_voice property

default_voice

Default voice ID. Override in subclasses.

supports_streaming_input property

supports_streaming_input

Whether this TTS accepts streaming text input.

synthesize abstractmethod async

synthesize(text, *, voice=None)

Synthesize text to audio.

Parameters:

Name Type Description Default
text str

Text to synthesize.

required
voice str | None

Voice ID (uses default_voice if not specified).

None

Returns:

Type Description
AudioContent

AudioContent with URL to generated audio.

synthesize_stream_input async

synthesize_stream_input(text_stream, *, voice=None)

Stream audio from streaming text chunks.

Override for providers that accept an async text stream as input.

synthesize_stream async

synthesize_stream(text, *, voice=None)

Stream audio chunks as they're generated.

Override for providers that support streaming. Default: synthesizes full audio and yields single chunk.
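Consumers typically accumulate or forward chunks as they arrive. A self-contained sketch with a stub provider (StubTTS is illustrative, not roomkit API):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioChunk:  # minimal stand-in
    data: bytes
    is_final: bool = False


class StubTTS:
    """Stub provider streaming one chunk per word (illustrative)."""

    async def synthesize_stream(self, text: str, *, voice=None):
        words = text.split()
        for i, word in enumerate(words):
            yield AudioChunk(data=word.encode(),
                             is_final=i == len(words) - 1)


async def main() -> bytes:
    pcm = b""
    async for chunk in StubTTS().synthesize_stream("hi there"):
        pcm += chunk.data
    return pcm


assert asyncio.run(main()) == b"hithere"
```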

warmup async

warmup()

Pre-load models so the first call is fast. Override in subclasses.

close async

close()

Release resources. Override in subclasses if needed.

MockTTSProvider

MockTTSProvider(voice='mock-voice')

Bases: TTSProvider

Mock text-to-speech for testing.

Sherpa-ONNX TTS

SherpaOnnxTTSProvider

SherpaOnnxTTSProvider(config)

Bases: TTSProvider

Text-to-speech provider using sherpa-onnx with VITS/Piper models.

warmup async

warmup()

Pre-load the TTS model (CUDA init can be slow).

synthesize async

synthesize(text, *, voice=None)

Synthesize text to audio.

Parameters:

Name Type Description Default
text str

Text to synthesize.

required
voice str | None

Speaker ID as string (uses default if not specified).

None

Returns:

Type Description
AudioContent

AudioContent with a data: URL containing WAV audio.

synthesize_stream async

synthesize_stream(text, *, voice=None)

Stream audio chunks using a callback bridge.

Parameters:

Name Type Description Default
text str

Text to synthesize.

required
voice str | None

Speaker ID as string (uses default if not specified).

None

Yields:

Type Description
AsyncIterator[AudioChunk]

AudioChunk with PCM S16LE audio data.

SherpaOnnxTTSConfig dataclass

SherpaOnnxTTSConfig(model='', tokens='', data_dir='', lexicon='', speaker_id=0, speed=1.0, sample_rate=22050, num_threads=2, provider='cpu')

Configuration for the sherpa-onnx TTS provider.

Attributes:

Name Type Description
model str

Path to VITS/Piper .onnx model.

tokens str

Path to tokens.txt.

data_dir str

Path to espeak-ng data directory (Piper models).

lexicon str

Path to optional lexicon file.

speaker_id int

Speaker ID for multi-speaker models.

speed float

Speech speed multiplier (1.0 = normal).

sample_rate int

Output sample rate (usually determined by the model).

num_threads int

Number of CPU threads for inference.

provider str

ONNX execution provider ("cpu" or "cuda").

Usage

from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSProvider, SherpaOnnxTTSConfig

# VITS/Piper model with multi-speaker support
tts = SherpaOnnxTTSProvider(SherpaOnnxTTSConfig(
    model="path/to/vits-model.onnx",
    tokens="path/to/tokens.txt",
    data_dir="path/to/espeak-ng-data",  # for Piper models
    speaker_id=0,
    speed=1.0,
))

Install with: pip install roomkit[sherpa-onnx]

RTP Backend

RTPVoiceBackend

RTPVoiceBackend(*, local_addr=('0.0.0.0', 0), remote_addr=None, payload_type=0, clock_rate=8000, dtmf_payload_type=101, rtcp_interval=5.0)

Bases: VoiceBackend

VoiceBackend that sends and receives audio over RTP.

Each connect call creates a new aiortp.RTPSession bound to the configured local address and sending to the remote address.

Inbound audio is decoded by aiortp (G.711, L16, Opus) and delivered as AudioFrame objects via the on_audio_received callback. DTMF digits are received out-of-band via RFC 4733 and delivered via on_dtmf_received callbacks.

Parameters:

Name Type Description Default
local_addr tuple[str, int]

(host, port) to bind RTP. Use port 0 for OS-assigned.

('0.0.0.0', 0)
remote_addr tuple[str, int] | None

(host, port) to send RTP to. May be None if supplied per-session via metadata["remote_addr"] in connect.

None
payload_type int

RTP payload type number (default 0 = PCMU).

0
clock_rate int

Clock rate in Hz (default 8000 for G.711).

8000
dtmf_payload_type int

RTP payload type for RFC 4733 DTMF events (default 101).

101
rtcp_interval float

Seconds between RTCP sender reports.

5.0

send_transcription async

send_transcription(session, text, role='user')

Log transcription text (no UI channel in RTP mode).

on_dtmf_received

on_dtmf_received(callback)

Register a callback for inbound DTMF digits (RFC 4733).

Parameters:

Name Type Description Default
callback DTMFReceivedCallback

Function called with (session, dtmf_event).

required

Install with: pip install roomkit[rtp]

Mock Voice Backend

MockVoiceBackend

MockVoiceBackend(*, capabilities=NONE)

Bases: VoiceBackend

Mock voice backend for testing.

Tracks all method calls and provides helpers to simulate events. The backend is a pure transport — no VAD or audio intelligence.

Example

backend = MockVoiceBackend()

# Track calls
session = await backend.connect("room-1", "user-1", "voice-1")
assert backend.calls[-1].method == "connect"

# Simulate raw audio received
frame = AudioFrame(data=b"audio-data")
await backend.simulate_audio_received(session, frame)

# Simulate barge-in
await backend.simulate_barge_in(session)

simulate_audio_received async

simulate_audio_received(session, frame)

Simulate the backend receiving a raw audio frame.

Fires all registered on_audio_received callbacks.

simulate_barge_in async

simulate_barge_in(session)

Simulate user speaking while TTS is playing (barge-in).

Fires all registered on_barge_in callbacks.

start_playing

start_playing(session)

Mark a session as playing audio (for barge-in testing).

stop_playing

stop_playing(session)

Mark a session as no longer playing audio.

simulate_session_ready async

simulate_session_ready(session)

Simulate the backend signalling that a session's audio path is live.

Fires all registered on_session_ready callbacks.

simulate_client_disconnected async

simulate_client_disconnected(session)

Simulate the backend signalling that a client has disconnected.

Fires all registered on_client_disconnected callbacks.

MockVoiceCall dataclass

MockVoiceCall(method, args=dict())

Record of a call made to MockVoiceBackend.

Voice Events

BargeInEvent dataclass

BargeInEvent(session, interrupted_text, audio_position_ms, timestamp=_utcnow())

User started speaking while TTS was playing.

This event is fired when the VAD detects speech starting while audio is being sent to the user. This allows the system to:

- Cancel the current TTS playback
- Adjust response strategy (e.g., acknowledge interruption)
- Track conversation dynamics

session instance-attribute

session

The voice session where barge-in occurred.

interrupted_text instance-attribute

interrupted_text

The text that was being spoken when interrupted.

audio_position_ms instance-attribute

audio_position_ms

How far into the TTS audio playback (in milliseconds).

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the barge-in was detected.

TTSCancelledEvent dataclass

TTSCancelledEvent(session, reason, text, audio_position_ms, timestamp=_utcnow())

TTS playback was cancelled.

This event is fired when TTS synthesis or playback is stopped before completion. Reasons include:

- barge_in: User started speaking
- explicit: Application called interrupt()
- disconnect: Session ended
- error: TTS or playback error

session instance-attribute

session

The voice session where TTS was cancelled.

reason instance-attribute

reason

Why the TTS was cancelled.

text instance-attribute

text

The text that was being synthesized.

audio_position_ms instance-attribute

audio_position_ms

How far into playback (0 if not started).

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the cancellation occurred.

PartialTranscriptionEvent dataclass

PartialTranscriptionEvent(session, text, confidence, is_stable, timestamp=_utcnow())

Interim transcription result during speech.

This event is fired by backends that support streaming STT, providing real-time transcription updates before the final result. Use cases include:

- Live captions/subtitles
- Early intent detection
- Visual feedback during speech

session instance-attribute

session

The voice session being transcribed.

text instance-attribute

text

The current transcription (may change in subsequent events).

confidence instance-attribute

confidence

Confidence score (0.0 to 1.0).

is_stable instance-attribute

is_stable

True if this portion is unlikely to change significantly.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When this transcription was received.

VADSilenceEvent dataclass

VADSilenceEvent(session, silence_duration_ms, timestamp=_utcnow())

Silence detected after speech.

This event is fired when the VAD detects a period of silence following speech. It can be used for:

- Early end-of-utterance detection (before full speech_end)
- Adaptive silence thresholds
- Turn-taking management

session instance-attribute

session

The voice session where silence was detected.

silence_duration_ms instance-attribute

silence_duration_ms

Duration of silence in milliseconds.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When the silence was detected.

VADAudioLevelEvent dataclass

VADAudioLevelEvent(session, level_db, is_speech, timestamp=_utcnow())

Periodic audio level update for UI feedback.

This event is fired periodically (typically 10Hz) to provide audio level information for UI visualization. Use cases include:

- Audio level meters
- Speaking indicators
- Noise detection

session instance-attribute

session

The voice session.

level_db instance-attribute

level_db

Audio level in dB (typically -60 to 0, where 0 is max).

is_speech instance-attribute

is_speech

VAD's determination if this audio contains speech.

timestamp class-attribute instance-attribute

timestamp = field(default_factory=_utcnow)

When this measurement was taken.

Callback Types

Callback Signature
SpeechStartCallback (VoiceSession) -> Any
SpeechEndCallback (VoiceSession, bytes) -> Any
PartialTranscriptionCallback (VoiceSession, str, float, bool) -> Any
VADSilenceCallback (VoiceSession, int) -> Any
VADAudioLevelCallback (VoiceSession, float, bool) -> Any
BargeInCallback (VoiceSession) -> Any
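Any callable matching a signature can be registered; for example, a function conforming to VADAudioLevelCallback (the dB-to-bars mapping is illustrative):

```python
# Conforms to VADAudioLevelCallback: (VoiceSession, float, bool) -> Any.
# The session parameter is left untyped here.
def on_level(session, level_db: float, is_speech: bool) -> str:
    bars = max(0, int((level_db + 60) / 6))  # map -60..0 dB to 0..10 bars
    return "#" * bars + (" speech" if is_speech else " silence")


assert on_level(None, -30.0, True) == "##### speech"
assert on_level(None, -60.0, False) == " silence"
```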