Audio Bridging

Audio bridging enables direct session-to-session audio forwarding for human-to-human voice calls. Audio flows between participants without the STT/TTS roundtrip, preserving natural voice quality and minimizing latency.

Quick Start

Enable bridging by passing bridge=True to VoiceChannel:

from roomkit import RoomKit, VoiceChannel
from roomkit.voice.backends.sip import SIPVoiceBackend

backend = SIPVoiceBackend(
    local_sip_addr=("0.0.0.0", 5060),
    local_rtp_ip="0.0.0.0",
)

voice = VoiceChannel("voice", backend=backend, bridge=True)

kit = RoomKit()
kit.register_channel(voice)

When two participants join the same room and are bound to the voice channel, audio from each is forwarded to the other automatically.

How It Works

Session A mic → Backend → Inbound Pipeline → AudioBridge.forward()
                                                  ├─→ Outbound Pipeline → Session B speaker
                                                  └─→ VAD/STT (optional, parallel)
  1. Inbound audio from each session passes through the full inbound pipeline (resampler, AEC, AGC, denoiser, VAD).
  2. AudioBridge receives the processed frame and forwards it to all other sessions in the same room.
  3. Outbound pipeline processes each forwarded frame per target (recorder tap, AEC reference feeding, resampler) before it reaches the transport.
  4. STT path (if configured) runs in parallel -- bridging and transcription are independent.
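The fan-out step can be sketched as follows. This is an illustrative model only, with hypothetical names (`Bridge`, `rooms`, `deliver`), not the actual roomkit internals:

```python
# Sketch of the bridge fan-out step. Names here are hypothetical;
# the real AudioBridge API may differ.
from dataclasses import dataclass, field

@dataclass
class AudioFrame:
    pcm: bytes
    sample_rate: int

@dataclass
class Bridge:
    # room_id -> {session_id: deliver_callback}
    rooms: dict = field(default_factory=dict)

    def forward(self, room_id, source_id, frame):
        # Fan the processed inbound frame out to every *other* session;
        # each target's outbound pipeline then processes its copy.
        for session_id, deliver in self.rooms.get(room_id, {}).items():
            if session_id != source_id:
                deliver(frame)

bridge = Bridge()
received = []
bridge.rooms["room-1"] = {
    "a": lambda f: received.append(("a", f)),
    "b": lambda f: received.append(("b", f)),
}
bridge.forward("room-1", "a", AudioFrame(b"\x00\x01", 8000))
# only session "b" receives the frame
```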

Configuration

AudioBridgeConfig

from roomkit.voice.bridge import AudioBridgeConfig

config = AudioBridgeConfig(
    max_participants=10,         # Max sessions per room (default: 10)
    mixing_strategy="forward",   # "forward" (2-party) or "mix" (N-party)
)

voice = VoiceChannel("voice", backend=backend, bridge=config)

Parameter         Default    Description
enabled           True       Whether bridging is active. Set False to pause forwarding.
max_participants  10         Maximum bridged sessions per room. Raises RuntimeError if exceeded.
mixing_strategy   "forward"  "forward" sends frames directly; "mix" does N-party additive mixing.
mixer             Auto       Mixer provider for PCM mixing. Auto-detects NumPy (fast) or falls back to pure Python.

N-Party Mixing

For conference calls with 3+ participants, use mixing_strategy="mix":

config = AudioBridgeConfig(mixing_strategy="mix")
voice = VoiceChannel("voice", backend=backend, bridge=config)

Each participant hears a mix of all other participants' audio (excluding their own). The mixing algorithm:

  • 2 sources: summed sample-by-sample, clamped to the int16 range
  • 3+ sources: averaged (divided by N), clamped to the int16 range

All arithmetic uses int32 accumulation to avoid overflow during intermediate steps.
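The arithmetic above can be sketched in a few lines of plain Python. This is a model of the described behavior, not the library's implementation; it assumes each source is an equal-length sequence of int16 samples:

```python
# Sketch of the mixing arithmetic: 2 sources are summed, 3+ are averaged,
# and results are clamped to the int16 range. (Python ints don't overflow,
# so the int32 accumulation concern doesn't apply here.)
def mix_int16(sources):
    n = len(sources)
    mixed = []
    for samples in zip(*sources):
        acc = sum(samples)
        if n >= 3:
            acc //= n  # average for 3+ sources
        mixed.append(max(-32768, min(32767, acc)))  # clamp to int16
    return mixed

mix_int16([[30000, -100], [10000, -100]])  # sum, clamped -> [32767, -200]
mix_int16([[300, 0], [600, 0], [900, 0]])  # average -> [600, 0]
```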

MixerProvider

The mixer is pluggable via the MixerProvider ABC:

from roomkit.voice.bridge import AudioBridgeConfig
from roomkit.voice.pipeline.mixer.numpy import NumpyMixerProvider
from roomkit.voice.pipeline.mixer.python import PythonMixerProvider

# Explicit NumPy mixer (~20x faster than pure Python)
config = AudioBridgeConfig(
    mixing_strategy="mix",
    mixer=NumpyMixerProvider(),
)

# Pure Python fallback (no dependencies)
config = AudioBridgeConfig(
    mixing_strategy="mix",
    mixer=PythonMixerProvider(),
)

# Auto-detect (default): NumPy if installed, else Python
config = AudioBridgeConfig(mixing_strategy="mix")

Cross-Rate Resampling

When participants have different native sample rates (e.g., SIP at 8kHz and WebRTC at 48kHz), the bridge automatically resamples audio to match each target's rate:

SIP (8kHz) ──► Bridge ──► resample to 48kHz ──► WebRTC participant
WebRTC (48kHz) ──► Bridge ──► resample to 8kHz ──► SIP participant

In mix mode, all source frames are resampled to the target's rate before mixing to ensure temporal alignment. Resampling uses a cached LinearResamplerProvider singleton for efficiency.

When a frame_processor is set (i.e., the outbound pipeline handles resampling), the bridge skips its own resampling to avoid double resampling.
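A minimal linear-interpolation resampler, for intuition about what the rate conversion does. The actual LinearResamplerProvider implementation may differ in detail (framing, rounding, caching):

```python
# Linear-interpolation resampling sketch: stretch or compress a sample
# sequence from src_rate to dst_rate by interpolating between neighbors.
def resample_linear(samples, src_rate, dst_rate):
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        # Map output index i onto the source index space [0, len-1].
        pos = i * (len(samples) - 1) / max(1, n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out

resample_linear([0, 100], 8000, 48000)  # 2 samples -> 12, interpolated 0..100
```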

Bridge + STT (Transcription)

Bridge mode and STT run in parallel. This is useful for compliance recording, live transcription display, or AI monitoring:

voice = VoiceChannel(
    "voice",
    backend=backend,
    bridge=True,
    stt=deepgram_provider,
)

Audio frames pass through both the bridge (forwarding to other participants) and the VAD/STT pipeline (transcription) simultaneously.

Bridge + TTS (AI Moderator)

When both bridge and TTS are active, the AI can speak into a bridged call via say(). TTS audio and bridge audio coexist -- neither blocks the other:

voice = VoiceChannel(
    "voice",
    backend=backend,
    bridge=True,
    stt=deepgram_provider,
    tts=elevenlabs_provider,
)

# AI can speak into the bridged call
await voice.say(session, "This call may be recorded for quality purposes.")

Echo suppression during TTS

After TTS playback completes, there is a 2-second echo decay window during which speech detection is suppressed. Schedule AI interjections before or well after participant speech segments.
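The decay window behaves like a simple time gate. A sketch of that logic, using hypothetical names (`EchoGate`, `on_tts_complete`) rather than roomkit's actual internals:

```python
# Sketch: suppress speech detection inside the post-TTS echo decay window
# (2 s, per the note above). Names are illustrative, not roomkit APIs.
import time

ECHO_DECAY_S = 2.0

class EchoGate:
    def __init__(self):
        self._tts_ended_at = None

    def on_tts_complete(self, now=None):
        self._tts_ended_at = time.monotonic() if now is None else now

    def speech_allowed(self, now=None):
        if self._tts_ended_at is None:
            return True
        now = time.monotonic() if now is None else now
        return (now - self._tts_ended_at) >= ECHO_DECAY_S

gate = EchoGate()
gate.on_tts_complete(now=100.0)
gate.speech_allowed(now=101.0)   # False: inside the decay window
gate.speech_allowed(now=102.5)   # True: window has elapsed
```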

Bridge + Interruption

When a participant speaks while TTS is playing (barge-in), the interruption handler cancels TTS but does not stop bridge forwarding. This ensures:

  • AI speech is interrupted (TTS cancelled)
  • Human-to-human bridge audio continues uninterrupted
  • The participant's speech is still forwarded to other bridged sessions

Frame Filtering

Use set_bridge_filter() to inspect or modify audio frames before they are forwarded. The filter runs synchronously in the audio callback thread and must complete quickly (< 1ms):

# Mute a specific participant
def mute_filter(session, frame):
    if session.id == muted_session_id:
        return None  # drop frame (mute)
    return frame

voice.set_bridge_filter(mute_filter)

# Remove the filter
voice.set_bridge_filter(None)

The filter receives (source_session, AudioFrame) and returns:

  • The frame (unchanged or modified) to forward it
  • None to drop the frame (mute the source for that frame)
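A filter can also modify frames rather than drop them. A sketch of a gain filter, assuming the frame exposes raw int16 PCM bytes via a `.data` attribute (the real AudioFrame attribute name may differ):

```python
import array

def make_gain_filter(gain):
    # Returns a bridge filter that scales int16 PCM amplitude by `gain`.
    # Assumes frame.data holds little-endian int16 PCM bytes (hypothetical).
    def gain_filter(session, frame):
        samples = array.array("h")
        samples.frombytes(frame.data)
        scaled = array.array(
            "h",
            (max(-32768, min(32767, int(s * gain))) for s in samples),
        )
        frame.data = scaled.tobytes()
        return frame
    return gain_filter

# voice.set_bridge_filter(make_gain_filter(0.5))  # halve bridged volume
```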

Hooks

BEFORE_BRIDGE_AUDIO

The BEFORE_BRIDGE_AUDIO hook fires for each audio frame before it is forwarded to other sessions. It supports HookResult.block() to drop individual frames:

@kit.hook(HookTrigger.BEFORE_BRIDGE_AUDIO, execution=HookExecution.SYNC)
async def monitor_bridge(event, ctx):
    # event.session — source session
    # event.frame — the AudioFrame about to be forwarded
    # event.room_id — the room where bridging is active

    if should_mute(event.session):
        return HookResult.block(reason="participant_muted")
    return HookResult.allow()

Performance: fast path

When no BEFORE_BRIDGE_AUDIO hooks are registered, the bridge forwards frames directly in the audio thread with zero overhead. When hooks are registered, frames are routed through the event loop for hook evaluation.

For synchronous, ultra-low-latency frame filtering (< 1ms), use set_bridge_filter() instead.

Other Voice Hooks

These hooks fire normally during bridge mode:

  • ON_SPEECH_START / ON_SPEECH_END -- VAD still runs on bridged audio
  • ON_TRANSCRIPTION -- if STT is configured
  • ON_RECORDING_STARTED / ON_RECORDING_STOPPED -- recorder captures both sides
  • ON_SESSION_STARTED -- session lifecycle unchanged

Session Lifecycle

Sessions are registered with the bridge when bound to the voice channel and unregistered when unbound:

# Join: binds session and adds to bridge
# Previously voice.bind_session() / voice.unbind_session()
await kit.join("room-1", "voice", session=session)

# Leave: unbinds session and removes from bridge
await kit.leave(session)

Thread Safety

All bridge operations are thread-safe. AudioBridge.forward() is called from audio callback threads (the same context as on_audio_received). Internal state is protected by a threading.Lock.

Examples

Mock Conference + AI Summary

examples/voice_bridge_summary.py — 3-party mock conference with live transcription. When all participants leave, an AI summarizes the conversation.

uv run python examples/voice_bridge_summary.py

SIP Bridge + Deepgram + Anthropic

examples/voice_sip_bridge_summary.py — Real SIP calls bridged with Deepgram STT (multilingual auto-detect) and Claude for meeting summaries in the same language as the speakers.

DEEPGRAM_API_KEY=... ANTHROPIC_API_KEY=... \
    uv run python examples/voice_sip_bridge_summary.py

Bridge + AI Moderator

examples/voice_bridge_with_ai.py — Two participants bridged while an AI moderator interjects via TTS. Demonstrates bridge + STT + TTS coexistence.

uv run python examples/voice_bridge_with_ai.py