
Voice Interruption Handling

RoomKit provides four interruption strategies to control what happens when a user speaks while the AI is talking. This guide covers configuration, backchannel detection, and practical patterns for each strategy.

Quick Start

from __future__ import annotations

from roomkit.channels import VoiceChannel
from roomkit.voice.interruption import InterruptionConfig, InterruptionStrategy

voice = VoiceChannel(
    "voice",
    stt=stt,
    tts=tts,
    backend=backend,
    interruption=InterruptionConfig(
        strategy=InterruptionStrategy.CONFIRMED,
        min_speech_ms=300,
    ),
)

Strategy Comparison

| Strategy | Behavior | Use Case | False Positives |
| --- | --- | --- | --- |
| IMMEDIATE | Cancel TTS on any speech detection | IVR, command interfaces | Highest |
| CONFIRMED | Wait for min_speech_ms before interrupting | General conversation | Low |
| SEMANTIC | Use backchannel detection to distinguish "uh-huh" from real interruptions | Customer support, therapy | Lowest |
| DISABLED | Ignore user speech during playback | Announcements, legal disclaimers | None |
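
To make the comparison concrete, here is a simplified, self-contained model of how each strategy might decide whether to cancel playback. It is a sketch, not RoomKit's implementation: the `Strategy` enum and the `should_interrupt` function are hypothetical names defined only for this example, and the SEMANTIC branch takes a precomputed `is_backchannel` flag where a real detector would run.

```python
from __future__ import annotations

from enum import Enum


class Strategy(Enum):
    """Hypothetical stand-in for InterruptionStrategy (illustration only)."""

    IMMEDIATE = "immediate"
    CONFIRMED = "confirmed"
    SEMANTIC = "semantic"
    DISABLED = "disabled"


def should_interrupt(
    strategy: Strategy,
    speech_duration_ms: float,
    min_speech_ms: float = 300,
    is_backchannel: bool = False,
) -> bool:
    """Return True if playback should be cancelled under this simplified model."""
    if strategy is Strategy.DISABLED:
        return False  # user speech is ignored while the AI is talking
    if strategy is Strategy.IMMEDIATE:
        return speech_duration_ms > 0  # any detected speech cancels playback
    if strategy is Strategy.CONFIRMED:
        return speech_duration_ms >= min_speech_ms  # wait for enough speech
    # SEMANTIC: a backchannel such as "uh-huh" never interrupts
    return not is_backchannel


print(should_interrupt(Strategy.IMMEDIATE, 50))
print(should_interrupt(Strategy.CONFIRMED, 50))
print(should_interrupt(Strategy.SEMANTIC, 400, is_backchannel=True))
```

The same 50 ms blip interrupts under IMMEDIATE but not under CONFIRMED, which is why IMMEDIATE carries the highest false-positive rate in the table.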

Configuration

from roomkit.voice.interruption import InterruptionConfig, InterruptionStrategy

config = InterruptionConfig(
    strategy=InterruptionStrategy.CONFIRMED,
    min_speech_ms=300,
    allow_during_first_ms=0,
    flush_partial_tts=True,
    keep_partial_transcript=True,
    backchannel_detector=None,
)

| Field | Default | Description |
| --- | --- | --- |
| strategy | CONFIRMED | Which interruption strategy to use |
| min_speech_ms | 300 | Minimum speech duration (ms) before confirming interruption (CONFIRMED strategy) |
| allow_during_first_ms | 0 | Grace period: don't interrupt during the first N ms of playback |
| flush_partial_tts | True | Flush buffered TTS audio on interruption |
| keep_partial_transcript | True | Keep the partial transcript when interrupted |
| backchannel_detector | None | Required for SEMANTIC strategy |

Strategy Details

IMMEDIATE

Cancels TTS playback the instant any speech is detected. Best for systems where responsiveness matters more than naturalness.

voice = VoiceChannel(
    "voice",
    stt=stt, tts=tts, backend=backend,
    interruption=InterruptionConfig(
        strategy=InterruptionStrategy.IMMEDIATE,
        allow_during_first_ms=200,  # Prevent echo-triggered interruptions
    ),
)

Tip

Use allow_during_first_ms to add a grace period at the start of playback. This prevents the AI's own playback audio from echoing back into the microphone and triggering a false interruption.

CONFIRMED

Waits until the user has been speaking for at least min_speech_ms before confirming the interruption. Filters out brief noise and accidental sounds.

voice = VoiceChannel(
    "voice",
    stt=stt, tts=tts, backend=backend,
    interruption=InterruptionConfig(
        strategy=InterruptionStrategy.CONFIRMED,
        min_speech_ms=300,  # Require 300ms of continuous speech
    ),
)
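
The notion of "continuous speech" can be illustrated with a small accumulator over voice-activity frames. This is a hedged sketch, not RoomKit's internals: the `SpeechTimer` class, the 20 ms frame length, and the rule that a single silent frame resets the count are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpeechTimer:
    """Hypothetical helper: tracks how long the user has spoken without a pause."""

    min_speech_ms: int = 300
    frame_ms: int = 20  # assumed VAD frame length
    speech_ms: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD frame; return True once the threshold is reached."""
        if is_speech:
            self.speech_ms += self.frame_ms
        else:
            self.speech_ms = 0  # a silent frame breaks the run
        return self.speech_ms >= self.min_speech_ms


timer = SpeechTimer()
# 5 speech frames (100 ms), one silent frame, then 15 speech frames (300 ms)
frames = [True] * 5 + [False] + [True] * 15
results = [timer.push(frame) for frame in frames]
confirmed_at = results.index(True)  # first frame that confirms the interruption
print(confirmed_at)
```

In this model the initial 100 ms burst never confirms an interruption because the pause resets the counter; only the later uninterrupted 300 ms run does.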

SEMANTIC

Uses a BackchannelDetector to distinguish backchannels ("uh-huh", "yeah", "ok") from real interruptions. Falls back to CONFIRMED if no detector is configured.

voice = VoiceChannel(
    "voice",
    stt=stt, tts=tts, backend=backend,
    interruption=InterruptionConfig(
        strategy=InterruptionStrategy.SEMANTIC,
        backchannel_detector=my_detector,
    ),
)

DISABLED

Ignores all user speech during playback. The AI finishes speaking before processing any user input.

voice = VoiceChannel(
    "voice",
    stt=stt, tts=tts, backend=backend,
    interruption=InterruptionConfig(
        strategy=InterruptionStrategy.DISABLED,
    ),
)

Backchannel Detection

The SEMANTIC strategy relies on a BackchannelDetector to classify short utterances (without one, it falls back to CONFIRMED). A simple keyword-based detector looks like this:

from __future__ import annotations

from roomkit.voice.pipeline.backchannel import (
    BackchannelContext,
    BackchannelDecision,
    BackchannelDetector,
)


class KeywordBackchannelDetector(BackchannelDetector):
    """Simple keyword-based backchannel detector."""

    BACKCHANNELS = {"uh-huh", "yeah", "ok", "mhm", "right", "sure", "hmm", "yes"}

    @property
    def name(self) -> str:
        return "keyword_backchannel"

    def classify(self, context: BackchannelContext) -> BackchannelDecision:
        text = (context.transcript or "").strip().lower()
        is_bc = text in self.BACKCHANNELS and context.speech_duration_ms < 1000
        return BackchannelDecision(
            is_backchannel=is_bc,
            confidence=0.9 if is_bc else 0.1,
            label="acknowledgement" if is_bc else None,
        )
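
Transcripts may arrive with capitalization, punctuation, or different spacing ("Uh huh.", "OK,"), which an exact set lookup would miss. The following sketch shows one way to normalize text before the keyword check. It runs without RoomKit; `normalize_utterance` and `is_keyword_backchannel` are hypothetical helper names, and the normalization rules are assumptions rather than library behavior.

```python
from __future__ import annotations

import re

BACKCHANNELS = {"uh-huh", "yeah", "ok", "mhm", "right", "sure", "hmm", "yes"}


def normalize_utterance(text: str | None) -> str:
    """Lowercase, drop punctuation other than hyphens, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s-]", "", (text or "").lower())
    return " ".join(cleaned.split())


def is_keyword_backchannel(text: str | None, speech_duration_ms: float) -> bool:
    """Apply the keyword and duration check from the detector above to normalized text."""
    normalized = normalize_utterance(text)
    # Treat "uh huh" and "uh-huh" as the same token.
    candidates = {normalized, normalized.replace(" ", "-")}
    return bool(candidates & BACKCHANNELS) and speech_duration_ms < 1000


print(is_keyword_backchannel("Uh huh.", 400))
print(is_keyword_backchannel("Wait, stop.", 400))
```

Inside a detector, `classify` could call such a helper in place of the direct `text in self.BACKCHANNELS` lookup.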

BackchannelContext

The detector receives rich context for classification:

| Field | Type | Description |
| --- | --- | --- |
| transcript | str \| None | Transcribed text of the utterance |
| speech_duration_ms | float | Duration of the utterance |
| audio_bytes | bytes \| None | Raw audio for acoustic classifiers |
| bot_speech_progress | float | How far into the bot's utterance (0.0-1.0) |
| session_id | str | Voice session ID |
| metadata | dict | Additional context |

Pipeline Integration

Interruption config can also be set via AudioPipelineConfig:

from __future__ import annotations

from roomkit.channels import VoiceChannel
from roomkit.voice.interruption import InterruptionConfig, InterruptionStrategy
from roomkit.voice.pipeline import AudioPipelineConfig

pipeline = AudioPipelineConfig(
    vad=vad,
    denoiser=denoiser,
    backchannel_detector=my_detector,
    interruption=InterruptionConfig(
        strategy=InterruptionStrategy.SEMANTIC,
        min_speech_ms=300,
    ),
)

voice = VoiceChannel(
    "voice",
    stt=stt, tts=tts,
    backend=backend,
    pipeline=pipeline,
)

Configuration Priority

  1. Explicit interruption parameter on VoiceChannel (highest)
  2. pipeline.interruption from AudioPipelineConfig
  3. Legacy enable_barge_in + barge_in_threshold_ms (fallback)
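
The priority order above amounts to "first configured source wins". A minimal sketch of that rule follows, assuming an unset source is represented as `None`. The function name `resolve_interruption` is hypothetical, and plain strings stand in for InterruptionConfig objects so the example runs without RoomKit.

```python
from __future__ import annotations

from typing import Optional, TypeVar

T = TypeVar("T")


def resolve_interruption(
    explicit: Optional[T],
    pipeline_value: Optional[T],
    legacy_value: T,
) -> T:
    """Pick the first configured source, following the documented priority order."""
    if explicit is not None:
        return explicit  # 1. interruption= passed to VoiceChannel
    if pipeline_value is not None:
        return pipeline_value  # 2. pipeline.interruption
    return legacy_value  # 3. derived from enable_barge_in / barge_in_threshold_ms


print(resolve_interruption(None, "pipeline-config", "legacy-config"))
```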

Hook Integration

Three hooks fire during interruption events:

ON_BARGE_IN

Fires when the user successfully interrupts the AI:

from __future__ import annotations

from roomkit import HookTrigger, HookExecution

@kit.hook(HookTrigger.ON_BARGE_IN, execution=HookExecution.ASYNC)
async def on_barge_in(event, ctx):
    # event is a BargeInEvent
    print(f"User interrupted at {event.audio_position_ms}ms")
    print(f"AI was saying: {event.interrupted_text}")

ON_TTS_CANCELLED

Fires when TTS playback is cancelled for any reason:

@kit.hook(HookTrigger.ON_TTS_CANCELLED, execution=HookExecution.ASYNC)
async def on_tts_cancelled(event, ctx):
    # event is a TTSCancelledEvent
    # event.reason: "barge_in" | "explicit" | "disconnect" | "error"
    if event.reason == "barge_in":
        print(f"TTS stopped at {event.audio_position_ms}ms due to barge-in")
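
One possible use of this hook is to tally why playback stops, for example when tuning min_speech_ms. The sketch below exercises such a handler with simulated events so it runs without RoomKit; `record_tts_cancelled` and `cancel_reasons` are hypothetical names, and the `SimpleNamespace` objects only imitate the `reason` and `audio_position_ms` attributes shown above.

```python
from __future__ import annotations

import asyncio
from collections import Counter
from types import SimpleNamespace

cancel_reasons: Counter[str] = Counter()


async def record_tts_cancelled(event, ctx) -> None:
    """Count cancellations per reason; same signature as the hook above."""
    cancel_reasons[event.reason] += 1


# Simulated events standing in for TTSCancelledEvent objects.
fake_events = [
    SimpleNamespace(reason="barge_in", audio_position_ms=1200),
    SimpleNamespace(reason="barge_in", audio_position_ms=300),
    SimpleNamespace(reason="disconnect", audio_position_ms=0),
]


async def main() -> None:
    for event in fake_events:
        await record_tts_cancelled(event, ctx=None)


asyncio.run(main())
print(cancel_reasons.most_common())
```

In an application, the same coroutine could be registered with the `@kit.hook(HookTrigger.ON_TTS_CANCELLED, ...)` decorator instead of being called directly.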

ON_BACKCHANNEL

Fires when the SEMANTIC strategy detects a backchannel (not a real interruption):

@kit.hook(HookTrigger.ON_BACKCHANNEL, execution=HookExecution.ASYNC)
async def on_backchannel(event, ctx):
    # event is a BackchannelEvent
    print(f"Backchannel: '{event.text}' (confidence={event.confidence})")

Grace Period: allow_during_first_ms

The allow_during_first_ms field prevents interruptions during the first N milliseconds of playback. This is useful for:

  • Echo prevention: When AEC isn't perfect, the first few hundred ms of playback can echo back and trigger false interruptions
  • Critical content: Ensure the AI's opening words are always heard

InterruptionConfig(
    strategy=InterruptionStrategy.IMMEDIATE,
    allow_during_first_ms=500,  # Don't interrupt during first 500ms
)
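
Despite its name, allow_during_first_ms defines a window in which interruptions are suppressed. The sketch below spells out that reading as a gate applied before any strategy decision. It is an illustration under assumptions: `in_grace_period` and `gate_interruption` are hypothetical helpers, and treating the boundary as exclusive (playback at exactly N ms is interruptible) is a guess, not documented behavior.

```python
from __future__ import annotations


def in_grace_period(playback_position_ms: float, allow_during_first_ms: float) -> bool:
    """True while playback is still inside the protected opening window."""
    return playback_position_ms < allow_during_first_ms


def gate_interruption(
    strategy_wants_interrupt: bool,
    playback_position_ms: float,
    allow_during_first_ms: float = 0,
) -> bool:
    """Suppress an interruption that falls inside the grace period."""
    if in_grace_period(playback_position_ms, allow_during_first_ms):
        return False
    return strategy_wants_interrupt


print(gate_interruption(True, playback_position_ms=200, allow_during_first_ms=500))
print(gate_interruption(True, playback_position_ms=800, allow_during_first_ms=500))
```

With the default of 0 the window is empty, so the gate never suppresses anything.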

Legacy Compatibility

The older enable_barge_in and barge_in_threshold_ms parameters still work and map to InterruptionConfig internally:

# Legacy (still works):
VoiceChannel("voice", enable_barge_in=True, barge_in_threshold_ms=300)
# Equivalent to:
# InterruptionConfig(strategy=IMMEDIATE, allow_during_first_ms=300)

# Legacy disabled:
VoiceChannel("voice", enable_barge_in=False)
# Equivalent to:
# InterruptionConfig(strategy=DISABLED)
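
The two equivalences above can be restated as a small translation function, which may help when migrating call sites. This is a sketch of the documented mapping only: `legacy_to_config_kwargs` is a hypothetical helper, strategies are returned as plain strings rather than enum members so the example runs without RoomKit, and the default of 0 for barge_in_threshold_ms is an assumption.

```python
from __future__ import annotations

from typing import Any


def legacy_to_config_kwargs(
    enable_barge_in: bool,
    barge_in_threshold_ms: int = 0,  # assumed default, not confirmed by the docs
) -> dict[str, Any]:
    """Translate legacy barge-in flags into InterruptionConfig-style keyword arguments."""
    if not enable_barge_in:
        return {"strategy": "DISABLED"}
    return {
        "strategy": "IMMEDIATE",
        "allow_during_first_ms": barge_in_threshold_ms,
    }


print(legacy_to_config_kwargs(True, 300))
print(legacy_to_config_kwargs(False))
```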

Warning

Prefer the explicit interruption parameter over legacy fields for new code.

Testing

Use MockBackchannelDetector for deterministic tests:

from __future__ import annotations

from roomkit.voice.interruption import InterruptionConfig, InterruptionHandler, InterruptionStrategy
from roomkit.voice.pipeline.backchannel import BackchannelDecision
from roomkit.voice.pipeline.backchannel.mock import MockBackchannelDetector

# Test CONFIRMED strategy
config = InterruptionConfig(strategy=InterruptionStrategy.CONFIRMED, min_speech_ms=300)
handler = InterruptionHandler(config)
decision = handler.evaluate(playback_position_ms=500, speech_duration_ms=400)
assert decision.should_interrupt

# Test SEMANTIC with backchannel
detector = MockBackchannelDetector(
    decisions=[BackchannelDecision(is_backchannel=True, confidence=0.95)]
)
config = InterruptionConfig(
    strategy=InterruptionStrategy.SEMANTIC,
    backchannel_detector=detector,
)
handler = InterruptionHandler(config)
decision = handler.evaluate(playback_position_ms=500, speech_duration_ms=300, speech_text="uh-huh")
assert not decision.should_interrupt
assert decision.is_backchannel