Voice Interruption Handling¶
RoomKit provides four interruption strategies to control what happens when a user speaks while the AI is talking. This guide covers configuration, backchannel detection, and practical patterns for each strategy.
Quick Start¶
from __future__ import annotations
from roomkit.channels import VoiceChannel
from roomkit.voice.interruption import InterruptionConfig, InterruptionStrategy
voice = VoiceChannel(
"voice",
stt=stt,
tts=tts,
backend=backend,
interruption=InterruptionConfig(
strategy=InterruptionStrategy.CONFIRMED,
min_speech_ms=300,
),
)
Strategy Comparison¶
| Strategy | Behavior | Use Case | False Positives |
|---|---|---|---|
| IMMEDIATE | Cancel TTS on any speech detection | IVR, command interfaces | Highest |
| CONFIRMED | Wait for min_speech_ms before interrupting | General conversation | Low |
| SEMANTIC | Use backchannel detection to distinguish "uh-huh" from real interruptions | Customer support, therapy | Lowest |
| DISABLED | Ignore user speech during playback | Announcements, legal disclaimers | None |
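The differences between the four strategies can be summarized as decision logic. The sketch below is illustrative only, not RoomKit's internal implementation; it models the table above as a single function over speech duration and (for SEMANTIC) a backchannel flag.

```python
def should_interrupt(
    strategy: str,
    speech_ms: float,
    is_backchannel: bool = False,
    min_speech_ms: float = 300,
) -> bool:
    """Simplified model of the four strategies in the table above."""
    if strategy == "DISABLED":
        # User speech during playback is ignored entirely.
        return False
    if strategy == "IMMEDIATE":
        # Any detected speech cancels TTS.
        return True
    if strategy == "CONFIRMED":
        # Require sustained speech before interrupting.
        return speech_ms >= min_speech_ms
    if strategy == "SEMANTIC":
        # Like CONFIRMED, but backchannels ("uh-huh") don't interrupt.
        return speech_ms >= min_speech_ms and not is_backchannel
    raise ValueError(f"unknown strategy: {strategy}")
```

This also makes the false-positive column intuitive: IMMEDIATE fires on everything, CONFIRMED filters short noise, and SEMANTIC additionally filters acknowledgements.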
Configuration¶
from roomkit.voice.interruption import InterruptionConfig, InterruptionStrategy
config = InterruptionConfig(
strategy=InterruptionStrategy.CONFIRMED,
min_speech_ms=300,
allow_during_first_ms=0,
flush_partial_tts=True,
keep_partial_transcript=True,
backchannel_detector=None,
)
| Field | Default | Description |
|---|---|---|
| strategy | CONFIRMED | Which interruption strategy to use |
| min_speech_ms | 300 | Minimum speech duration (ms) before confirming interruption (CONFIRMED strategy) |
| allow_during_first_ms | 0 | Grace period: don't interrupt during the first N ms of playback |
| flush_partial_tts | True | Flush buffered TTS audio on interruption |
| keep_partial_transcript | True | Keep the partial transcript when interrupted |
| backchannel_detector | None | Required for SEMANTIC strategy |
Strategy Details¶
IMMEDIATE¶
Cancels TTS playback the instant any speech is detected. Best for systems where responsiveness matters more than naturalness.
voice = VoiceChannel(
"voice",
stt=stt, tts=tts, backend=backend,
interruption=InterruptionConfig(
strategy=InterruptionStrategy.IMMEDIATE,
allow_during_first_ms=200, # Prevent echo-triggered interruptions
),
)
Tip
Use allow_during_first_ms to add a grace period at the start of playback. This prevents the speaker's own audio echoing back and triggering a false interruption.
CONFIRMED¶
Waits until the user has been speaking for at least min_speech_ms before confirming the interruption. Filters out brief noise and accidental sounds.
voice = VoiceChannel(
"voice",
stt=stt, tts=tts, backend=backend,
interruption=InterruptionConfig(
strategy=InterruptionStrategy.CONFIRMED,
min_speech_ms=300, # Require 300ms of continuous speech
),
)
SEMANTIC¶
Uses a BackchannelDetector to distinguish backchannels ("uh-huh", "yeah", "ok") from real interruptions. Falls back to CONFIRMED if no detector is configured.
voice = VoiceChannel(
"voice",
stt=stt, tts=tts, backend=backend,
interruption=InterruptionConfig(
strategy=InterruptionStrategy.SEMANTIC,
backchannel_detector=my_detector,
),
)
DISABLED¶
Ignores all user speech during playback. The AI finishes speaking before processing any user input.
voice = VoiceChannel(
"voice",
stt=stt, tts=tts, backend=backend,
interruption=InterruptionConfig(
strategy=InterruptionStrategy.DISABLED,
),
)
Backchannel Detection¶
The SEMANTIC strategy requires a BackchannelDetector to classify short utterances:
from __future__ import annotations
from roomkit.voice.pipeline.backchannel import (
BackchannelContext,
BackchannelDecision,
BackchannelDetector,
)
class KeywordBackchannelDetector(BackchannelDetector):
"""Simple keyword-based backchannel detector."""
BACKCHANNELS = {"uh-huh", "yeah", "ok", "mhm", "right", "sure", "hmm", "yes"}
@property
def name(self) -> str:
return "keyword_backchannel"
def classify(self, context: BackchannelContext) -> BackchannelDecision:
text = (context.transcript or "").strip().lower()
is_bc = text in self.BACKCHANNELS and context.speech_duration_ms < 1000
return BackchannelDecision(
is_backchannel=is_bc,
confidence=0.9 if is_bc else 0.1,
label="acknowledgement" if is_bc else None,
)
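To see the classification rule in isolation, here is a standalone sketch that mirrors the detector's logic using a stand-in dataclass, so it runs without RoomKit installed. The `Context` type below is a hypothetical substitute for the real BackchannelContext; its field names follow the table in the next section.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Context:
    """Stand-in for BackchannelContext, reduced to the fields the rule uses."""
    transcript: str | None
    speech_duration_ms: float


BACKCHANNELS = {"uh-huh", "yeah", "ok", "mhm", "right", "sure", "hmm", "yes"}


def is_backchannel(ctx: Context) -> bool:
    # Normalize the transcript, then require both a known keyword
    # and a short utterance (under one second).
    text = (ctx.transcript or "").strip().lower()
    return text in BACKCHANNELS and ctx.speech_duration_ms < 1000


print(is_backchannel(Context("Uh-huh", 400)))      # True: short acknowledgement
print(is_backchannel(Context("wait, stop", 400)))  # False: real interruption
```

Note that a long "yeah" (over 1000 ms) is deliberately not treated as a backchannel: drawn-out utterances are more likely to be the start of a real turn.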
BackchannelContext¶
The detector receives rich context for classification:
| Field | Type | Description |
|---|---|---|
| transcript | str \| None | Transcribed text of the utterance |
| speech_duration_ms | float | Duration of the utterance |
| audio_bytes | bytes \| None | Raw audio for acoustic classifiers |
| bot_speech_progress | float | How far into the bot's utterance (0.0-1.0) |
| session_id | str | Voice session ID |
| metadata | dict | Additional context |
Pipeline Integration¶
Interruption config can also be set via AudioPipelineConfig:
from __future__ import annotations
from roomkit.channels import VoiceChannel
from roomkit.voice.interruption import InterruptionConfig, InterruptionStrategy
from roomkit.voice.pipeline import AudioPipelineConfig
pipeline = AudioPipelineConfig(
vad=vad,
denoiser=denoiser,
backchannel_detector=my_detector,
interruption=InterruptionConfig(
strategy=InterruptionStrategy.SEMANTIC,
min_speech_ms=300,
),
)
voice = VoiceChannel(
"voice",
stt=stt, tts=tts,
backend=backend,
pipeline=pipeline,
)
Configuration Priority
1. Explicit interruption parameter on VoiceChannel (highest)
2. pipeline.interruption from AudioPipelineConfig
3. Legacy enable_barge_in + barge_in_threshold_ms (fallback)
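The resolution order can be sketched as a small helper. This function and its dict shapes are illustrative assumptions, not RoomKit API; only the precedence it encodes comes from the priority list above.

```python
def resolve_interruption(explicit=None, pipeline=None, enable_barge_in=None):
    """Pick the effective interruption config by precedence."""
    if explicit is not None:
        # 1. Explicit interruption= parameter on VoiceChannel wins.
        return explicit
    if pipeline is not None:
        # 2. pipeline.interruption from AudioPipelineConfig.
        return pipeline
    if enable_barge_in is False:
        # 3. Legacy fallback: barge-in disabled entirely.
        return {"strategy": "DISABLED"}
    # Otherwise fall back to the documented default strategy.
    return {"strategy": "CONFIRMED"}
```

For example, a channel constructed with both an explicit config and a pipeline config uses the explicit one; the pipeline's setting is ignored.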
Hook Integration¶
Three hooks fire during interruption events:
ON_BARGE_IN¶
Fires when the user successfully interrupts the AI:
from __future__ import annotations
from roomkit import HookTrigger, HookExecution
@kit.hook(HookTrigger.ON_BARGE_IN, execution=HookExecution.ASYNC)
async def on_barge_in(event, ctx):
# event is a BargeInEvent
print(f"User interrupted at {event.audio_position_ms}ms")
print(f"AI was saying: {event.interrupted_text}")
ON_TTS_CANCELLED¶
Fires when TTS playback is cancelled for any reason:
@kit.hook(HookTrigger.ON_TTS_CANCELLED, execution=HookExecution.ASYNC)
async def on_tts_cancelled(event, ctx):
# event is a TTSCancelledEvent
# event.reason: "barge_in" | "explicit" | "disconnect" | "error"
if event.reason == "barge_in":
print(f"TTS stopped at {event.audio_position_ms}ms due to barge-in")
ON_BACKCHANNEL¶
Fires when the SEMANTIC strategy detects a backchannel (not a real interruption):
@kit.hook(HookTrigger.ON_BACKCHANNEL, execution=HookExecution.ASYNC)
async def on_backchannel(event, ctx):
# event is a BackchannelEvent
print(f"Backchannel: '{event.text}' (confidence={event.confidence})")
Grace Period: allow_during_first_ms¶
The allow_during_first_ms field prevents interruptions during the first N milliseconds of playback. This is useful for:
- Echo prevention: When AEC isn't perfect, the first few hundred ms of playback can echo back and trigger false interruptions
- Critical content: Ensure the AI's opening words are always heard
InterruptionConfig(
strategy=InterruptionStrategy.IMMEDIATE,
allow_during_first_ms=500, # Don't interrupt during first 500ms
)
Legacy Compatibility¶
The older enable_barge_in and barge_in_threshold_ms parameters still work and map to InterruptionConfig internally:
# Legacy (still works):
VoiceChannel("voice", enable_barge_in=True, barge_in_threshold_ms=300)
# Equivalent to:
# InterruptionConfig(strategy=IMMEDIATE, allow_during_first_ms=300)
# Legacy disabled:
VoiceChannel("voice", enable_barge_in=False)
# Equivalent to:
# InterruptionConfig(strategy=DISABLED)
Warning
Prefer the explicit interruption parameter over legacy fields for new code.
Testing¶
Use MockBackchannelDetector for deterministic tests:
from __future__ import annotations
from roomkit.voice.interruption import InterruptionConfig, InterruptionHandler, InterruptionStrategy
from roomkit.voice.pipeline.backchannel import BackchannelDecision
from roomkit.voice.pipeline.backchannel.mock import MockBackchannelDetector
# Test CONFIRMED strategy
config = InterruptionConfig(strategy=InterruptionStrategy.CONFIRMED, min_speech_ms=300)
handler = InterruptionHandler(config)
decision = handler.evaluate(playback_position_ms=500, speech_duration_ms=400)
assert decision.should_interrupt
# Test SEMANTIC with backchannel
detector = MockBackchannelDetector(
decisions=[BackchannelDecision(is_backchannel=True, confidence=0.95)]
)
config = InterruptionConfig(
strategy=InterruptionStrategy.SEMANTIC,
backchannel_detector=detector,
)
handler = InterruptionHandler(config)
decision = handler.evaluate(playback_position_ms=500, speech_duration_ms=300, speech_text="uh-huh")
assert not decision.should_interrupt
assert decision.is_backchannel