Audio Pipeline¶
Audio processing pipeline for voice channels. See the Audio Pipeline Stages guide for usage examples.
Pipeline¶
AudioPipeline ¶
AudioPipeline(config, *, backend_capabilities=NONE, backend_feeds_aec_reference=False)
Orchestrates audio frame processing through pipeline stages.
Inbound processing order
[Resampler] -> [Recorder tap] -> [DTMF] -> [AEC] -> [AGC] -> [Denoiser] -> [VAD] -> [Diarization]
Outbound processing order
[PostProcessors] -> [Recorder tap] -> AEC.feed_reference -> [Resampler]
AEC and AGC stages are skipped when the backend declares NATIVE_AEC / NATIVE_AGC capabilities.
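The skip rule above can be illustrated with a small, self-contained sketch. The `Capability` flag and `build_inbound_chain` helper below are hypothetical stand-ins written for this illustration, not the library's own types; only the stage order and the NATIVE_AEC / NATIVE_AGC names come from the description above.

```python
from enum import Flag, auto
from typing import Callable


class Capability(Flag):
    """Hypothetical stand-in for the backend capability flags."""
    NONE = 0
    NATIVE_AEC = auto()
    NATIVE_AGC = auto()


Stage = Callable[[bytes], bytes]

# Inbound order as documented above.
INBOUND_ORDER = [
    "resampler", "recorder", "dtmf", "aec", "agc", "denoiser", "vad", "diarization",
]


def build_inbound_chain(stages: dict[str, Stage], caps: Capability) -> list[str]:
    """Return the names of the configured stages that would run, in order."""
    skipped = set()
    if Capability.NATIVE_AEC in caps:
        skipped.add("aec")  # backend already cancels echo
    if Capability.NATIVE_AGC in caps:
        skipped.add("agc")  # backend already controls gain
    return [name for name in INBOUND_ORDER if name in stages and name not in skipped]


def identity(data: bytes) -> bytes:
    return data


configured = {"aec": identity, "agc": identity, "vad": identity}
print(build_inbound_chain(configured, Capability.NONE))
print(build_inbound_chain(configured, Capability.NATIVE_AEC))
```

With no native capabilities all three configured stages remain in the chain; declaring NATIVE_AEC drops only the AEC stage.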
on_speech_frame ¶
Register callback for processed audio frames during speech.
on_processed_frame ¶
Register callback for every processed inbound frame.
Fires after all pipeline stages (AEC, denoiser, VAD, etc.) for every frame, regardless of speech state. Used by continuous STT streaming when no local VAD is configured.
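The difference between the two callbacks can be sketched with a minimal dispatcher. `CallbackDemo` and its `dispatch` method are hypothetical names for this illustration; the real callback signatures may carry additional arguments such as the session.

```python
from typing import Callable

FrameCallback = Callable[[bytes], None]


class CallbackDemo:
    """Hypothetical dispatcher mirroring the two registration methods."""

    def __init__(self) -> None:
        self._speech_callbacks: list[FrameCallback] = []
        self._processed_callbacks: list[FrameCallback] = []

    def on_speech_frame(self, callback: FrameCallback) -> None:
        self._speech_callbacks.append(callback)

    def on_processed_frame(self, callback: FrameCallback) -> None:
        self._processed_callbacks.append(callback)

    def dispatch(self, frame: bytes, in_speech: bool) -> None:
        # Every processed frame reaches the "processed" callbacks.
        for callback in self._processed_callbacks:
            callback(frame)
        # Only frames seen during speech reach the "speech" callbacks.
        if in_speech:
            for callback in self._speech_callbacks:
                callback(frame)


demo = CallbackDemo()
speech_frames: list[bytes] = []
all_frames: list[bytes] = []
demo.on_speech_frame(speech_frames.append)
demo.on_processed_frame(all_frames.append)

demo.dispatch(b"silence", in_speech=False)
demo.dispatch(b"speech", in_speech=True)
```

After the two dispatches, `all_frames` holds both frames while `speech_frames` holds only the one delivered during speech.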
set_parent_span ¶
Set the parent span (VOICE_SESSION) for pipeline spans.
process_frame ¶
Process a single inbound audio frame through the pipeline.
Backwards-compatible alias for process_inbound().
process_inbound ¶
Process a single inbound audio frame through the pipeline.
[Resampler] -> [Recorder tap] -> [DTMF] -> [AEC] -> [AGC] -> [Denoiser] -> [VAD] -> [Diarization]
process_outbound ¶
Process a single outbound audio frame through the pipeline.
[PostProcessors] -> [Recorder tap] -> AEC.feed_reference -> [Resampler]
enable_playback_aec_feed ¶
Mark that AEC reference is fed at playback time.
When called, process_outbound() skips its own aec.feed_reference() call to avoid double-feeding the AEC with misaligned timing.
feed_aec_reference ¶
Feed an AEC reference frame directly (from speaker output).
Called by the backend's speaker callback at playback time so the AEC has time-aligned reference for echo cancellation.
Thread-safety: may be called from the audio I/O thread. It uses a resampler instance separate from the one used by process_outbound() to avoid thread-safety issues.
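How enable_playback_aec_feed() and feed_aec_reference() interact can be sketched as follows. `OutboundSketch` is a hypothetical stand-in that models only the double-feed guard; resampling, recording and the AEC itself are left out.

```python
class OutboundSketch:
    """Hypothetical model of the AEC reference feed guard."""

    def __init__(self) -> None:
        self._playback_feed = False
        self.reference_frames: list[bytes] = []

    def enable_playback_aec_feed(self) -> None:
        # The backend promises to feed the reference at playback time.
        self._playback_feed = True

    def feed_aec_reference(self, frame: bytes) -> None:
        # In this sketch the "AEC" just records what it was fed.
        self.reference_frames.append(frame)

    def process_outbound(self, frame: bytes) -> bytes:
        # Feed the reference here only when nobody feeds it at playback time.
        if not self._playback_feed:
            self.feed_aec_reference(frame)
        return frame


pipeline = OutboundSketch()
pipeline.process_outbound(b"a")      # fed by process_outbound
pipeline.enable_playback_aec_feed()
pipeline.process_outbound(b"b")      # not fed here any more
pipeline.feed_aec_reference(b"b")    # fed once, by the speaker callback
```

Each frame reaches the reference buffer exactly once, whichever path is active.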
on_session_active ¶
Called when a voice session becomes active.
Cleans up stale state for this session and starts recording if configured.
on_session_ended ¶
Called when a voice session ends.
Stops recording and debug taps if active.
AudioPipelineConfig
dataclass
¶
AudioPipelineConfig(vad=None, denoiser=None, diarization=None, postprocessors=list(), vad_config=None, aec=None, agc=None, agc_config=None, dtmf=None, turn_detector=None, backchannel_detector=None, recorder=None, recording_config=None, interruption=None, resampler=None, contract=None, debug_taps=None, telemetry=None)
Configuration for the audio processing pipeline.
All stages are optional. At least one provider should be set for the pipeline to be useful.
Typical combinations:
- VoiceChannel (STT path): vad (+ optional denoiser/diarization)
- RealtimeVoiceChannel (speech-to-speech): denoiser and/or diarization (VAD not needed; the AI provider handles turn detection)
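The two combinations above can be sketched with a stand-in dataclass. `ConfigSketch` is hypothetical and mirrors only three of the fields of AudioPipelineConfig; the string values stand in for real provider instances.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class ConfigSketch:
    """Hypothetical subset of the pipeline config; every stage is optional."""
    vad: Any = None
    denoiser: Any = None
    diarization: Any = None

    def configured_stages(self) -> list[str]:
        """Names of the stages that have a provider set."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# STT path: local VAD drives turn detection, denoiser is optional.
stt_path = ConfigSketch(vad="my-vad", denoiser="my-denoiser")

# Speech-to-speech path: no local VAD, the AI provider handles turns.
realtime_path = ConfigSketch(denoiser="my-denoiser")

print(stt_path.configured_stages())
print(realtime_path.configured_stages())
```

A config for which `configured_stages()` is empty would correspond to a pipeline with nothing useful to do.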
diarization
class-attribute
instance-attribute
¶
Optional speaker diarization applied after VAD.
postprocessors
class-attribute
instance-attribute
¶
Optional postprocessors applied on the outbound path.
vad_config
class-attribute
instance-attribute
¶
Optional VAD-specific configuration override.
agc_config
class-attribute
instance-attribute
¶
Optional AGC-specific configuration override.
dtmf
class-attribute
instance-attribute
¶
Optional DTMF tone detector (runs in parallel with main chain).
turn_detector
class-attribute
instance-attribute
¶
Optional post-STT turn completion detector.
backchannel_detector
class-attribute
instance-attribute
¶
Optional backchannel detector for semantic interruption strategy.
recording_config
class-attribute
instance-attribute
¶
Optional recording configuration.
interruption
class-attribute
instance-attribute
¶
Optional interruption (barge-in) configuration.
resampler
class-attribute
instance-attribute
¶
Optional resampler provider for format conversion.
contract
class-attribute
instance-attribute
¶
Optional input/output format contract.
debug_taps
class-attribute
instance-attribute
¶
Optional diagnostic audio capture at pipeline stage boundaries.
telemetry
class-attribute
instance-attribute
¶
Optional telemetry provider for pipeline metrics.
AudioFrame
dataclass
¶
A single frame of inbound audio for pipeline processing.
AudioFrame flows through the inbound pipeline stages, for example denoiser -> VAD -> diarization; the full stage order is listed under AudioPipeline. Each stage may annotate the metadata dict with its results.
This is distinct from AudioChunk, which is used for outbound TTS audio streaming.
VAD (Voice Activity Detection)¶
VADProvider ¶
Bases: ABC
Abstract base class for Voice Activity Detection providers.
process
abstractmethod
¶
Process an audio frame and optionally return a VAD event.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frame | AudioFrame | The audio frame to analyse. | required |
Returns:
| Type | Description |
|---|---|
| VADEvent \| None | A VADEvent if a state transition occurred, else None. |
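The contract of process() (return an event only on a state transition, otherwise None) can be sketched with a simple energy-based detector. `Frame`, `Event`, `EventType` and `EnergyVAD` are hypothetical stand-ins, and the sketch assumes 16-bit little-endian mono PCM; it counts silent frames rather than milliseconds and is not how any particular provider works.

```python
import math
import struct
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EventType(str, Enum):  # stand-in for VADEventType
    SPEECH_START = "speech_start"
    SPEECH_END = "speech_end"


@dataclass
class Frame:  # stand-in for AudioFrame (16-bit mono PCM assumed)
    data: bytes


@dataclass
class Event:  # stand-in for VADEvent
    type: EventType


def rms_db(pcm: bytes) -> float:
    """RMS level in dB relative to full scale; -96.0 for silence."""
    count = len(pcm) // 2
    if count == 0:
        return -96.0
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    rms = math.sqrt(sum(s * s for s in samples) / count)
    return 20 * math.log10(rms / 32768) if rms > 0 else -96.0


class EnergyVAD:
    def __init__(self, threshold_db: float = -40.0, silence_frames: int = 3) -> None:
        self.threshold_db = threshold_db
        self.silence_frames = silence_frames
        self._speaking = False
        self._quiet = 0

    def process(self, frame: Frame) -> Optional[Event]:
        if rms_db(frame.data) > self.threshold_db:
            self._quiet = 0
            if not self._speaking:
                self._speaking = True
                return Event(EventType.SPEECH_START)
        elif self._speaking:
            self._quiet += 1
            if self._quiet >= self.silence_frames:
                self._speaking = False
                self._quiet = 0
                return Event(EventType.SPEECH_END)
        return None  # no state transition on this frame


loud = Frame(struct.pack("<160h", *([8000] * 160)))
quiet = Frame(bytes(320))
vad = EnergyVAD()
events = [vad.process(f) for f in [quiet, loud, loud, quiet, quiet, quiet]]
```

Only the second frame (speech begins) and the sixth frame (third consecutive silent frame) produce an event; every other call returns None.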
VADConfig
dataclass
¶
Configuration for VAD processing.
silence_threshold_ms
class-attribute
instance-attribute
¶
Milliseconds of silence before triggering SPEECH_END.
speech_pad_ms
class-attribute
instance-attribute
¶
Padding added around detected speech segments.
min_speech_duration_ms
class-attribute
instance-attribute
¶
Minimum speech duration to trigger events.
extra
class-attribute
instance-attribute
¶
Provider-specific configuration.
VADEvent
dataclass
¶
Event produced by a VAD provider.
audio_bytes
class-attribute
instance-attribute
¶
Speech audio. Set on SPEECH_START (pre-roll buffer) and SPEECH_END (full accumulated speech including pre-roll).
duration_ms
class-attribute
instance-attribute
¶
Duration in milliseconds (speech or silence).
level_db
class-attribute
instance-attribute
¶
Audio level in dB (set on AUDIO_LEVEL).
VADEventType ¶
Bases: StrEnum
Types of VAD events.
MockVADProvider ¶
Bases: VADProvider
Mock VAD provider that returns a preconfigured sequence of events.
Example

    from roomkit.voice.pipeline.vad.base import VADEvent, VADEventType

    events = [
        VADEvent(type=VADEventType.SPEECH_START),
        None,
        VADEvent(type=VADEventType.SPEECH_END, audio_bytes=b"audio"),
    ]
    vad = MockVADProvider(events=events)
Denoiser¶
DenoiserProvider ¶
Bases: ABC
Abstract base class for audio denoising providers.
process
abstractmethod
¶
Denoise an audio frame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frame | AudioFrame | The noisy audio frame. | required |
Returns:
| Type | Description |
|---|---|
| AudioFrame | A new or modified AudioFrame with reduced noise. |
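As a rough illustration of the "frame in, cleaner frame out" contract, the sketch below applies a crude noise gate that zeroes low-amplitude samples. This is a deliberately simple technique chosen for the example, not what a real denoiser provider does; `Frame` and `noise_gate` are hypothetical names, and 16-bit little-endian mono PCM is assumed.

```python
import struct
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:  # stand-in for AudioFrame (16-bit mono PCM assumed)
    data: bytes


def noise_gate(frame: Frame, floor: int = 500) -> Frame:
    """Return a new frame with samples below `floor` (absolute value) zeroed."""
    count = len(frame.data) // 2
    samples = struct.unpack(f"<{count}h", frame.data[: count * 2])
    gated = [s if abs(s) >= floor else 0 for s in samples]
    # A new frame is returned; the input frame is left untouched.
    return replace(frame, data=struct.pack(f"<{count}h", *gated))


noisy = Frame(struct.pack("<4h", 100, -200, 4000, -6000))
clean = noise_gate(noisy)
print(struct.unpack("<4h", clean.data))
```

Returning a new frame rather than mutating the input keeps earlier stages, such as a recorder tap, free to hold on to the original audio.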
MockDenoiserProvider ¶
Bases: DenoiserProvider
Mock denoiser that passes frames through unchanged.
Tracks processed frames for test assertions.
Diarization¶
DiarizationProvider ¶
Bases: ABC
Abstract base class for speaker diarization providers.
process
abstractmethod
¶
Analyse an audio frame for speaker identity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frame | AudioFrame | The audio frame to analyse. | required |
Returns:
| Type | Description |
|---|---|
| DiarizationResult \| None | A DiarizationResult if a speaker was identified, else None. |
DiarizationResult
dataclass
¶
Result from a diarization provider.
MockDiarizationProvider ¶
Bases: DiarizationProvider
Mock diarization provider that returns a preconfigured sequence of results.
Example

    results = [
        DiarizationResult(speaker_id="speaker_0", confidence=0.9, is_new_speaker=True),
        DiarizationResult(speaker_id="speaker_0", confidence=0.95, is_new_speaker=False),
    ]
    diarizer = MockDiarizationProvider(results=results)
TTS Stream Filters¶
TTSStreamFilter ¶
Bases: ABC
Base class for TTS text filters.
Supports both streaming (chunk-by-chunk via feed()/flush()) and non-streaming (full text via __call__()) usage.
Subclasses must implement feed(), flush(), and reset(). The default __call__() delegates to feed() + flush(), but subclasses may override it with a more efficient implementation (e.g. a single regex pass).
StripBrackets ¶
Bases: TTSStreamFilter
Strip all [...] bracketed content from TTS text.
A simpler variant that catches markers like [Respond in French],
[laughs], [thinking], etc.
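A streaming bracket filter has to cope with markers split across chunks. The sketch below is one way to do that with a depth counter; `BracketStripper` is a hypothetical stand-in rather than the library's StripBrackets, and its handling of nested or unbalanced brackets is an assumption.

```python
class BracketStripper:
    """Hypothetical streaming filter that drops [...] content."""

    def __init__(self) -> None:
        self._depth = 0

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            if ch == "[":
                self._depth += 1  # entering a bracketed marker
            elif ch == "]" and self._depth > 0:
                self._depth -= 1  # leaving a bracketed marker
            elif self._depth == 0:
                out.append(ch)    # outside brackets: pass through
        return "".join(out)

    def flush(self) -> str:
        # An unterminated marker at end of stream is discarded.
        self._depth = 0
        return ""

    def reset(self) -> None:
        self._depth = 0


stripper = BracketStripper()
spoken = stripper.feed("Hello [lau") + stripper.feed("ghs] world") + stripper.flush()
```

Because the depth counter persists between feed() calls, the marker `[laughs]` is removed even though it arrives in two chunks.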
StripInternalTags ¶
Bases: TTSStreamFilter
Strip [internal]...[/internal] and [internal: ...] blocks.
Handles two formats that AI models commonly produce:
- Paired tags: [internal]reasoning here[/internal] spoken text
- Single bracket: [internal: reasoning here] spoken text
In streaming mode, buffers text when [internal is detected and
discards everything up to the matching close. Text outside tags is
passed through immediately.
In non-streaming mode (__call__), a single regex removes all
tagged blocks.
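The non-streaming path can be sketched with a single regular expression covering both formats. The pattern and the `strip_internal` helper are assumptions made for this illustration; the library's actual expression may differ, for example in how it treats surrounding whitespace.

```python
import re

# Paired tags (non-greedy, may span lines) or a single [internal: ...] bracket.
_INTERNAL = re.compile(
    r"\[internal\].*?\[/internal\]|\[internal:[^\]]*\]",
    re.DOTALL,
)


def strip_internal(text: str) -> str:
    """Remove all internal blocks from a complete piece of text."""
    return _INTERNAL.sub("", text)


paired = strip_internal("[internal]reasoning here[/internal] spoken text")
single = strip_internal("[internal: reasoning here] spoken text")
```

Both calls leave only the spoken portion, including the space that followed the removed block.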
Events & Callbacks¶
SpeakerChangeEvent
dataclass
¶
Speaker change detected by diarization.
This event is fired when the audio pipeline's diarization stage detects a different speaker than the previous frame.
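The "different speaker than the previous frame" rule can be sketched as a small tracker. `SpeakerChangeTracker` is hypothetical, and treating the first identified speaker as a change is an assumption of this sketch.

```python
from typing import Optional


class SpeakerChangeTracker:
    """Hypothetical helper that reports when the speaker id changes."""

    def __init__(self) -> None:
        self._current: Optional[str] = None

    def update(self, speaker_id: Optional[str]) -> bool:
        """Return True when speaker_id differs from the last identified speaker."""
        if speaker_id is None or speaker_id == self._current:
            return False  # no result for this frame, or same speaker
        self._current = speaker_id
        return True


tracker = SpeakerChangeTracker()
changes = [tracker.update(s) for s in ["speaker_0", "speaker_0", None, "speaker_1"]]
```

Frames without a diarization result (None) neither trigger a change nor reset the remembered speaker.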
BargeInCallback
module-attribute
¶
BargeInCallback = Callable[[VoiceSession], Any]
Callback for barge-in detection: (session).
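A callable matching this alias takes the session as its only argument. In the sketch below `Session` is a hypothetical stand-in for VoiceSession, and how the callback gets registered is left out because it is not described here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Session:  # stand-in for VoiceSession
    id: str


BargeIn = Callable[[Session], Any]  # same shape as BargeInCallback

interrupted: list[str] = []


def stop_playback(session: Session) -> None:
    # A real handler would typically cancel TTS playback for this session.
    interrupted.append(session.id)


callback: BargeIn = stop_playback
callback(Session(id="call-1"))
```

Since the return type is Any, both plain functions and coroutine functions fit the alias.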