ai|coustics Denoiser

The AICousticsDenoiserProvider integrates ai|coustics Quail speech enhancement models into the RoomKit audio pipeline. Quail provides neural noise suppression, dereverberation, and speaker isolation (Voice Focus), optimized for STT/ASR accuracy.

Key characteristics:

  • On-device — runs locally via the aic-sdk Python package (Rust + PyO3)
  • CPU-only — no GPU required
  • ~2 ms inference per 10 ms frame
  • 30 ms algorithmic delay

Installation

pip install roomkit[aicoustics]

Set your license key as an environment variable:

export AIC_SDK_LICENSE="your-license-key"

Or pass it directly in the config:

config = AICousticsDenoiserConfig(license_key="your-license-key")
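
The two options above imply a lookup order: an explicit license_key wins, and the AIC_SDK_LICENSE environment variable is the fallback. A minimal sketch of that order follows. `resolve_license_key` is a hypothetical helper written for illustration, not part of RoomKit or aic-sdk, and raising an error when no key is found is an assumption about reasonable behavior rather than documented provider behavior.

```python
import os


def resolve_license_key(configured_key: str = "") -> str:
    """Hypothetical helper: prefer the explicit key, else AIC_SDK_LICENSE."""
    # An empty string counts as "not configured", matching the documented default.
    key = configured_key or os.environ.get("AIC_SDK_LICENSE", "")
    if not key:
        # Assumption: failing early is preferable to a late SDK error.
        raise ValueError(
            "No license key found: pass license_key or set AIC_SDK_LICENSE"
        )
    return key
```

Calling a check like this at startup surfaces a missing key before any audio reaches the pipeline.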

Quick Start

from roomkit.voice.pipeline.denoiser import (
    AICousticsDenoiserConfig,
    AICousticsDenoiserProvider,
)
from roomkit.voice.pipeline import AudioPipelineConfig

denoiser = AICousticsDenoiserProvider(
    AICousticsDenoiserConfig(model="quail-vf-2.0-l-16khz")
)
pipeline = AudioPipelineConfig(denoiser=denoiser)

Model Comparison

| Model | Size | Sample Rate | Features | Use Case |
|---|---|---|---|---|
| quail-vf-2.0-l-16khz | Large | 16 kHz | Denoise + dereverb + Voice Focus | Best quality for voice AI / STT |
| quail-2.0-l-16khz | Large | 16 kHz | Denoise + dereverb | General enhancement without speaker isolation |
| quail-2.0-s-16khz | Small | 16 kHz | Denoise + dereverb | Lower latency, smaller footprint |

Models with "vf" in their name (Voice Focus) add speaker isolation: they suppress competing speakers and keep only the primary talker. This is especially useful in multi-party environments.

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "quail-vf-2.0-l-16khz" | Model identifier for download |
| model_dir | str | "./models" | Local cache directory for downloaded models |
| license_key | str | "" | SDK license key (falls back to AIC_SDK_LICENSE env var) |
| enhancement_level | float | 0.8 | Enhancement strength from 0.0 to 1.0 |
| num_channels | int | 1 | Audio channels (1 = mono, 2 = stereo) |
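
To make the defaults and ranges in the table concrete, here is a small stand-in that mirrors the documented fields and rejects out-of-range values. `DenoiserSettings` is a hypothetical class for illustration only. It is not the real AICousticsDenoiserConfig, and whether the real config validates its inputs this way is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenoiserSettings:
    """Hypothetical mirror of the documented config fields and defaults."""

    model: str = "quail-vf-2.0-l-16khz"
    model_dir: str = "./models"
    license_key: str = ""
    enhancement_level: float = 0.8
    num_channels: int = 1

    def __post_init__(self) -> None:
        # The documented range for enhancement strength is 0.0 to 1.0.
        if not 0.0 <= self.enhancement_level <= 1.0:
            raise ValueError("enhancement_level must be between 0.0 and 1.0")
        # Assumption: at least one channel is required (1 = mono, 2 = stereo).
        if self.num_channels < 1:
            raise ValueError("num_channels must be at least 1")
```

Validating early like this catches a value such as enhancement_level=1.5 at construction time instead of inside the audio loop.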

Enhancement Level Tuning

The enhancement_level parameter controls the aggressiveness of noise removal:

| Level | Style | Best For |
|---|---|---|
| 0.3–0.5 | Conservative | Mild background noise, preserves natural ambiance |
| 0.8 | Balanced | Recommended: best WER for voice AI workloads |
| 1.0 | Aggressive | Heavy noise environments, may introduce artifacts |

# Conservative: light touch
config = AICousticsDenoiserConfig(enhancement_level=0.5)

# Balanced: best for STT accuracy (default)
config = AICousticsDenoiserConfig(enhancement_level=0.8)

# Aggressive: noisy environments
config = AICousticsDenoiserConfig(enhancement_level=1.0)

Full Voice Pipeline Integration

from roomkit import RoomKit, VoiceChannel
from roomkit.voice.backends import LocalAudioBackend
from roomkit.voice.stt import DeepgramSTTProvider
from roomkit.voice.tts import ElevenLabsTTSProvider
from roomkit.voice.pipeline import AudioPipelineConfig
from roomkit.voice.pipeline.denoiser import (
    AICousticsDenoiserConfig,
    AICousticsDenoiserProvider,
)
from roomkit.voice.pipeline.vad import SherpaOnnxVADProvider

kit = RoomKit()

denoiser = AICousticsDenoiserProvider(
    AICousticsDenoiserConfig(
        model="quail-vf-2.0-l-16khz",
        enhancement_level=0.8,
    )
)

pipeline = AudioPipelineConfig(
    denoiser=denoiser,
    vad=SherpaOnnxVADProvider(model_path="silero_vad.onnx"),
)

voice = VoiceChannel(
    "voice",
    backend=LocalAudioBackend(),
    stt=DeepgramSTTProvider(...),
    tts=ElevenLabsTTSProvider(...),
    pipeline_config=pipeline,
)

kit.register_channel(voice)

Performance Notes

  • Inference latency: ~2 ms per 10 ms frame on modern CPUs
  • Algorithmic delay: 30 ms (3 frames at 10 ms)
  • CPU-only: no GPU required; uses Rust backend via PyO3
  • Model download: first call triggers a one-time model download to model_dir
  • Thread-safe: internal locking allows safe concurrent access
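
The figures above translate into a few derived numbers that help with capacity planning. The arithmetic below is a rough sketch: it assumes 16 kHz audio (the rate of the listed models) and treats added latency as algorithmic delay plus one frame of compute, ignoring any buffering or scheduling overhead.

```python
FRAME_MS = 10.0              # frame duration from the notes above
INFERENCE_MS = 2.0           # approximate per-frame inference time
ALGORITHMIC_DELAY_MS = 30.0  # 3 frames at 10 ms
SAMPLE_RATE_HZ = 16_000      # assumption: 16 kHz, as in the listed models

# Fraction of real time spent on inference (lower is better).
real_time_factor = INFERENCE_MS / FRAME_MS

# Samples in one frame and in the algorithmic delay window.
samples_per_frame = int(SAMPLE_RATE_HZ * FRAME_MS / 1000)
delay_samples = int(SAMPLE_RATE_HZ * ALGORITHMIC_DELAY_MS / 1000)

# Rough added latency: lookahead plus one frame of compute (assumption).
added_latency_ms = ALGORITHMIC_DELAY_MS + INFERENCE_MS

print(real_time_factor, samples_per_frame, delay_samples, added_latency_ms)
```

Under these assumptions the real-time factor is about 0.2, a frame holds 160 samples, the delay window spans 480 samples, and the denoiser adds roughly 32 ms to the pipeline.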

The provider buffers incoming audio internally to match Quail's expected frame size, so a pipeline frame size that differs from it needs no extra configuration.
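
The internal buffering described above can be pictured as a rebuffering step that accumulates incoming bytes and emits fixed-size frames. The sketch below is illustrative only: `FrameRebuffer` is a hypothetical class, not the provider's actual implementation, and the 320-byte frame (10 ms of 16 kHz mono 16-bit PCM) is an assumed frame size.

```python
from typing import List


class FrameRebuffer:
    """Hypothetical sketch: accumulate PCM bytes, emit fixed-size frames."""

    def __init__(self, frame_bytes: int) -> None:
        if frame_bytes <= 0:
            raise ValueError("frame_bytes must be positive")
        self._frame_bytes = frame_bytes
        self._pending = bytearray()

    def push(self, chunk: bytes) -> List[bytes]:
        """Append a chunk and return every complete frame now available."""
        self._pending.extend(chunk)
        frames: List[bytes] = []
        while len(self._pending) >= self._frame_bytes:
            frames.append(bytes(self._pending[: self._frame_bytes]))
            # Drop the emitted bytes; any remainder waits for the next chunk.
            del self._pending[: self._frame_bytes]
        return frames


# Assumed frame size: 160 samples * 2 bytes = 320 bytes per 10 ms frame.
rebuffer = FrameRebuffer(frame_bytes=320)

# Feed 15 ms chunks (480 bytes), which do not align with 10 ms frames.
first = rebuffer.push(bytes(480))   # one full frame, 160 bytes left over
second = rebuffer.push(bytes(480))  # leftover + 480 bytes = two full frames
```

With 15 ms input chunks, the first push yields one frame and the second yields two, so no audio is dropped or padded when chunk and frame sizes disagree.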