STT & TTS Providers¶
RoomKit's voice pipeline uses pluggable Speech-to-Text (STT) and Text-to-Speech (TTS) providers. This guide covers all built-in providers, their configuration, and streaming capabilities.
STT Provider ABC¶
from __future__ import annotations

from roomkit.voice.stt import STTProvider, TranscriptionResult

class STTProvider(ABC):
    @property
    def name(self) -> str: ...

    @property
    def supports_streaming(self) -> bool: ...

    async def transcribe(self, audio) -> TranscriptionResult:
        """Batch transcription — send all audio, get full text."""

    async def transcribe_stream(self, audio_stream) -> AsyncIterator[TranscriptionResult]:
        """Streaming transcription — get partial results in real time."""

    async def warmup(self) -> None:
        """Pre-load models (optional)."""

    async def close(self) -> None:
        """Release resources."""
TranscriptionResult¶
@dataclass
class TranscriptionResult:
    text: str
    is_final: bool = True
    confidence: float | None = None
    language: str | None = None
    words: list[dict[str, Any]] = field(default_factory=list)
    is_speech_start: bool = False
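For orientation, here is how a consumer typically folds interim and final results together: partials supersede each other, and only finals are accumulated. This is a standalone sketch using a local stand-in for the dataclass, not a RoomKit API:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TranscriptionResult:
    """Local stand-in mirroring the fields above, for illustration."""
    text: str
    is_final: bool = True


def collect_final_text(results):
    """Accumulate final segments; keep only the latest interim for display."""
    finals, latest_interim = [], ""
    for r in results:
        if r.is_final:
            finals.append(r.text)
            latest_interim = ""
        else:
            latest_interim = r.text  # each partial supersedes the last
    return " ".join(finals), latest_interim


results = [
    TranscriptionResult("hel", is_final=False),
    TranscriptionResult("hello wor", is_final=False),
    TranscriptionResult("Hello world.", is_final=True),
]
final, interim = collect_final_text(results)
assert final == "Hello world."
assert interim == ""
```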
Deepgram (Cloud API)¶
The most feature-rich cloud STT provider. Real-time streaming with interim results, keyword boosting, and entity detection.
from __future__ import annotations

from roomkit.voice.stt.deepgram import DeepgramConfig, DeepgramSTTProvider

stt = DeepgramSTTProvider(
    config=DeepgramConfig(
        api_key="your-api-key",
        model="nova-3",
        language="en",
        punctuate=True,
        smart_format=True,
        interim_results=True,
        endpointing=300,  # Silence duration (ms) before endpoint
        vad_events=True,  # Emit speech_start events
    )
)
| Parameter | Default | Description |
|---|---|---|
| `model` | `"nova-3"` | Model name |
| `language` | `"en"` | Language code |
| `punctuate` | `True` | Add punctuation |
| `smart_format` | `True` | Smart formatting (dates, numbers) |
| `numerals` | `False` | Convert numbers to digits |
| `interim_results` | `True` | Include partial results while speaking |
| `endpointing` | `300` | Silence ms before utterance end, or `False` to disable |
| `utterance_end_ms` | `None` | Additional utterance end signal |
| `vad_events` | `True` | Emit VAD events |
| `diarize` | `False` | Speaker diarization |
| `filler_words` | `False` | Include "um", "uh" |
| `keywords` | `[]` | Keywords to boost recognition |
| `keyterm` | `[]` | Key terms (Nova-3) |
| `profanity_filter` | `False` | Filter profanity |
| `redact` | `[]` | Redaction rules (e.g., `["pci", "ssn"]`) |
| `detect_entities` | `False` | Detect named entities |
Streaming events: SpeechStarted, Results (partial + final), UtteranceEnd
Batch mode: HTTP POST to /listen endpoint.
Streaming mode: WebSocket with real-time partials and finals.
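The `endpointing` parameter is easiest to read as a trailing-silence budget: once that many milliseconds of silence accumulate, the utterance is considered finished. Deepgram applies this server-side; the toy function below only illustrates the parameter's meaning (the per-frame speech flags and `frame_ms` are assumptions for the sketch):

```python
def utterance_ended(frames_are_speech, endpointing_ms=300, frame_ms=20):
    """Toy endpointing check: True once trailing silence reaches the
    configured threshold. Any speech frame resets the silence counter."""
    silence_ms = 0
    for is_speech in frames_are_speech:
        silence_ms = 0 if is_speech else silence_ms + frame_ms
        if silence_ms >= endpointing_ms:
            return True
    return False


# 10 speech frames followed by 15 silent 20 ms frames (300 ms of silence)
assert utterance_ended([True] * 10 + [False] * 15)
assert not utterance_ended([True] * 10 + [False] * 14)
```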
SherpaOnnx (Local/Offline)¶
Run STT locally without API calls using ONNX models. Supports transducer (streaming) and Whisper (batch) modes.
from __future__ import annotations

from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTConfig, SherpaOnnxSTTProvider

# Streaming transducer mode
stt = SherpaOnnxSTTProvider(
    config=SherpaOnnxSTTConfig(
        mode="transducer",
        tokens="path/to/tokens.txt",
        encoder="path/to/encoder.onnx",
        decoder="path/to/decoder.onnx",
        joiner="path/to/joiner.onnx",
        sample_rate=16000,
        num_threads=2,
        provider="cpu",  # or "cuda"
        enable_endpoint_detection=True,
        rule1_min_trailing_silence=2.4,  # Seconds
        rule2_min_trailing_silence=1.2,
        rule3_min_utterance_length=20.0,
    )
)

# Batch Whisper mode (no streaming)
stt_whisper = SherpaOnnxSTTProvider(
    config=SherpaOnnxSTTConfig(
        mode="whisper",
        tokens="path/to/tokens.txt",
        encoder="path/to/encoder.onnx",
        decoder="path/to/decoder.onnx",
        language="en",
        task="transcribe",  # or "translate" for English translation
    )
)
| Parameter | Default | Description |
|---|---|---|
| `mode` | required | `"transducer"` (streaming) or `"whisper"` (batch only) |
| `tokens` | required | Path to `tokens.txt` |
| `encoder` | required | Path to encoder ONNX model |
| `decoder` | required | Path to decoder ONNX model |
| `joiner` | `None` | Path to joiner ONNX (transducer only) |
| `sample_rate` | `16000` | Expected input sample rate |
| `num_threads` | `2` | CPU threads for inference |
| `provider` | `"cpu"` | ONNX runtime: `"cpu"` or `"cuda"` |
| `enable_endpoint_detection` | `True` | Detect utterance endpoints |
| `rule1_min_trailing_silence` | `2.4` | Silence threshold (seconds) for rule 1 |
| `rule2_min_trailing_silence` | `1.2` | Silence threshold (seconds) once text has been decoded |
| `rule3_min_utterance_length` | `20.0` | Min utterance length (seconds) for rule 3 |
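Roughly, the three `rule*` parameters combine as alternative endpoint triggers. The toy decision function below illustrates how the config values interact; it is not sherpa-onnx's actual implementation, and the exact rule semantics are an assumption for the sketch:

```python
def is_endpoint(trailing_silence_s: float, utterance_len_s: float,
                decoded_something: bool,
                rule1: float = 2.4, rule2: float = 1.2,
                rule3: float = 20.0) -> bool:
    """Toy combination of the three endpoint rules:
    rule 1: long silence, even before any text is decoded
    rule 2: shorter silence once some text has been decoded
    rule 3: hard cap on utterance length"""
    if not decoded_something and trailing_silence_s >= rule1:
        return True
    if decoded_something and trailing_silence_s >= rule2:
        return True
    return utterance_len_s >= rule3


assert is_endpoint(2.5, 3.0, decoded_something=False)   # rule 1 fires
assert is_endpoint(1.3, 3.0, decoded_something=True)    # rule 2 fires
assert is_endpoint(0.2, 21.0, decoded_something=True)   # rule 3 fires
assert not is_endpoint(0.5, 5.0, decoded_something=True)
```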
Gradium (Cloud API)¶
Cloud STT with built-in server-side VAD and pre-connect buffering to avoid lost first words.
from __future__ import annotations

from roomkit.voice.stt.gradium import GradiumSTTConfig, GradiumSTTProvider

stt = GradiumSTTProvider(
    config=GradiumSTTConfig(
        api_key="your-api-key",
        region="us",
        model_name="default",
        input_format="pcm",
        language="en",
        connect_buffer_ms=300,  # Buffer audio before WebSocket opens
        delay_in_frames=7,      # Processing delay (7-48, each = 80ms)
        vad_threshold=0.9,      # VAD inactivity threshold
        vad_steps=10,           # Steps above threshold to confirm end
        timeout_s=3.0,          # Server inactivity timeout
    )
)
Streaming events: text (partial), end_text (segment done), step (VAD heartbeat)
Pre-connect buffering: Accumulates real audio before opening the WebSocket, then sends a burst — avoids losing the first few words.
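The buffering behaviour can be pictured with a small stand-alone sketch. The class and method names here are invented for illustration; the real provider does this internally around its WebSocket connection:

```python
class PreConnectBuffer:
    """Sketch of pre-connect buffering: frames that arrive before the
    WebSocket is open are held, then flushed as a single burst."""

    def __init__(self) -> None:
        self.backlog: list[bytes] = []
        self.connected = False

    def push(self, frame: bytes, send) -> None:
        if not self.connected:
            self.backlog.append(frame)  # hold until the socket opens
            return
        send(frame)

    def on_connected(self, send) -> None:
        self.connected = True
        while self.backlog:             # burst the held audio first
            send(self.backlog.pop(0))


sent: list[bytes] = []
buf = PreConnectBuffer()
buf.push(b"frame-1", sent.append)  # socket still opening: buffered
buf.push(b"frame-2", sent.append)
buf.on_connected(sent.append)      # burst: frame-1, frame-2
buf.push(b"frame-3", sent.append)  # now streamed directly
assert sent == [b"frame-1", b"frame-2", b"frame-3"]
```

No audio is dropped and ordering is preserved, which is exactly why the first words of an utterance survive the connection handshake.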
Qwen3 ASR (Local/GPU)¶
HuggingFace-based ASR with optional vLLM backend for streaming.
from __future__ import annotations

from roomkit.voice.stt.qwen3 import Qwen3ASRConfig, Qwen3ASRProvider

stt = Qwen3ASRProvider(
    config=Qwen3ASRConfig(
        model_id="Qwen/Qwen3-ASR-0.6B",
        backend="vllm",  # "vllm" for streaming, "transformers" for batch
        device_map="auto",
        dtype="bfloat16",
        language=None,       # None = auto-detect
        chunk_size_sec=2.0,  # Streaming chunk duration
        gpu_memory_utilization=0.3,
        max_new_tokens=2048,
    )
)
| Parameter | Default | Description |
|---|---|---|
| `model_id` | `"Qwen/Qwen3-ASR-0.6B"` | HuggingFace model ID |
| `backend` | `"transformers"` | `"transformers"` (batch) or `"vllm"` (batch + streaming) |
| `dtype` | `"bfloat16"` | Model precision |
| `language` | `None` | Language code or `None` for auto-detect |
| `chunk_size_sec` | `2.0` | Streaming chunk duration (vLLM only) |
| `gpu_memory_utilization` | `0.3` | GPU memory fraction (vLLM only) |
TTS Provider ABC¶
from __future__ import annotations

from roomkit.voice.tts import AudioChunk, AudioContent, TTSProvider

class TTSProvider(ABC):
    @property
    def name(self) -> str: ...

    @property
    def default_voice(self) -> str | None: ...

    @property
    def supports_streaming_input(self) -> bool: ...

    async def synthesize(self, text, *, voice=None) -> AudioContent:
        """Batch synthesis — full text in, complete audio out."""

    async def synthesize_stream(self, text, *, voice=None) -> AsyncIterator[AudioChunk]:
        """Streaming output — yields audio chunks as they're generated."""

    async def synthesize_stream_input(self, text_stream, *, voice=None) -> AsyncIterator[AudioChunk]:
        """Streaming input — accepts async text stream, yields audio."""

    async def warmup(self) -> None:
        """Pre-load models (optional)."""

    async def close(self) -> None:
        """Release resources."""
AudioChunk¶
@dataclass
class AudioChunk:
    data: bytes
    sample_rate: int = 16000
    channels: int = 1
    format: str = "pcm_s16le"
    timestamp_ms: int | None = None
    is_final: bool = False
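Since `pcm_s16le` is 16-bit audio (2 bytes per sample per channel), a chunk's playback duration follows directly from its payload size. A small helper makes the arithmetic concrete; it is an illustration, not part of RoomKit:

```python
def chunk_duration_ms(data: bytes, sample_rate: int = 16000,
                      channels: int = 1) -> float:
    """Playback duration of a pcm_s16le payload: each second of audio
    occupies sample_rate * channels * 2 bytes."""
    bytes_per_second = sample_rate * channels * 2
    return len(data) / bytes_per_second * 1000


# 32 000 bytes of 16 kHz mono s16le audio is exactly one second
assert chunk_duration_ms(b"\x00" * 32000) == 1000.0
```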
ElevenLabs (Cloud API)¶
High-quality cloud TTS with streaming input support — starts speaking while the AI is still generating text.
from __future__ import annotations

from roomkit.voice.tts.elevenlabs import ElevenLabsConfig, ElevenLabsTTSProvider

tts = ElevenLabsTTSProvider(
    config=ElevenLabsConfig(
        api_key="your-api-key",
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_multilingual_v2",
        stability=0.5,
        similarity_boost=0.75,
        style=0.0,
        use_speaker_boost=True,
        output_format="mp3_44100_128",
        optimize_streaming_latency=3,  # 0-4, higher = faster
    )
)

# List available voices
voices = await tts.list_voices()
for v in voices:
    print(f"{v['voice_id']}: {v['name']} ({v['category']})")
| Parameter | Default | Description |
|---|---|---|
| `voice_id` | `"21m00Tcm4TlvDq8ikWAM"` | Voice ID (Rachel) |
| `model_id` | `"eleven_multilingual_v2"` | TTS model |
| `stability` | `0.5` | Voice stability (0–1) |
| `similarity_boost` | `0.75` | Voice similarity (0–1) |
| `style` | `0.0` | Style exaggeration (0–1) |
| `output_format` | `"mp3_44100_128"` | Output format |
| `optimize_streaming_latency` | `3` | Latency optimization level (0–4) |
Three synthesis modes:

- `synthesize()` — Batch: returns complete audio as a base64 data URL
- `synthesize_stream()` — Streaming output: yields audio chunks via HTTP
- `synthesize_stream_input()` — Streaming input: WebSocket accepts async text, yields audio in real time
SherpaOnnx TTS (Local/Offline)¶
Local TTS using VITS/Piper ONNX models. No API calls required.
from __future__ import annotations

from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSConfig, SherpaOnnxTTSProvider

tts = SherpaOnnxTTSProvider(
    config=SherpaOnnxTTSConfig(
        model="path/to/model.onnx",
        tokens="path/to/tokens.txt",
        data_dir="path/to/espeak-ng-data",  # For Piper models
        speaker_id=0,  # Multi-speaker models
        speed=1.0,     # < 1.0 = faster, > 1.0 = slower
        sample_rate=22050,
        num_threads=2,
        provider="cpu",  # or "cuda"
    )
)
Text splitting: Automatic sentence-based chunking (max 300 chars per chunk) with short-fragment merging.
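The described behaviour, sentence splits capped at 300 characters with short fragments merged into their neighbours, can be sketched as follows. This is an illustration of the idea, not the provider's actual splitter:

```python
import re


def chunk_text(text: str, max_chars: int = 300, min_chars: int = 20) -> list[str]:
    """Split on sentence boundaries, then merge a fragment into the
    previous chunk when either side is short and the result still fits."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    for sentence in sentences:
        fits = chunks and len(chunks[-1]) + len(sentence) + 1 <= max_chars
        short = chunks and (len(chunks[-1]) < min_chars or len(sentence) < min_chars)
        if fits and short:
            chunks[-1] = chunks[-1] + " " + sentence
        else:
            chunks.append(sentence)
    return chunks


# The 3-char fragment "Hi." is merged into the following sentence
assert chunk_text("Hi. This is a longer sentence that stands alone.") == [
    "Hi. This is a longer sentence that stands alone."
]
```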
Qwen3 TTS (Local/GPU, Voice Cloning)¶
LLM-based TTS with zero-shot voice cloning from reference audio.
from __future__ import annotations

from roomkit.voice.tts.qwen3 import Qwen3TTSConfig, Qwen3TTSProvider, VoiceCloneConfig

tts = Qwen3TTSProvider(
    config=Qwen3TTSConfig(
        model_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base",
        device_map="auto",
        dtype="bfloat16",
        language="English",
        voices={
            "default": VoiceCloneConfig(
                ref_audio="reference.wav",  # 3s+ clean speech
                ref_text="Transcript of the reference audio.",
            ),
            "french": VoiceCloneConfig(
                ref_audio="french_ref.wav",
                ref_text="Transcription de l'audio de reference.",
            ),
        },
        temperature=0.6,
        top_p=0.8,
        repetition_penalty=1.05,
        max_new_tokens=4096,
    )
)

# Pre-load model and encode reference audio
await tts.warmup()
| Parameter | Default | Description |
|---|---|---|
| `model_id` | `"Qwen/Qwen3-TTS-12Hz-1.7B-Base"` | HuggingFace model |
| `voices` | `{}` | Voice name → `VoiceCloneConfig` mapping |
| `language` | `"English"` | Default synthesis language |
| `temperature` | `0.6` | Sampling temperature |
| `top_p` | `0.8` | Nucleus sampling probability |
| `max_new_tokens` | `4096` | Max output tokens |
Voice cloning: Provide a 3+ second reference WAV and its transcript. The model learns the voice characteristics at warmup time.
Output: Fixed 24kHz PCM.
Gradium TTS (Cloud API)¶
Cloud TTS with streaming input support and fine-grained voice control.
from __future__ import annotations

from roomkit.voice.tts.gradium import GradiumTTSConfig, GradiumTTSProvider

tts = GradiumTTSProvider(
    config=GradiumTTSConfig(
        api_key="your-api-key",
        voice_id="default",
        region="us",
        model_name="default",
        output_format="pcm_16000",
        temperature=0.7,
        cfg_coef=2.0,        # Voice similarity (1.0–4.0)
        padding_bonus=0.0,   # Speed: negative = faster, positive = slower
        rewrite_rules="en",  # Language-specific text rewriting
    )
)
NeuTTS (Local/GPU, Voice Cloning)¶
GGUF-quantized LLM-based TTS with native streaming and voice cloning.
from __future__ import annotations

from roomkit.voice.tts.neutts import NeuTTSConfig, NeuTTSProvider, NeuTTSVoiceConfig

tts = NeuTTSProvider(
    config=NeuTTSConfig(
        backbone_repo="neuphonic/neutts-nano-french-q8-gguf",
        codec_repo="neuphonic/neucodec",
        device="cpu",  # or "cuda"
        voices={
            "default": NeuTTSVoiceConfig(
                ref_audio="reference.wav",  # 3-15s, 16kHz mono
                ref_text="Transcript of reference audio.",
            ),
        },
        streaming_pre_buffer=2,  # Chunks to buffer before yielding
    )
)
Pre-buffering: On CPU, accumulates 2 chunks (~1 second) before yielding to prevent playback underruns when inference is slower than real-time.
Output: Fixed 24kHz PCM.
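The pre-buffering logic amounts to holding the first `streaming_pre_buffer` chunks before yielding anything, so playback starts with a cushion. A stand-alone sketch (function names invented for illustration, not the provider's internals):

```python
import asyncio


async def pre_buffered(chunks, pre_buffer: int = 2):
    """Hold the first `pre_buffer` chunks, then release them together;
    afterwards, pass chunks through unchanged."""
    held = []
    async for chunk in chunks:
        if len(held) < pre_buffer:
            held.append(chunk)
            if len(held) == pre_buffer:
                for h in held:
                    yield h
            continue
        yield chunk
    if len(held) < pre_buffer:  # short stream: flush whatever was held
        for h in held:
            yield h


async def main() -> list[str]:
    async def gen():
        for c in ["c1", "c2", "c3"]:
            yield c

    return [c async for c in pre_buffered(gen())]


assert asyncio.run(main()) == ["c1", "c2", "c3"]
```

Nothing is reordered or dropped; the only effect is that the consumer sees the first two chunks at once, which absorbs momentary slowdowns in inference.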
TTS Filters¶
Filters clean AI-generated text before it reaches the TTS provider. Essential for removing reasoning markers, annotations, or bracketed instructions.
StripInternalTags¶
Removes [internal]...[/internal] and [internal: ...] blocks — useful when the AI includes reasoning that shouldn't be spoken.
from __future__ import annotations

from roomkit.voice.tts.filters import StripInternalTags

f = StripInternalTags()

# Non-streaming (full text)
clean = f("[internal]Let me think about this...[/internal] Here's what I found.")
# → "Here's what I found."

# Streaming (token by token)
for token in ["[internal", "]thinking[/", "internal] The answer", " is 42."]:
    result = f.feed(token)
    if result:
        print(result, end="")
print(f.flush())
# → "The answer is 42."
StripBrackets¶
Removes all [...] bracketed content — catches [laughs], [pause], [Respond in French], etc.
from __future__ import annotations

from roomkit.voice.tts.filters import StripBrackets

f = StripBrackets()
clean = f("Sure [laughs] I can help [pause] with that.")
# → "Sure I can help with that."
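On full text, the filter behaves roughly like a bracket-dropping regex followed by whitespace normalisation. This is a sketch of the observable behaviour, not the library's implementation:

```python
import re


def strip_brackets(text: str) -> str:
    """Drop [...] spans, then collapse the leftover double spaces."""
    return " ".join(re.sub(r"\[[^\]]*\]", "", text).split())


assert strip_brackets("Sure [laughs] I can help [pause] with that.") == \
    "Sure I can help with that."
```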
Using Filters with Streaming TTS¶
from __future__ import annotations

from roomkit.voice.tts.filters import StripInternalTags, filtered_stream

async def ai_token_stream():
    """Simulated AI output with internal reasoning."""
    for token in ["[internal]", "reasoning", "[/internal]", " Hello", " there!"]:
        yield token

# Wrap the token stream through a filter before TTS
clean_stream = filtered_stream(ai_token_stream(), StripInternalTags())
async for chunk in tts.synthesize_stream_input(clean_stream, voice="default"):
    # Audio chunks without the internal reasoning
    transport.send_audio(session, chunk)
Sentence Splitter¶
Buffers streaming tokens and yields complete sentences — prevents unnatural pauses from very short fragments.
from __future__ import annotations

from roomkit.voice.tts.sentence_splitter import split_sentences

# Buffer tokens until sentence boundaries
async for sentence in split_sentences(ai_token_stream(), min_chunk_chars=20):
    async for chunk in tts.synthesize_stream(sentence, voice="default"):
        transport.send_audio(session, chunk)
Choosing a Provider¶
| Provider | Type | Streaming | Latency | Cost | Best For |
|---|---|---|---|---|---|
| Deepgram | Cloud STT | Yes | Low | Per-minute | Production real-time transcription |
| Gradium | Cloud STT | Yes | Low | Per-minute | Real-time with server-side VAD |
| SherpaOnnx | Local STT | Transducer only | Medium | Free | Privacy, offline, edge |
| Qwen3 ASR | Local STT | vLLM only | Medium | Free | GPU-accelerated, multilingual |
| ElevenLabs | Cloud TTS | Yes + input | Low | Per-character | Highest voice quality |
| Gradium | Cloud TTS | Yes + input | Low | Per-character | Real-time with voice control |
| SherpaOnnx | Local TTS | Yes | Medium | Free | Privacy, offline, VITS/Piper |
| Qwen3 TTS | Local TTS | Post-gen | Medium | Free | Voice cloning, GPU |
| NeuTTS | Local TTS | GGUF only | Medium | Free | Voice cloning, GGUF quantized |
Using with VoiceChannel¶
from __future__ import annotations

from roomkit.channels import VoiceChannel
from roomkit.voice.backends.mock import MockVoiceBackend
from roomkit.voice.pipeline import AudioPipelineConfig
from roomkit.voice.stt.deepgram import DeepgramConfig, DeepgramSTTProvider
from roomkit.voice.tts.elevenlabs import ElevenLabsConfig, ElevenLabsTTSProvider

stt = DeepgramSTTProvider(config=DeepgramConfig(api_key="..."))
tts = ElevenLabsTTSProvider(config=ElevenLabsConfig(api_key="..."))

voice = VoiceChannel(
    "voice-main",
    stt=stt,
    tts=tts,
    backend=MockVoiceBackend(),
    pipeline=AudioPipelineConfig(),
)
kit.register_channel(voice)
Testing with Mocks¶
from __future__ import annotations

from roomkit.voice.stt.mock import MockSTTProvider
from roomkit.voice.tts.mock import MockTTSProvider

stt = MockSTTProvider(transcripts=["Hello", "How are you?"], streaming=False)
tts = MockTTSProvider(voice="mock-voice")

# After usage:
assert len(stt.calls) == 1  # Audio inputs received
assert len(tts.calls) == 1  # Synthesis requests made
assert tts.calls[0]["text"] == "Hello!"  # Text synthesized