How I integrated Gradium's audio language models into RoomKit for multi-channel voice AI
This article shows how I integrated Gradium as an STT and TTS provider inside RoomKit, my open-source Python framework for multi-channel conversation orchestration. If you're building voice AI agents in Python and want top-tier voice quality with low latency, this is for you.
What is Gradium?
Gradium builds Audio Language Models (ALMs): not wrappers around existing speech models, but neural architectures purpose-built for voice. Their API exposes two main services:
- Speech-to-Text (STT): Streaming transcription via WebSocket at 24kHz, with a semantic Voice Activity Detection (VAD) that tells you probabilistically when the speaker is done talking. No more arbitrary silence thresholds.
- Text-to-Speech (TTS): Streaming synthesis at 48kHz with sub-300ms time-to-first-token. Over 150 voices across English, French, German, Spanish, and Portuguese. Instant voice cloning from a 10-second sample.
The key differentiator? Their VAD isn't a simple energy detector: it returns inactivity_prob values at multiple time horizons (0.5s, 1.0s, 2.0s), letting you make intelligent turn-taking decisions. This is something I hadn't seen elsewhere, and it maps beautifully to how real conversations work.
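To give a taste of how those multi-horizon probabilities can drive turn-taking, here's a minimal sketch. The message shape mirrors the step messages discussed later in this article, but the horizon_s key is my own assumption; only inactivity_prob and the three-horizon ordering come from Gradium's API as described here:

```python
# Hypothetical "step" message; the horizon_s key is an assumption,
# inactivity_prob and the 0.5s/1.0s/2.0s ordering come from the article.
step = {
    "type": "step",
    "vad": [
        {"horizon_s": 0.5, "inactivity_prob": 0.31},
        {"horizon_s": 1.0, "inactivity_prob": 0.55},
        {"horizon_s": 2.0, "inactivity_prob": 0.82},
    ],
}

def turn_is_over(msg: dict, threshold: float = 0.5) -> bool:
    """Decide end of turn from the longest-horizon probability."""
    probs = msg.get("vad", [])
    if len(probs) < 3:
        return False
    return probs[2].get("inactivity_prob", 0.0) > threshold

print(turn_is_over(step))  # True
```

Raising the threshold trades responsiveness for patience: at 0.9, this sample frame would not yet end the turn.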
The Python SDK is async-native and straightforward:
pip install gradium
import gradium

client = gradium.client.GradiumClient()  # uses GRADIUM_API_KEY env var

# TTS - generate speech
audio = await client.tts(
    setup={"voice_id": "axlOaUiFyOZhy4nv", "output_format": "pcm"},
    text="Bonjour, comment puis-je vous aider?",
)

# TTS - streaming
stream = await client.tts_stream(
    setup={"voice_id": "axlOaUiFyOZhy4nv", "output_format": "pcm"},
    text=my_async_text_generator(),
)
async for chunk in stream.iter_bytes():
    # process audio chunks as they arrive
    pass
For STT, the API uses WebSocket streaming with base64-encoded audio:
stream = await client.stt_stream(
    {"model_name": "default", "input_format": "pcm"},
    audio_generator(audio_data),
)
async for message in stream.iter_text():
    print(message)  # transcribed text, as it comes
What is RoomKit?
RoomKit is a pure async Python library I built for multi-channel conversations. The core idea: a room is a conversation. Channels are how people and AI interact with that room: SMS, Email, WhatsApp, Voice, WebSocket, AI, and more.
Inbound ──► Hook pipeline ──► Store ──► Broadcast to all channels
│
┌──────────┬──────────┬────────┬─────┼─────┐
▼ ▼ ▼ ▼ ▼ ▼
SMS WhatsApp Email Voice WS AI
For voice, RoomKit has a pluggable provider system. The architecture separates concerns cleanly:
- Voice Backends handle audio transport (WebRTC via FastRTC, RTP, local mic)
- STT Providers convert audio to text
- TTS Providers convert text to audio
- Audio Pipeline sits between them with VAD, echo cancellation, denoising, recording
The voice channel's audio pipeline looks like this:
Inbound: Backend → Resampler → Recorder → AEC → AGC → Denoiser → VAD → STT
Outbound: TTS → PostProcessors → Recorder → AEC.feed_reference → Resampler → Backend
RoomKit ships with providers for Deepgram (STT), ElevenLabs (TTS), and SherpaOnnx (local STT/TTS). Adding Gradium means implementing two interfaces: BaseSTTProvider and BaseTTSProvider.
Building the Gradium STT Provider
RoomKit's STT provider interface is simple. You need to handle streaming audio in and emit transcription events. Here's the core of the Gradium STT integration:
import asyncio
import base64
import json
import os

import websockets

from roomkit.voice.stt.base import BaseSTTProvider, STTResult


class GradiumSTTProvider(BaseSTTProvider):
    """Gradium Speech-to-Text provider for RoomKit."""

    def __init__(
        self,
        api_key: str | None = None,
        model_name: str = "default",
        language: str = "fr",
        endpoint: str = "wss://eu.api.gradium.ai/api/speech/asr",
        vad_threshold: float = 0.5,
    ):
        self.api_key = api_key or os.environ["GRADIUM_API_KEY"]
        self.model_name = model_name
        self.language = language
        self.endpoint = endpoint
        self.vad_threshold = vad_threshold
        self._ws = None

    async def start(self) -> None:
        """Open the WebSocket and send the setup message."""
        self._ws = await websockets.connect(
            self.endpoint,
            additional_headers={"x-api-key": self.api_key},
        )
        setup = {
            "type": "setup",
            "model_name": self.model_name,
            "input_format": "pcm",
            "json_config": {"language": self.language},
        }
        await self._ws.send(json.dumps(setup))
        ready = json.loads(await self._ws.recv())
        assert ready["type"] == "ready"
        self._sample_rate = ready.get("sample_rate", 24000)
        self._frame_size = ready.get("frame_size", 1920)

    async def process_audio(self, audio_chunk: bytes) -> list[STTResult]:
        """Send audio and collect any pending transcription results."""
        results = []

        # Send one audio frame, base64-encoded
        audio_b64 = base64.b64encode(audio_chunk).decode()
        await self._ws.send(json.dumps({
            "type": "audio",
            "audio": audio_b64,
        }))

        # Non-blocking receive of any pending messages
        while True:
            try:
                msg = json.loads(await asyncio.wait_for(
                    self._ws.recv(), timeout=0.01
                ))
            except asyncio.TimeoutError:
                break
            if msg["type"] == "text":
                results.append(STTResult(
                    text=msg["text"],
                    is_final=False,
                    start_time=msg.get("start_s"),
                ))
            elif msg["type"] == "step":
                # Gradium's semantic VAD: probabilities at 0.5s/1.0s/2.0s horizons
                vad_probs = msg.get("vad", [])
                if len(vad_probs) >= 3:
                    inactivity = vad_probs[2].get("inactivity_prob", 0)
                    if inactivity > self.vad_threshold:
                        results.append(STTResult(
                            text="",
                            is_final=True,
                            metadata={"vad_inactivity": inactivity},
                        ))
        return results

    async def stop(self) -> None:
        if self._ws:
            await self._ws.send(json.dumps({"type": "end_of_stream"}))
            await self._ws.close()
The magic here is the VAD integration. Gradium sends step messages every 80ms with inactivity probabilities at 0.5s, 1.0s, and 2.0s horizons. When the 2-second-horizon probability crosses the threshold, we emit an is_final=True result, which RoomKit's voice channel interprets as "the user is done speaking, send the accumulated text to the AI."
This is far more natural than the typical approach of "wait for N milliseconds of silence."
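For contrast, that typical approach looks something like the toy detector below: a fixed energy threshold plus a fixed silence window. This is an illustration of the baseline, not RoomKit's pipeline VAD, and the numbers are arbitrary:

```python
class SilenceTimeoutVAD:
    """Toy baseline: declare end-of-turn after a fixed window below an
    energy threshold. The defaults here are arbitrary illustrations."""

    def __init__(self, silence_ms: float = 700.0,
                 energy_threshold: float = 500.0):
        self.silence_ms = silence_ms
        self.energy_threshold = energy_threshold
        self._silence_start: float | None = None

    def push(self, energy: float, now_ms: float) -> bool:
        """Feed one frame's energy; return True once the turn looks over."""
        if energy >= self.energy_threshold:
            self._silence_start = None  # speech resets the timer
            return False
        if self._silence_start is None:
            self._silence_start = now_ms
        return now_ms - self._silence_start >= self.silence_ms

vad = SilenceTimeoutVAD()
print(vad.push(10.0, 0.0), vad.push(10.0, 800.0))  # False True
```

Whatever silence_ms you pick, it's wrong for somebody: too short cuts off slow speakers, too long feels laggy. The probabilistic horizons sidestep that tuning problem.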
Building the Gradium TTS Provider
The TTS provider needs to accept text and return audio chunks. Gradium's streaming TTS fits perfectly:
import base64
import json
import os
from collections.abc import AsyncIterator

import websockets

from roomkit.voice.tts.base import BaseTTSProvider, TTSResult


class GradiumTTSProvider(BaseTTSProvider):
    """Gradium Text-to-Speech provider for RoomKit."""

    def __init__(
        self,
        api_key: str | None = None,
        voice_id: str = "axlOaUiFyOZhy4nv",  # Leo - French male
        model_name: str = "default",
        endpoint: str = "wss://eu.api.gradium.ai/api/speech/tts",
        output_format: str = "pcm",
        speed: float = 0.0,
    ):
        self.api_key = api_key or os.environ["GRADIUM_API_KEY"]
        self.voice_id = voice_id
        self.model_name = model_name
        self.endpoint = endpoint
        self.output_format = output_format
        self.speed = speed

    @property
    def sample_rate(self) -> int:
        return 48000  # Gradium outputs at 48kHz

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Stream TTS audio chunks."""
        async with websockets.connect(
            self.endpoint,
            additional_headers={"x-api-key": self.api_key},
        ) as ws:
            # Setup
            setup = {
                "type": "setup",
                "model_name": self.model_name,
                "voice_id": self.voice_id,
                "output_format": self.output_format,
            }
            if self.speed != 0.0:
                setup["json_config"] = {"padding_bonus": self.speed}
            await ws.send(json.dumps(setup))
            ready = json.loads(await ws.recv())
            assert ready["type"] == "ready"

            # Send text, then signal end of input
            await ws.send(json.dumps({"type": "text", "text": text}))
            await ws.send(json.dumps({"type": "end_of_stream"}))

            # Stream audio back until the server signals completion
            async for msg_raw in ws:
                msg = json.loads(msg_raw)
                if msg["type"] == "audio":
                    yield base64.b64decode(msg["audio"])
                elif msg["type"] == "end_of_stream":
                    break
Two things worth noting:
1. 48kHz output: Gradium produces high-quality 48kHz audio. RoomKit's audio pipeline has a built-in resampler, so if your voice backend expects 16kHz or 8kHz (common for telephony), it's handled automatically.
2. Streaming input: For LLM-powered agents, text arrives token by token. Instead of waiting for the full response, you can feed Gradium's TTS an async generator of text. The <flush> tag forces audio generation for buffered text:
async def stream_llm_to_tts(llm_stream):
    """Feed LLM tokens directly into Gradium TTS."""
    buffer = ""
    async for token in llm_stream:
        buffer += token
        # Flush on sentence boundaries for natural pacing
        if token.rstrip().endswith((".", "!", "?", ":")):
            yield buffer + " <flush> "
            buffer = ""
    if buffer:
        yield buffer
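To see what the chunker emits, here's a standalone run against a fake token stream. The helper is repeated so the snippet runs on its own, and the token sequence is made up:

```python
import asyncio

async def fake_llm_stream():
    # A made-up token stream standing in for LLM output
    for token in ["Bon", "jour", ". ", "Comment ", "ça ", "va", "?"]:
        yield token

async def stream_llm_to_tts(llm_stream):
    """Same chunker as above, repeated so this snippet runs standalone."""
    buffer = ""
    async for token in llm_stream:
        buffer += token
        if token.rstrip().endswith((".", "!", "?", ":")):
            yield buffer + " <flush> "
            buffer = ""
    if buffer:
        yield buffer

async def main() -> list[str]:
    return [chunk async for chunk in stream_llm_to_tts(fake_llm_stream())]

chunks = asyncio.run(main())
print(len(chunks))  # 2: one chunk per sentence boundary
```

Each sentence becomes one flushed chunk, so TTS can start speaking "Bonjour." while the rest of the reply is still being generated.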
Wiring It All Together
Here's a complete example: a French-speaking voice AI agent using Gradium for STT/TTS, Claude for intelligence, accessible via WebRTC, all in under 50 lines of application code:
import asyncio

from roomkit import RoomKit, ChannelCategory
from roomkit.channels.voice import VoiceChannel
from roomkit.channels.ai import AIChannel
from roomkit.providers.ai import AnthropicAIProvider

# Import our Gradium providers
from my_providers import GradiumSTTProvider, GradiumTTSProvider


async def main():
    kit = RoomKit()

    # Configure Gradium
    stt = GradiumSTTProvider(
        language="fr",
        endpoint="wss://eu.api.gradium.ai/api/speech/asr",
    )
    tts = GradiumTTSProvider(
        voice_id="axlOaUiFyOZhy4nv",  # Leo - French male voice
        endpoint="wss://eu.api.gradium.ai/api/speech/tts",
    )

    # Create channels
    voice = VoiceChannel("phone", stt_provider=stt, tts_provider=tts)
    ai = AIChannel("claude", provider=AnthropicAIProvider(
        model="claude-sonnet-4-20250514",
        system_prompt="Tu es un assistant francophone. Réponds en français.",
    ))
    kit.register_channel(voice)
    kit.register_channel(ai)

    # Create room and wire everything
    await kit.create_room(room_id="conversation-1")
    await kit.attach_channel("conversation-1", "phone")
    await kit.attach_channel("conversation-1", "claude",
                             category=ChannelCategory.INTELLIGENCE)

    # The voice channel handles the rest:
    # Audio in → STT (Gradium) → Text → AI (Claude) → Text → TTS (Gradium) → Audio out
    print("Voice agent ready. Listening...")
    await asyncio.Event().wait()


asyncio.run(main())
That's it. The audio flows through Gradium's STT, the transcribed text goes to Claude, Claude's response goes through Gradium's TTS, and the audio goes back to the caller. RoomKit handles the orchestration, the audio pipeline (VAD, echo cancellation, denoising), and the event lifecycle.
And because this is RoomKit, you can add SMS or WhatsApp to the same room. If the call drops, the conversation continues on another channel:
from roomkit.channels.sms import SMSChannel
from roomkit.providers.sms import TwilioSMSProvider

sms = SMSChannel("sms-fallback", provider=TwilioSMSProvider(...))
kit.register_channel(sms)
await kit.attach_channel("conversation-1", "sms-fallback")
Why Gradium Stands Out
After testing several STT/TTS providers for voice AI, here's what makes Gradium different:
The semantic VAD is a game-changer. Most STT services give you raw transcripts and leave turn detection to you. Gradium's VAD returns probability-based inactivity predictions at multiple horizons. This means your agent doesn't cut people off mid-sentence, and it doesn't wait awkwardly after they've clearly finished speaking. In my testing, the inactivity_prob > 0.5 threshold on the 2-second horizon gives remarkably natural turn-taking, better than anything I've achieved with Silero VAD + arbitrary silence timeouts.
Multilingual quality is native, not bolted on. Gradium supports English, French, German, Spanish, and Portuguese out of the box, with 150+ voices. For my use case, the French voices, including Canadian French variants, are noticeably better than competitors in terms of pronunciation and natural flow. But the same quality applies across all supported languages.
The streaming architecture gets it right. WebSocket-based streaming for both STT and TTS means you can pipeline everything: audio comes in and is transcribed as it arrives, the LLM starts generating while the user is still finishing their sentence, and TTS starts producing audio from the first tokens. The perceived latency is dramatically lower than with batch-processing approaches.
Production-ready controls. The padding_bonus parameter for speed (-4.0 to +4.0), the temp parameter for vocal variety, and the rewrite rules for language-specific number/date/URL pronunciation are small things that make a big difference in production. When your voice agent reads back a phone number with proper locale-specific grouping instead of digit-by-digit, it sounds professional.
Practical Tips
A few lessons learned from the integration:
1. Sample rate management: Gradium STT expects 24kHz input, TTS outputs 48kHz. If your voice backend runs at 16kHz (common for telephony), configure RoomKit's resampler in the audio pipeline. The framework handles this, but be explicit about it.
2. Connection pooling: Gradium's WebSocket sessions last up to 300 seconds. For long conversations, implement reconnection logic. RoomKit's circuit breaker pattern helps here: if the WebSocket drops, the circuit opens, messages are queued, and reconnection happens with exponential backoff.
3. EU vs US endpoints: Gradium has servers in both Europe (eu.api.gradium.ai) and the US (us.api.gradium.ai). Choose based on your users' location for lowest latency. The EU endpoint also helps with data residency compliance if that matters for your industry.
4. Voice cloning for brand identity: Gradium's instant voice cloning (10-second sample) lets you create a branded voice for your agent. Combined with RoomKit's per-room AI configuration, you can have different agents with different voices in different rooms.
5. The <flush> tag: When streaming LLM output to TTS, use <flush> at natural sentence boundaries. The model buffers text for context before generating audio, so flushing at the right moments keeps latency low while maintaining natural prosody.
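On point 2 above, the reconnection logic can start as simply as the sketch below: plain asyncio with exponential backoff and jitter, not RoomKit's actual circuit breaker. The connect argument is any coroutine factory, such as a Gradium WebSocket dial:

```python
import asyncio
import random

async def connect_with_backoff(connect, max_attempts: int = 5,
                               base_delay: float = 0.5):
    """Retry an async connect() with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = base_delay * (2 ** attempt) * (1 + 0.1 * random.random())
            await asyncio.sleep(delay)

# Usage sketch: a flaky dial that succeeds on the third attempt.
attempts = 0

async def flaky_connect():
    global attempts
    attempts += 1
    if attempts < 3:
        raise OSError("connection refused")
    return "connected"

result = asyncio.run(connect_with_backoff(flaky_connect, base_delay=0.01))
print(result)  # connected
```

A production version would also re-send the STT setup message after reconnecting and replay any audio frames queued while the circuit was open.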
What's Next
RoomKit is open source (MIT licensed) and actively looking for contributors. The Gradium provider integration is a great example of how the pluggable architecture works, and I'd love to see the community build providers for other services.
If you're building voice AI in Python, here are the links:
- RoomKit: github.com/roomkit-live/roomkit | roomkit.live
- Gradium: gradium.ai | API Docs
- Gradium Python SDK:
pip install gradium
Star the repos, try the quickstart, open an issue. The voice AI space is moving fast, and having providers with native multilingual support and solid streaming APIs makes the developer experience significantly better.