
Gradium + RoomKit: Building Voice AI Agents with Ultra-Low Latency STT/TTS

February 9, 2026 · 15 min read

How I integrated Gradium's audio language models into RoomKit for multi-channel voice AI


This article shows how I integrated Gradium as an STT and TTS provider inside RoomKit, my open-source Python framework for multi-channel conversation orchestration. If you're building voice AI agents in Python and want top-tier voice quality with low latency, this is for you.


What is Gradium?

Gradium builds Audio Language Models (ALMs): not just wrappers around existing speech models, but purpose-built neural architectures for voice. Their API exposes two main services: streaming speech-to-text (with built-in semantic VAD) and streaming text-to-speech.

The key differentiator? Their VAD isn't a simple energy detector: it returns inactivity_prob values at multiple time horizons (0.5s, 1.0s, 2.0s), letting you make intelligent turn-taking decisions. This is something I hadn't seen elsewhere, and it maps beautifully to how real conversations work.
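To make the multi-horizon idea concrete, here's a sketch of a turn-end check. The message shape and probability values below are illustrative rather than copied from Gradium's docs (the horizon_s label in particular is mine); only the inactivity_prob field and the three horizons are from the API.

```python
# Illustrative shape of a Gradium VAD "step" message; the three entries
# correspond to the 0.5s, 1.0s, and 2.0s horizons. Values are made up.
step_msg = {
    "type": "step",
    "vad": [
        {"horizon_s": 0.5, "inactivity_prob": 0.12},
        {"horizon_s": 1.0, "inactivity_prob": 0.34},
        {"horizon_s": 2.0, "inactivity_prob": 0.71},
    ],
}

def turn_is_over(msg: dict, threshold: float = 0.5) -> bool:
    """True when the longest-horizon inactivity probability crosses threshold."""
    vad = msg.get("vad", [])
    return len(vad) >= 3 and vad[2].get("inactivity_prob", 0.0) > threshold
```

The short horizons stay useful too: a high 0.5s probability with a low 2.0s one suggests a mid-sentence pause, not the end of a turn.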

The Python SDK is async-native and straightforward:

pip install gradium

import gradium

client = gradium.client.GradiumClient()  # uses GRADIUM_API_KEY env var

# TTS - generate speech
audio = await client.tts(
    setup={"voice_id": "axlOaUiFyOZhy4nv", "output_format": "pcm"},
    text="Bonjour, comment puis-je vous aider?"
)

# TTS - streaming
stream = await client.tts_stream(
    setup={"voice_id": "axlOaUiFyOZhy4nv", "output_format": "pcm"},
    text=my_async_text_generator()
)
async for chunk in stream.iter_bytes():
    # process audio chunks as they arrive
    pass

For STT, the API uses WebSocket streaming with base64-encoded audio:

stream = await client.stt_stream(
    {"model_name": "default", "input_format": "pcm"},
    audio_generator(audio_data),
)
async for message in stream.iter_text():
    print(message)  # transcribed text, as it comes

What is RoomKit?

RoomKit is a pure async Python library I built for multi-channel conversations. The core idea: a room is a conversation. Channels are how people and AI interact with that room: SMS, Email, WhatsApp, Voice, WebSocket, AI, and more.

Inbound ──► Hook pipeline ──► Store ──► Broadcast to all channels
                                             │
        ┌──────────┬──────────┬────────┬─────┼─────┐
        ▼          ▼          ▼        ▼     ▼     ▼
      SMS       WhatsApp    Email    Voice   WS     AI

For voice, RoomKit has a pluggable provider system that separates concerns cleanly: the backend handles audio transport, while STT and TTS providers handle speech.

The voice channel's audio pipeline looks like this:

Inbound:  Backend → Resampler → Recorder → AEC → AGC → Denoiser → VAD → STT
Outbound: TTS → PostProcessors → Recorder → AEC.feed_reference → Resampler → Backend

RoomKit ships with providers for Deepgram (STT), ElevenLabs (TTS), and SherpaOnnx (local STT/TTS). Adding Gradium means implementing two interfaces: BaseSTTProvider and BaseTTSProvider.


Building the Gradium STT Provider

RoomKit's STT provider interface is simple. You need to handle streaming audio in and emit transcription events. Here's the core of the Gradium STT integration:

import asyncio
import base64
import json
import os

import websockets

from roomkit.voice.stt.base import BaseSTTProvider, STTResult

class GradiumSTTProvider(BaseSTTProvider):
    """Gradium Speech-to-Text provider for RoomKit."""

    def __init__(
        self,
        api_key: str | None = None,
        model_name: str = "default",
        language: str = "fr",
        endpoint: str = "wss://eu.api.gradium.ai/api/speech/asr",
        vad_threshold: float = 0.5,
    ):
        self.api_key = api_key or os.environ["GRADIUM_API_KEY"]
        self.model_name = model_name
        self.language = language
        self.endpoint = endpoint
        self.vad_threshold = vad_threshold
        self._ws = None

    async def start(self) -> None:
        """Open WebSocket and send setup message."""
        self._ws = await websockets.connect(
            self.endpoint,
            additional_headers={"x-api-key": self.api_key},
        )
        setup = {
            "type": "setup",
            "model_name": self.model_name,
            "input_format": "pcm",
            "json_config": {"language": self.language},
        }
        await self._ws.send(json.dumps(setup))
        ready = json.loads(await self._ws.recv())
        assert ready["type"] == "ready"
        self._sample_rate = ready.get("sample_rate", 24000)
        self._frame_size = ready.get("frame_size", 1920)

    async def process_audio(self, audio_chunk: bytes) -> list[STTResult]:
        """Send audio and collect transcription results."""
        results = []

        # Send audio frame
        audio_b64 = base64.b64encode(audio_chunk).decode()
        await self._ws.send(json.dumps({
            "type": "audio",
            "audio": audio_b64,
        }))

        # Non-blocking receive of any pending messages
        while True:
            try:
                msg = json.loads(await asyncio.wait_for(
                    self._ws.recv(), timeout=0.01
                ))
            except asyncio.TimeoutError:
                break

            if msg["type"] == "text":
                results.append(STTResult(
                    text=msg["text"],
                    is_final=False,
                    start_time=msg.get("start_s"),
                ))
            elif msg["type"] == "step":
                # Gradium's semantic VAD
                vad_probs = msg.get("vad", [])
                if len(vad_probs) >= 3:
                    inactivity = vad_probs[2].get("inactivity_prob", 0)
                    if inactivity > self.vad_threshold:
                        results.append(STTResult(
                            text="",
                            is_final=True,
                            metadata={"vad_inactivity": inactivity},
                        ))

        return results

    async def stop(self) -> None:
        if self._ws:
            await self._ws.send(json.dumps({"type": "end_of_stream"}))
            await self._ws.close()

The magic here is the VAD integration. Gradium sends step messages every 80ms with inactivity probabilities at 0.5s, 1.0s, and 2.0s horizons. When the 2-second-horizon probability crosses the threshold, we emit an is_final=True result, which RoomKit's voice channel interprets as "the user is done speaking, send the accumulated text to the AI."

This is far more natural than the typical approach of "wait for N milliseconds of silence."


Building the Gradium TTS Provider

The TTS provider needs to accept text and return audio chunks. Gradium's streaming TTS fits perfectly:

import base64
import json
import os
from collections.abc import AsyncIterator

import websockets

from roomkit.voice.tts.base import BaseTTSProvider, TTSResult

class GradiumTTSProvider(BaseTTSProvider):
    """Gradium Text-to-Speech provider for RoomKit."""

    def __init__(
        self,
        api_key: str | None = None,
        voice_id: str = "axlOaUiFyOZhy4nv",  # Leo - French male
        model_name: str = "default",
        endpoint: str = "wss://eu.api.gradium.ai/api/speech/tts",
        output_format: str = "pcm",
        speed: float = 0.0,
    ):
        self.api_key = api_key or os.environ["GRADIUM_API_KEY"]
        self.voice_id = voice_id
        self.model_name = model_name
        self.endpoint = endpoint
        self.output_format = output_format
        self.speed = speed

    @property
    def sample_rate(self) -> int:
        return 48000  # Gradium outputs at 48kHz

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Stream TTS audio chunks."""
        async with websockets.connect(
            self.endpoint,
            additional_headers={"x-api-key": self.api_key},
        ) as ws:
            # Setup
            setup = {
                "type": "setup",
                "model_name": self.model_name,
                "voice_id": self.voice_id,
                "output_format": self.output_format,
            }
            if self.speed != 0.0:
                setup["json_config"] = {"padding_bonus": self.speed}

            await ws.send(json.dumps(setup))
            ready = json.loads(await ws.recv())
            assert ready["type"] == "ready"

            # Send text
            await ws.send(json.dumps({"type": "text", "text": text}))
            await ws.send(json.dumps({"type": "end_of_stream"}))

            # Stream audio
            async for msg_raw in ws:
                msg = json.loads(msg_raw)
                if msg["type"] == "audio":
                    yield base64.b64decode(msg["audio"])
                elif msg["type"] == "end_of_stream":
                    break

Two things worth noting:

1. 48kHz output: Gradium produces high-quality 48kHz audio. RoomKit's audio pipeline has a built-in resampler, so if your voice backend expects 16kHz or 8kHz (common for telephony), it's handled automatically.

2. Streaming input: For LLM-powered agents, the text comes in token by token. Instead of waiting for the full response, you can feed Gradium's TTS with an async generator. The <flush> tag forces audio generation for buffered text:

async def stream_llm_to_tts(llm_stream):
    """Feed LLM tokens directly into Gradium TTS."""
    buffer = ""
    async for token in llm_stream:
        buffer += token
        # Flush on sentence boundaries for natural pacing
        if token.rstrip().endswith((".", "!", "?", ":")):
            yield buffer + " <flush> "
            buffer = ""
    if buffer:
        yield buffer
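You can sanity-check the chunking without touching the API by driving the generator with a fake token stream (fake_llm below is purely illustrative):

```python
import asyncio

async def stream_llm_to_tts(llm_stream):
    """Same generator as above: flush on sentence boundaries."""
    buffer = ""
    async for token in llm_stream:
        buffer += token
        if token.rstrip().endswith((".", "!", "?", ":")):
            yield buffer + " <flush> "
            buffer = ""
    if buffer:
        yield buffer

async def fake_llm():
    # Stand-in for a real LLM token stream
    for token in ["Bonjour.", " Comment", " puis-je", " vous aider?"]:
        yield token

async def collect():
    # Gather everything the generator would send to the TTS service
    return [chunk async for chunk in stream_llm_to_tts(fake_llm())]

chunks = asyncio.run(collect())
```

In production you'd pass the generator straight to client.tts_stream instead of collecting it, as in the streaming TTS example earlier.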

Wiring It All Together

Here's a complete example: a French-speaking voice AI agent using Gradium for STT/TTS, Claude for intelligence, accessible via WebRTC, all in under 50 lines of application code:

import asyncio
from roomkit import RoomKit, ChannelCategory
from roomkit.channels.voice import VoiceChannel
from roomkit.channels.ai import AIChannel
from roomkit.providers.ai import AnthropicAIProvider

# Import our Gradium providers
from my_providers import GradiumSTTProvider, GradiumTTSProvider

async def main():
    kit = RoomKit()

    # Configure Gradium
    stt = GradiumSTTProvider(
        language="fr",
        endpoint="wss://eu.api.gradium.ai/api/speech/asr",
    )
    tts = GradiumTTSProvider(
        voice_id="axlOaUiFyOZhy4nv",  # Leo - French male voice
        endpoint="wss://eu.api.gradium.ai/api/speech/tts",
    )

    # Create channels
    voice = VoiceChannel("phone", stt_provider=stt, tts_provider=tts)
    ai = AIChannel("claude", provider=AnthropicAIProvider(
        model="claude-sonnet-4-20250514",
        system_prompt="Tu es un assistant francophone. Réponds en français.",
    ))

    kit.register_channel(voice)
    kit.register_channel(ai)

    # Create room and wire everything
    await kit.create_room(room_id="conversation-1")
    await kit.attach_channel("conversation-1", "phone")
    await kit.attach_channel("conversation-1", "claude",
                             category=ChannelCategory.INTELLIGENCE)

    # The voice channel handles the rest:
    # Audio in → STT (Gradium) → Text → AI (Claude) → Text → TTS (Gradium) → Audio out
    print("Voice agent ready. Listening...")
    await asyncio.Event().wait()

asyncio.run(main())

That's it. The audio flows through Gradium's STT, the transcribed text goes to Claude, Claude's response goes through Gradium's TTS, and the audio goes back to the caller. RoomKit handles the orchestration, the audio pipeline (VAD, echo cancellation, denoising), and the event lifecycle.

And because this is RoomKit, you can add SMS or WhatsApp to the same room. If the call drops, the conversation continues on another channel:

from roomkit.channels.sms import SMSChannel
from roomkit.providers.sms import TwilioSMSProvider

sms = SMSChannel("sms-fallback", provider=TwilioSMSProvider(...))
kit.register_channel(sms)
await kit.attach_channel("conversation-1", "sms-fallback")

Why Gradium Stands Out

After testing several STT/TTS providers for voice AI, here's what makes Gradium different:

The semantic VAD is a game-changer. Most STT services give you raw transcripts and leave turn detection to you. Gradium's VAD returns probability-based inactivity predictions at multiple horizons. This means your agent doesn't cut people off mid-sentence, and it doesn't wait awkwardly after they've clearly finished speaking. In my testing, the inactivity_prob > 0.5 threshold on the 2-second horizon gives remarkably natural turn-taking, better than anything I've achieved with Silero VAD + arbitrary silence timeouts.

Multilingual quality is native, not bolted on. Gradium supports English, French, German, Spanish, and Portuguese out of the box, with 150+ voices. For my use case, the French voices, including Canadian French variants, are noticeably better than competitors in terms of pronunciation and natural flow. But the same quality applies across all supported languages.

The streaming architecture is done right. WebSocket-based streaming for both STT and TTS means you can pipeline everything: audio comes in and is transcribed as it arrives, the LLM starts generating while the user is still finishing their sentence, and TTS starts producing audio from the first tokens. The result is perceived latency dramatically lower than batch approaches.

Production-ready controls. The padding_bonus parameter for speed (-4.0 to +4.0), the temp parameter for vocal variety, and the rewrite rules for language-specific number/date/URL pronunciation are small things that make a big difference in production. When your voice agent reads back a phone number with proper locale-specific grouping instead of digit-by-digit, it sounds professional.
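For reference, here's what a setup message combining those controls might look like. This is a hedged sketch: the padding_bonus placement mirrors the speed handling in the TTS provider above, but putting temp in json_config as well is my assumption.

```python
# Illustrative TTS setup message with production controls. The exact
# placement of "temp" in the payload is an assumption, not confirmed
# against Gradium's documentation.
setup = {
    "type": "setup",
    "model_name": "default",
    "voice_id": "axlOaUiFyOZhy4nv",
    "output_format": "pcm",
    "json_config": {
        "padding_bonus": 1.0,  # speech speed, documented range -4.0 to +4.0
        "temp": 0.7,           # vocal variety
    },
}
```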


Practical Tips

A few lessons learned from the integration:

1. Sample rate management: Gradium STT expects 24kHz input, TTS outputs 48kHz. If your voice backend runs at 16kHz (common for telephony), configure RoomKit's resampler in the audio pipeline. The framework handles this, but be explicit about it.

2. Connection pooling: Gradium's WebSocket sessions last up to 300 seconds. For long conversations, implement reconnection logic. RoomKit's circuit breaker pattern helps here: if the WebSocket drops, the circuit opens, messages are queued, and reconnection happens with exponential backoff.

3. EU vs US endpoints: Gradium has servers in both Europe (eu.api.gradium.ai) and the US (us.api.gradium.ai). Choose based on your users' location for lowest latency. The EU endpoint also helps with data residency compliance if that matters for your industry.

4. Voice cloning for brand identity: Gradium's instant voice cloning (10-second sample) lets you create a branded voice for your agent. Combined with RoomKit's per-room AI configuration, you can have different agents with different voices in different rooms.

5. The <flush> tag: When streaming LLM output to TTS, use <flush> at natural sentence boundaries. The model buffers text for context before generating audio, so flushing at the right moments keeps latency low while maintaining natural prosody.
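To build intuition for tip 1, here's what naive integer-factor downsampling of 16-bit PCM looks like. This is a sketch for illustration only: a real resampler, including RoomKit's, applies a low-pass filter before decimating to avoid aliasing.

```python
import array

def downsample_pcm16(pcm: bytes, src_rate: int = 48000, dst_rate: int = 16000) -> bytes:
    """Keep every Nth sample of 16-bit mono PCM (no anti-aliasing filter)."""
    if src_rate % dst_rate != 0:
        raise ValueError("src_rate must be an integer multiple of dst_rate")
    factor = src_rate // dst_rate
    samples = array.array("h", pcm)  # signed 16-bit samples, native byte order
    return array.array("h", samples[::factor]).tobytes()
```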

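And for tip 2, a minimal sketch of reconnection with exponential backoff. The helper names here are mine; RoomKit's circuit breaker layers message queueing on top of logic like this.

```python
import asyncio

def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 6) -> list[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * (2**n)) for n in range(attempts)]

async def connect_with_backoff(connect, max_attempts: int = 6):
    """Retry an async connect() factory, sleeping between failed attempts."""
    last_exc: Exception | None = None
    for delay in backoff_delays(attempts=max_attempts):
        try:
            return await connect()
        except OSError as exc:  # network-level failures
            last_exc = exc
            await asyncio.sleep(delay)
    raise ConnectionError("gave up reconnecting") from last_exc
```

In the STT provider, connect would be a closure around websockets.connect plus the setup handshake, so a fresh session resumes with the right model and language.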

What's Next

RoomKit is open source (MIT licensed) and actively looking for contributors. The Gradium provider integration is a great example of how the pluggable architecture works, and I'd love to see the community build providers for other services.

If you're building voice AI in Python, start with the RoomKit repository and Gradium's API documentation.

Star the repos, try the quickstart, open an issue. The voice AI space is moving fast, and providers with native multilingual support and solid streaming APIs make the developer experience significantly better.