
Your Voice Assistant Doesn't Need the Cloud

February 7, 2026 · 14 min read

Build a fully local, open-source voice assistant in Python — no API keys, no subscriptions, no data leaving your machine.


If you've ever built a voice assistant, you know the drill: sign up for Deepgram, get an ElevenLabs API key, wire up OpenAI, watch the invoices stack up, and hope your users are comfortable sending their audio to three different cloud providers.

I wanted something different. A voice assistant that runs entirely on my machine — STT, LLM, TTS, everything — with zero cloud dependencies. And I wanted to build it with the same clean abstractions I'd use for a cloud-based setup.

Here's what I ended up with: a fully local voice pipeline running on a single NVIDIA RTX 4070, with the LLM responding in under 300 ms after the initial warmup. Everything open source. Everything local.

I built it with RoomKit, an open-source Python framework I created for multi-channel conversation orchestration.

The Stack

No API keys. No cloud. Just models running on your GPU:

Component      Tool                        Role
STT            Kroko ASR                   Speech-to-text (streaming Zipformer via sherpa-onnx)
LLM            Qwen 2.5 4B                 Language model via Ollama
TTS            Piper (fr_FR-siwis-medium)  Text-to-speech (VITS via sherpa-onnx)
VAD            TEN-VAD                     Voice activity detection (sherpa-onnx)
Orchestration  RoomKit                     Wires it all together

The entire audio pipeline looks like this:

Mic → [Resampler] → [AEC] → [Denoiser] → VAD → STT → LLM → TTS → Speaker

Every component runs locally. The heaviest lift is the LLM at 4B parameters — small enough to leave plenty of VRAM for the ONNX models.
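
Conceptually, the chain above is just a sequence of transforms over 20 ms audio frames. A stripped-down sketch of that idea (names are illustrative, not RoomKit's actual classes):

```python
from typing import Callable

# A 20 ms frame of 16 kHz mono audio is 320 float samples.
Frame = list[float]
Stage = Callable[[Frame], Frame]

def resample(frame: Frame) -> Frame:
    return frame  # a real resampler would convert between sample rates

def denoise(frame: Frame) -> Frame:
    return frame  # a real denoiser would run e.g. GTCRN via sherpa-onnx

def run_pipeline(frame: Frame, stages: list[Stage]) -> Frame:
    # Each stage consumes and produces a frame; order matters.
    for stage in stages:
        frame = stage(frame)
    return frame

frame = [0.0] * 320
out = run_pipeline(frame, [resample, denoise])
```

Downstream of this, VAD decides when a frame run becomes an utterance, and the utterance flows on to STT.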

Why sherpa-onnx?

When I started building the local voice pipeline for RoomKit, I looked at the usual suspects: faster-whisper, whisper.cpp, Coqui TTS. All great projects. But sherpa-onnx stood out for one reason: it covers the entire audio stack.

One library gives you streaming STT (Zipformer, Whisper, Paraformer), TTS (VITS/Piper voices), neural VAD (TEN-VAD, Silero), speech enhancement (GTCRN denoiser), and even echo cancellation support. All through ONNX Runtime, which means you get CUDA acceleration without PyTorch overhead.

For RoomKit, this was ideal. I could build providers for STT, TTS, VAD, and denoising that all share the same runtime, the same model format, and the same deployment story.

Show Me the Code

Let's build this step by step, starting simple.

Step 1: The providers

Each piece of the pipeline is a RoomKit provider — a pluggable component you can swap without changing your application logic:

from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTProvider, SherpaOnnxSTTConfig
from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSProvider, SherpaOnnxTTSConfig
from roomkit.voice.pipeline.vad.sherpa_onnx import SherpaOnnxVADProvider, SherpaOnnxVADConfig

# Speech-to-text: Kroko ASR (streaming Zipformer)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="transducer",
    encoder="kroko-encoder.onnx",
    decoder="kroko-decoder.onnx",
    joiner="kroko-joiner.onnx",
    tokens="tokens.txt",
    sample_rate=16000,
    provider="cuda",  # or "cpu"
))

# Text-to-speech: Piper French voice
tts = SherpaOnnxTTSProvider(SherpaOnnxTTSConfig(
    model="fr_FR-siwis-medium.onnx",
    tokens="tokens.txt",
    data_dir="espeak-ng-data",
    sample_rate=22050,
    provider="cuda",
))

# Voice activity detection: TEN-VAD
vad = SherpaOnnxVADProvider(SherpaOnnxVADConfig(
    model="ten-vad.onnx",
    model_type="ten",
    threshold=0.5,
    silence_threshold_ms=600,
    sample_rate=16000,
    provider="cpu",  # VAD is tiny — CPU is actually faster
))

Notice that VAD runs on CPU. The model is so small (runs every 20ms) that the GPU transfer overhead makes CUDA slower for it. RoomKit lets you mix providers — CUDA for the heavy models, CPU for the lightweight ones.
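
A quick back-of-the-envelope check on why (the timings below are illustrative assumptions, not measurements):

```python
frame_ms = 20
inferences_per_sec = 1000 // frame_ms  # VAD fires 50 times a second

# Illustrative numbers: a tiny model's CPU inference might take ~1 ms,
# while a host-to-GPU copy plus kernel launch adds a few ms of fixed
# overhead per call, dwarfing the actual compute.
cpu_ms_per_call = 1.0
gpu_overhead_ms = 3.0
gpu_compute_ms = 0.2

cpu_total_ms = cpu_ms_per_call * inferences_per_sec
gpu_total_ms = (gpu_overhead_ms + gpu_compute_ms) * inferences_per_sec
print(inferences_per_sec, cpu_total_ms, gpu_total_ms)
```

With per-call overhead multiplied 50 times a second, the CPU path wins for small models even when the GPU kernel itself is faster.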

Step 2: The audio pipeline

RoomKit's AudioPipelineConfig chains the preprocessing stages together:

from roomkit.voice.pipeline import AudioPipelineConfig

pipeline = AudioPipelineConfig(
    vad=vad,
    # Optional: add denoiser and echo cancellation
    # denoiser=SherpaOnnxDenoiserProvider(config),
    # aec=SpeexAECProvider(frame_size=320),
)

Want to add noise reduction? Uncomment one line. Echo cancellation for a speaker setup? One more line. Each stage is optional and pluggable.
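
One detail worth noting: the `frame_size=320` in the commented-out AEC line matches the pipeline's 20 ms blocks at the 16 kHz input rate:

```python
sample_rate = 16_000  # pipeline input rate from the STT/VAD configs
frame_size = 320      # samples per AEC frame in the example above

frame_ms = frame_size / sample_rate * 1000
print(frame_ms)  # 20.0 — one AEC frame per 20 ms mic block
```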

Step 3: The LLM

Any OpenAI-compatible server works — Ollama, vLLM, LM Studio:

from roomkit import VLLMConfig, create_vllm_provider
from roomkit.channels.ai import AIChannel

ai_provider = create_vllm_provider(VLLMConfig(
    model="qwen2.5:4b",
    base_url="http://localhost:11434/v1",  # Ollama
    max_tokens=256,
))

ai = AIChannel("ai", provider=ai_provider, system_prompt=(
    "You are a friendly voice assistant. Keep responses "
    "short and conversational — one or two sentences."
))
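
Because any OpenAI-compatible server works, you can sanity-check the endpoint before wiring it into RoomKit. A standard-library-only sketch, using the same model name and base URL as above (run `ollama serve` first; the request itself is left commented out):

```python
import json
import urllib.request

# Chat-completions request for an OpenAI-compatible server (Ollama here).
payload = {
    "model": "qwen2.5:4b",
    "max_tokens": 256,
    "messages": [
        {"role": "system", "content": "You are a friendly voice assistant. Keep responses short."},
        {"role": "user", "content": "Salut, comment ça va ?"},
    ],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:   # uncomment with Ollama running
#     print(json.load(resp)["choices"][0]["message"]["content"])
```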

Step 4: Wire it all together

This is where RoomKit shines. The room is the conversation — channels attach to it, messages flow through hooks:

from roomkit import RoomKit, VoiceChannel, ChannelCategory
from roomkit.voice.backends.local import LocalAudioBackend

kit = RoomKit()

# Local mic + speakers
backend = LocalAudioBackend(
    input_sample_rate=16000,
    output_sample_rate=22050,
    channels=1,
    block_duration_ms=20,
)

# Register channels
voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)
kit.register_channel(voice)
kit.register_channel(ai)

# Create a room and attach both channels
await kit.create_room(room_id="local-voice")
await kit.attach_channel("local-voice", "voice")
await kit.attach_channel("local-voice", "ai", category=ChannelCategory.INTELLIGENCE)

That's it. Speak into your mic and the audio flows through VAD → STT → LLM → TTS → speaker. No cloud. No API keys.
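
One practical note: those `await` calls need to live inside a coroutine. A minimal runner showing the pattern, with the RoomKit calls stubbed out so only the asyncio shape is illustrated:

```python
import asyncio

async def main(create_room, attach_channel):
    # Mirrors the wiring above: create the room, then attach channels.
    await create_room(room_id="local-voice")
    await attach_channel("local-voice", "voice")
    await attach_channel("local-voice", "ai")

# Stubs standing in for kit.create_room / kit.attach_channel:
calls = []

async def fake_create(**kwargs):
    calls.append(("create_room", kwargs))

async def fake_attach(*args):
    calls.append(("attach_channel", args))

asyncio.run(main(fake_create, fake_attach))
```

In the real script you would pass `kit.create_room` and `kit.attach_channel` instead of the stubs, then keep the loop alive while audio flows.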

Step 5: Add hooks for visibility

RoomKit's hook system lets you intercept events at any point — logging, moderation, analytics, custom routing:

from roomkit import HookTrigger, HookResult, HookExecution

@kit.hook(HookTrigger.ON_TRANSCRIPTION)
async def on_transcription(text, ctx):
    print(f"You said: {text}")
    return HookResult.allow()

@kit.hook(HookTrigger.BEFORE_TTS)
async def before_tts(text, ctx):
    print(f"Assistant: {text}")
    return HookResult.allow()

@kit.hook(HookTrigger.ON_SPEECH_START, execution=HookExecution.ASYNC)
async def on_speech_start(session, ctx):
    print("Listening...")

Hooks aren't just for logging. You can block messages, reroute them, transform text before TTS, or trigger external actions — all without touching the pipeline code.
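
A concrete example of a transforming hook: stripping markdown artifacts before synthesis, so the TTS doesn't read asterisks aloud. The cleaning logic is plain Python; how you return the transformed text from a BEFORE_TTS hook is an assumption here, so check RoomKit's hook docs for the exact `HookResult` method:

```python
import re

def clean_for_tts(text: str) -> str:
    """Strip markdown emphasis and inline-code markers so the
    synthesizer doesn't pronounce punctuation."""
    text = re.sub(r"[*_`#]+", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Inside a BEFORE_TTS hook you would return the cleaned text instead of
# allowing the original through, e.g. something along the lines of
# HookResult with the transformed payload (exact API is an assumption).
print(clean_for_tts("**Bonjour !** Voici `du code`."))
```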

What It Looks Like in Practice

Here's an actual conversation running on my machine (RTX 4070, Ollama with Qwen 2.5 4B, French models):

14:05:08 You said: Salut, comment ça va
14:05:12 Assistant: Ça va bien, merci! Et toi?

14:05:18 You said: Ouais, ça va plutôt pas mal
14:05:18 Assistant: C'est cool! Fais-toi repos et profite du bon temps.

14:05:28 You said: Qu'est-ce que tu sais faire en fait ?
14:05:29 Assistant: Je peux répondre aux questions, donner des suggestions,
          raconter des histoires et aider avec plein de choses!

The first response takes ~4 seconds (Ollama cold start). After that, the LLM responds in ~280ms. The full loop — end of speech to start of audio playback — is well under 2 seconds.
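
For a sense of where the time goes, here's the loop as a back-of-the-envelope budget. The 600 ms endpoint delay comes from `silence_threshold_ms` in the VAD config and the 280 ms from the runs above; the other figures are illustrative assumptions:

```python
# Rough end-of-speech-to-first-audio budget after warmup (milliseconds).
vad_endpoint_ms = 600    # silence_threshold_ms: time to decide speech ended
stt_finalize_ms = 50     # streaming STT has mostly kept up already (assumed)
llm_ms = 280             # Qwen 2.5 4B first tokens via Ollama
tts_first_audio_ms = 150 # time to first synthesized chunk (assumed)

total_ms = vad_endpoint_ms + stt_finalize_ms + llm_ms + tts_first_audio_ms
print(total_ms)  # 1080 — comfortably under the observed 2 s loop
```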

And yes, barge-in works. Start talking while the assistant is speaking and the TTS stops immediately:

14:05:34 TTS interrupted: reason=barge_in, position=4751ms
14:05:34 Speech started (new utterance)

This is the kind of thing that's painful to implement from scratch but comes built into RoomKit's voice pipeline.

The Real Point: Swap Without Rewriting

Here's what I find most interesting about this setup. The local voice pipeline uses the exact same RoomKit abstractions as the cloud-based one. If tomorrow you want to switch to Deepgram for STT and ElevenLabs for TTS, you swap the providers:

# Local
stt = SherpaOnnxSTTProvider(config)
tts = SherpaOnnxTTSProvider(config)

# Cloud (same interface, same hooks, same room)
stt = DeepgramSTTProvider(api_key="...")
tts = ElevenLabsTTSProvider(api_key="...", voice_id="...")

Your hooks don't change. Your room logic doesn't change. Your recording, moderation, and routing code stays the same. That's the whole point of RoomKit — the channel and provider are implementation details; the conversation is the abstraction.
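
The swap works because providers share a small interface; application code talks to the shape, not the backend. A sketch of the idea with `typing.Protocol` (illustrative, not RoomKit's actual class hierarchy):

```python
from typing import Protocol

class STTProvider(Protocol):
    """Anything with this shape can slot into the conversation."""
    def transcribe(self, audio: bytes) -> str: ...

class LocalSTT:
    # Stands in for a sherpa-onnx-backed provider.
    def transcribe(self, audio: bytes) -> str:
        return "local transcript"

class CloudSTT:
    # Stands in for a hosted-API provider.
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def transcribe(self, audio: bytes) -> str:
        return "cloud transcript"

def handle_utterance(stt: STTProvider, audio: bytes) -> str:
    # Application logic depends only on the protocol, not the backend.
    return stt.transcribe(audio)

print(handle_utterance(LocalSTT(), b"\x00"))
```

Swapping `LocalSTT()` for `CloudSTT(api_key="...")` changes nothing in `handle_utterance`, which is the property the hooks and room logic rely on.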

Getting Started

Prerequisites: a working Python environment, Ollama installed and running, and optionally an NVIDIA GPU with CUDA for acceleration.

Install:

pip install roomkit[local-audio,openai,sherpa-onnx]
ollama pull qwen2.5:4b

Download models (one time):

# VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/vad-models/ten-vad.onnx

# STT - Kroko ASR
# (download from https://huggingface.co/Banafo/Kroko-ASR)

# TTS - Piper French voice
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-fr_FR-siwis-medium.tar.bz2
tar xf vits-piper-fr_FR-siwis-medium.tar.bz2

For GPU acceleration (optional but recommended):

# Install cuDNN 9
sudo apt-get install cudnn9-cuda-12

# Install CUDA wheel for sherpa-onnx
pip install sherpa-onnx==1.12.23+cuda12.cudnn9 \
    -f https://k2-fsa.github.io/sherpa/onnx/cuda.html

export ONNX_PROVIDER=cuda

Run the full example:

git clone https://github.com/roomkit-live/roomkit
cd roomkit

# Set your model paths and run
LLM_MODEL=qwen2.5:4b \
LLM_BASE_URL=http://localhost:11434/v1 \
VAD_MODEL=ten-vad.onnx \
STT_ENCODER=<path-to-encoder> \
STT_DECODER=<path-to-decoder> \
STT_JOINER=<path-to-joiner> \
STT_TOKENS=<path-to-tokens> \
TTS_MODEL=fr_FR-siwis-medium.onnx \
TTS_TOKENS=tokens.txt \
TTS_DATA_DIR=espeak-ng-data \
ONNX_PROVIDER=cuda \
python examples/voice_local_onnx_vllm.py

The full example on GitHub includes everything: echo cancellation, noise reduction, WAV recording, debug taps, and graceful shutdown.

What's Next

This example is a starting point. RoomKit's architecture means you can extend it in any direction: swap in different models, add more channels, or move individual providers to the cloud when you need to.

The full source is at github.com/roomkit-live/roomkit. Star the repo, try the example, open an issue. The framework is MIT licensed and actively looking for contributors.