RoomKit Speaks SIP: Bridge Any PBX to Conversational AI

What if you could turn any phone call into a live conversation with an AI, in well under 100 lines of Python?

RoomKit now supports SIP natively. Incoming calls from any SIP-compatible PBX, proxy, or trunk are answered automatically and bridged to a conversational AI model in real time. The caller speaks naturally and the AI responds in real time, all over a regular phone line.

No WebRTC. No browser. Just a phone call.

Why SIP?

The world runs on phone calls. Enterprise contact centers, medical clinics, financial advisory firms, government services: they all rely on telephony infrastructure built on SIP. Millions of PBX systems, SIP trunks, and softphones are deployed worldwide.

Meanwhile, the most capable conversational AI models now support real-time speech-to-speech interaction. The missing piece was a clean, lightweight bridge between these two worlds.

RoomKit fills that gap. It handles SIP signaling, RTP media transport, audio resampling, and session management, so you can focus on the conversation, not the plumbing.

How It Works

The architecture is straightforward:

Phone Call → PBX / SIP Proxy → RoomKit → AI Model
                                  ↕
                          Audio resampling
                       16 kHz (SIP) ↔ 24 kHz (AI)

When a call arrives:

1. RoomKit's SIP backend accepts the INVITE and negotiates codecs (G.722 wideband preferred, G.711 fallback).
2. An RTP session is established for bidirectional audio.
3. The call flows through process_inbound(), the same unified entry point used for text messages, running through all registered hooks (spam filtering, logging, traces).
4. A room is created, the channel is attached, and a realtime AI session starts automatically.
5. Audio flows both ways: the caller's voice goes to the AI, and the AI's response goes back to the caller.
6. When either side hangs up, everything is cleaned up automatically.
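
To make the codec preference in the first step concrete, here is a minimal sketch of picking a codec from an SDP offer. It is not RoomKit's or aiosipua's actual negotiation code: choose_codec, STATIC_PAYLOAD_TYPES, and PREFERENCE are hypothetical names. The sketch only looks at the static payload types defined in RFC 3551 (0 for PCMU, 8 for PCMA, 9 for G.722) and ignores dynamic types declared through a=rtpmap.

```python
from typing import Optional, Tuple

# Static RTP payload types from RFC 3551. The numbers are standard; the
# helper below is a hypothetical illustration, not part of RoomKit.
STATIC_PAYLOAD_TYPES = {9: "G722", 0: "PCMU", 8: "PCMA"}

# Preference order described in the text: wideband first, then G.711.
PREFERENCE = ["G722", "PCMU", "PCMA"]


def choose_codec(sdp_offer: str) -> Optional[Tuple[int, str]]:
    """Return (payload_type, codec_name) for the best offered codec, or None."""
    offered = []
    for line in sdp_offer.splitlines():
        line = line.strip()
        if line.startswith("m=audio "):
            # Format: m=audio <port> <proto> <fmt> <fmt> ...
            fields = line.split()
            offered = [int(f) for f in fields[3:] if f.isdigit()]
            break
    by_name = {
        STATIC_PAYLOAD_TYPES[pt]: pt for pt in offered if pt in STATIC_PAYLOAD_TYPES
    }
    for name in PREFERENCE:
        if name in by_name:
            return by_name[name], name
    return None


# A made-up offer from a PBX that lists G.711 first and G.722 third.
offer = "\r\n".join([
    "v=0",
    "o=- 0 0 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 40000 RTP/AVP 0 8 9 101",
    "a=rtpmap:101 telephone-event/8000",
])
print(choose_codec(offer))  # (9, 'G722')
```

The PBX lists G.711 first here, but the sketch still selects G.722 because it ranks the offered codecs by its own preference order.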

The PBX can pass application context through custom SIP headers (X-Room-ID, X-Session-ID), which RoomKit uses to route calls to the right room and session. This makes it easy to build multi-tenant systems, IVR flows, or context-aware assistants that know who's calling and why.
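
As a rough illustration of what reading those headers involves, the sketch below pulls X-Room-ID and X-Session-ID out of a raw INVITE. The function names and the sample message are hypothetical, and the "session_id" key is an assumption (the example later in this post only reads "room_id" from the session metadata). A real SIP parser such as aiosipua also has to deal with folded lines, compact header forms, and repeated headers, which this sketch skips.

```python
from typing import Dict


def parse_sip_headers(raw_message: str) -> Dict[str, str]:
    """Parse the header block of a SIP message into a lowercase-keyed dict.

    Simplification: a repeated header (for example several Via lines)
    keeps only its last value.
    """
    head, _, _body = raw_message.partition("\r\n\r\n")
    headers = {}
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep:
            # SIP header names are case-insensitive, so normalise them.
            headers[name.strip().lower()] = value.strip()
    return headers


def routing_context(raw_message: str) -> Dict[str, str]:
    """Extract the application context carried in the custom X-headers."""
    headers = parse_sip_headers(raw_message)
    context = {}
    if "x-room-id" in headers:
        context["room_id"] = headers["x-room-id"]
    if "x-session-id" in headers:
        context["session_id"] = headers["x-session-id"]
    return context


# A made-up INVITE as a PBX might send it, with mixed-case X-headers.
invite = (
    "INVITE sip:assistant@192.0.2.20 SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:+15551234567@192.0.2.10>;tag=1928301774\r\n"
    "To: <sip:assistant@192.0.2.20>\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 1 INVITE\r\n"
    "X-Room-ID: clinic-42\r\n"
    "x-session-id: sess-7f3a\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)
print(routing_context(invite))
```

Matching header names case-insensitively matters in practice, because different PBXs do not agree on how they capitalize custom headers.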

The Code

Here's a complete working example: a SIP server that answers calls, blocks spam, logs protocol traces, and connects callers to Google Gemini Live. Voice sessions flow through process_inbound(), the same entry point used for text messages, with the same hooks and pipeline:

import asyncio
import os

from roomkit import RealtimeVoiceChannel, RoomKit
from roomkit.models.context import RoomContext
from roomkit.models.enums import HookTrigger
from roomkit.models.event import RoomEvent, SystemContent
from roomkit.models.hook import HookResult
from roomkit.models.trace import ProtocolTrace
from roomkit.providers.gemini.realtime import GeminiLiveProvider
from roomkit.voice import parse_voice_session
from roomkit.voice.backends.sip import SIPVoiceBackend
from roomkit.voice.realtime.sip_transport import SIPRealtimeTransport

BLOCKED_CALLERS = {"+15550000000"}

async def main():
    kit = RoomKit()

    # SIP backend: answers incoming calls
    sip = SIPVoiceBackend(
        local_sip_addr=("0.0.0.0", 5060),
        local_rtp_ip="0.0.0.0",
        rtp_port_start=10000,
        rtp_port_end=20000,
    )

    # AI provider + bridge transport
    gemini = GeminiLiveProvider(
        api_key=os.environ["GOOGLE_API_KEY"],
        model="gemini-2.5-flash-native-audio-preview-12-2025",
    )
    transport = SIPRealtimeTransport(sip)

    realtime = RealtimeVoiceChannel(
        "realtime-voice",
        provider=gemini,
        transport=transport,
        system_prompt="You are a friendly phone assistant.",
        voice="Aoede",
        input_sample_rate=16000,
        output_sample_rate=24000,
    )
    kit.register_channel(realtime)

    # Hooks — same hooks work for voice AND text
    @kit.hook(HookTrigger.BEFORE_BROADCAST)
    async def gate_incoming(event: RoomEvent, ctx: RoomContext) -> HookResult:
        if isinstance(event.content, SystemContent) and event.content.code == "session_started":
            caller = event.content.data.get("caller")
            if caller in BLOCKED_CALLERS:
                return HookResult.block(f"caller_blocked:{caller}")
        return HookResult.allow()

    @kit.hook(HookTrigger.ON_PROTOCOL_TRACE)
    async def on_trace(trace: ProtocolTrace, ctx: RoomContext) -> None:
        print(f"[{ctx.room.id}] {trace.direction} {trace.protocol}: {trace.summary}")

    # Incoming call → unified process_inbound (same as text)
    @sip.on_call
    async def handle_call(session):
        result = await kit.process_inbound(
            parse_voice_session(session, channel_id="realtime-voice"),
            room_id=session.metadata.get("room_id"),
        )
        if result.blocked:
            print(f"Call rejected: {result.reason}")

    # Hangup → cleanup
    @sip.on_call_disconnected
    async def handle_disconnect(session):
        room_id = session.metadata.get("room_id", session.id)
        for rt in realtime._get_room_sessions(room_id):
            await realtime.end_session(rt)
        await kit.close_room(room_id)

    await sip.start()
    print("Waiting for SIP calls on port 5060...")
    await asyncio.Event().wait()

asyncio.run(main())

That's it. Point your PBX at this server, dial in, and talk to the AI. The BEFORE_BROADCAST hook blocks spam callers the same way it would block abusive text messages. The ON_PROTOCOL_TRACE hook receives SIP signaling (INVITE, 200 OK, BYE) with full raw payloads; this example prints a one-line summary of each message. Same hooks, same interface, whether the input is a phone call or a webhook.

Audio Quality Matters

Telephony traditionally uses narrowband audio: G.711 sampled at 8 kHz, a codec that dates back to the 1970s. It works, but it's not great for AI. Speech recognition accuracy drops, and synthesized voices sound tinny and robotic.

RoomKit negotiates G.722 wideband (16 kHz) when the PBX supports it, which most modern systems do. The difference is immediately noticeable: the AI hears the caller more clearly, and the AI's voice sounds natural and human. For AI models that operate at 24 kHz, RoomKit handles the resampling transparently, with no extra configuration needed.
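
To show what resampling between 16 kHz and 24 kHz means, here is a deliberately simple sketch that uses linear interpolation. It is an assumption-laden illustration, not RoomKit's resampler: resample_linear is a hypothetical helper, and production code would normally use a filtered (for example polyphase) resampler to avoid aliasing when going back down to 16 kHz.

```python
from typing import List


def resample_linear(samples: List[int], src_rate: int, dst_rate: int) -> List[int]:
    """Resample PCM samples by linear interpolation between neighbours."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = len(samples) * dst_rate // src_rate
    out = []
    for i in range(out_len):
        # Position of output sample i on the input timeline.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


# A 20 ms frame is 320 samples at 16 kHz and 480 samples at 24 kHz.
frame_16k = [0] * 320
print(len(resample_linear(frame_16k, 16000, 24000)))  # 480

# A short ramp makes the interpolation visible.
print(resample_linear([0, 300, 600, 900], 16000, 24000))
# [0, 200, 400, 600, 800, 900]
```

The 2:3 ratio between the two rates is why every two input samples turn into three output samples.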

When G.722 isn't available, RoomKit falls back to G.711 (PCMU/PCMA) automatically. It just works.
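
For a sense of what the G.711 fallback does to the audio, here is a sketch of mu-law companding (the PCMU variant), which squeezes each 16-bit sample into 8 bits. It follows the widely used reference algorithm rather than anything taken from aiortp, and the function names are hypothetical.

```python
BIAS = 0x84   # 132, added to the magnitude before locating the segment
CLIP = 32635  # largest magnitude that still fits in 15 bits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent = (magnitude >> 7).bit_length() - 1  # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Decode an 8-bit mu-law byte back to a signed PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


print(hex(linear_to_ulaw(0)))                # 0xff (digital silence)
print(ulaw_to_linear(linear_to_ulaw(1000)))  # 988
```

Quiet samples survive almost unchanged while loud ones are quantized coarsely, which is the trade-off that lets G.711 fit telephone speech into 64 kbit/s.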

Under the Hood

RoomKit's SIP support is built on two open-source Python libraries we developed specifically for this use case:

aiortp: an asyncio-native RTP library. Handles packet send/receive, jitter buffering, G.711 and G.722 codecs, DTMF (RFC 4733), and RTCP. Pure Python, zero dependencies.

aiosipua: an asyncio SIP micro-library. Parses and builds SIP messages, negotiates SDP, manages dialogs, and provides a clean UAS/UAC API. Designed to sit behind a SIP proxy, it handles signaling, not routing.

Both libraries are intentionally minimal. They don't try to be a full PBX or a complete VoIP stack. They do exactly what's needed to bridge SIP calls to application logic, and nothing more.
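
To give a feel for the kind of work an RTP library does on every packet, the sketch below parses the 12-byte fixed RTP header defined in RFC 3550. It is not aiortp's API: RtpHeader and parse_rtp are hypothetical names, the sample packet is made up, and the sketch ignores header extensions and padding.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp(packet: bytes) -> Tuple[RtpHeader, bytes]:
    """Split an RTP packet into its fixed header fields and payload."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the 12-byte fixed header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = first & 0x0F
    header = RtpHeader(
        version=first >> 6,
        marker=bool(second & 0x80),
        payload_type=second & 0x7F,
        sequence=sequence,
        timestamp=timestamp,
        ssrc=ssrc,
    )
    # Skip any CSRC identifiers (4 bytes each) that follow the fixed header.
    payload = packet[12 + 4 * csrc_count:]
    return header, payload


# A made-up packet: version 2, payload type 9 (G.722), 160 bytes of payload.
packet = struct.pack("!BBHII", 0x80, 9, 1, 160, 0x12345678) + bytes(160)
header, payload = parse_rtp(packet)
print(header.payload_type, header.sequence, len(payload))  # 9 1 160
```

The sequence number and timestamp recovered here are what a jitter buffer uses to reorder packets and detect loss.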

The key design decisions:

Async-native: everything runs on Python's asyncio event loop. No threads, no blocking. A single process can handle many concurrent calls.
Zero required dependencies: both libraries use only the Python standard library. Optional codecs (Opus) can be added if needed.
Proxy-friendly: the SIP library is designed to work behind Kamailio, OpenSIPS, Asterisk, FreeSWITCH, or any SIP proxy. It doesn't handle authentication, retransmissions, or NAT traversal; that's the proxy's job.
Custom header support: application metadata flows through standard SIP X-headers. The PBX sets them, RoomKit reads them. Simple.
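
The async-native point can be sketched with plain asyncio. The toy below has nothing RoomKit-specific in it: handle_call and run_calls are hypothetical names, and asyncio.sleep stands in for the time a real handler spends waiting on RTP packets or AI audio. It only shows that several call handlers interleave on one event loop without threads.

```python
import asyncio
from typing import List


async def handle_call(call_id: str, log: List[str]) -> None:
    """Pretend to handle one call; the sleep stands in for waiting on I/O."""
    log.append(f"{call_id}: answered")
    await asyncio.sleep(0.05)
    log.append(f"{call_id}: hung up")


async def run_calls() -> List[str]:
    log = []
    # All three handlers share a single event loop and a single thread.
    await asyncio.gather(*(handle_call(f"call-{n}", log) for n in range(3)))
    return log


events = asyncio.run(run_calls())
for event in events:
    print(event)
```

All three calls are answered before the first one hangs up, because each handler yields control to the event loop whenever it waits.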

What You Can Build

This opens up a range of applications:

AI-powered IVR: replace rigid menu trees with natural language. "What would you like to do?" instead of "Press 1 for..."
Phone-based assistants: customer support, appointment scheduling, information hotlines, all handled by AI with human-like conversation.
Compliance recording with AI: in regulated industries like financial services, calls must be recorded. RoomKit can transcribe and analyze calls in real time while maintaining compliance.
Multilingual support: the AI model handles language switching naturally. A single phone number can serve callers in any language.
Hybrid human/AI workflows: start with AI, escalate to a human agent when needed, or have the AI assist the human agent in real time.

Getting Started

Install RoomKit with SIP support:

pip install "roomkit[sip,realtime-gemini]"

Configure your PBX to route calls to the RoomKit server's IP on port 5060. Add X-Room-ID and X-Session-ID headers in your dialplan for session routing.

Run the example:

GOOGLE_API_KEY=your-key python examples/realtime_voice_sip_gemini.py

Pick up the phone and start talking.

What's Already There

RoomKit is not just a SIP bridge. It already supports multiple AI providers (Gemini Live, OpenAI Realtime), multi-party rooms where several participants and AI channels coexist, and edge deployment with ONNX models for low-latency, privacy-sensitive applications.

SIP support adds the last missing piece: connecting all of this to the phone network. Any PBX, any trunk, any softphone can now be a front door to conversational AI.

The goal is simple: any phone, any AI, any language, connected with a few lines of Python.