RTP Voice Backend

RTPVoiceBackend is a voice backend that sends and receives audio over RTP, for integration with PBX/SIP gateways or any other system that speaks RTP. It uses the aiortp library for packet handling, codec encoding/decoding, and RFC 4733 DTMF.

Quick start

from roomkit.voice.backends.rtp import RTPVoiceBackend
from roomkit.voice.pipeline import AudioPipelineConfig, MockVADProvider
from roomkit import VoiceChannel

backend = RTPVoiceBackend(
    local_addr=("0.0.0.0", 10000),
    remote_addr=("192.168.1.100", 20000),
)

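# `vad`, `stt`, and `tts` are provider instances created elsewhere
# (for tests, `vad` can be a MockVADProvider).
pipeline = AudioPipelineConfig(vad=vad)
voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)

# The calls below must run inside a coroutine, e.g. under asyncio.run().
session = await backend.connect("room-1", "user-1", "voice")
# ... pipeline processes inbound RTP audio, AI responds, TTS sends outbound RTP ...
await backend.disconnect(session)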

Install with:

pip install "roomkit[rtp]"

Constructor parameters

Parameter	Type	Default	Description
`local_addr`	`(str, int)`	`("0.0.0.0", 0)`	Host and port to bind RTP. Use port `0` for OS-assigned.
`remote_addr`	`(str, int) | None`	`None`	Host and port to send RTP to. Can be overridden per-session.
`payload_type`	`int`	`0`	RTP payload type number (see codec table below).
`clock_rate`	`int`	`8000`	Clock rate in Hz. Must match the payload type.
`dtmf_payload_type`	`int`	`101`	RTP payload type for RFC 4733 DTMF events.
`rtcp_interval`	`float`	`5.0`	Seconds between RTCP sender reports.

Codec selection

Set payload_type and clock_rate to match the codec negotiated with the remote endpoint:

Codec	`payload_type`	`clock_rate`	Notes
PCMU (G.711 mu-law)	`0`	`8000`	Default. Standard telephony codec.
PCMA (G.711 A-law)	`8`	`8000`	Common in European telephony.
L16 (uncompressed)	`11`	`44100`	Linear 16-bit PCM. High bandwidth.

# G.711 A-law example
backend = RTPVoiceBackend(
    local_addr=("0.0.0.0", 10000),
    remote_addr=("pbx.local", 20000),
    payload_type=8,
    clock_rate=8000,
)
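
To keep `payload_type` and `clock_rate` from drifting apart, the table above can be captured in a small lookup. This is a minimal sketch: `CODECS` and `codec_params` are hypothetical helper names used for illustration and are not part of roomkit.

```python
# Hypothetical helper (not part of roomkit): the codec table above
# expressed as name -> (payload_type, clock_rate).
CODECS = {
    "PCMU": (0, 8000),    # G.711 mu-law
    "PCMA": (8, 8000),    # G.711 A-law
    "L16": (11, 44100),   # linear 16-bit PCM
}


def codec_params(name: str) -> dict[str, int]:
    """Return constructor keyword arguments for the named codec."""
    try:
        payload_type, clock_rate = CODECS[name.upper()]
    except KeyError:
        raise ValueError(f"unsupported codec: {name!r}") from None
    return {"payload_type": payload_type, "clock_rate": clock_rate}
```

The result can then be unpacked into the constructor, for example `RTPVoiceBackend(local_addr=("0.0.0.0", 10000), **codec_params("PCMA"))`.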

Per-session remote address

When connecting to multiple endpoints (e.g. a multi-tenant PBX), you can omit remote_addr from the constructor and supply it per-session via metadata["remote_addr"]:

backend = RTPVoiceBackend(local_addr=("0.0.0.0", 10000))

# Each session gets its own remote endpoint
session1 = await backend.connect(
    "room-1", "caller-1", "voice",
    metadata={"remote_addr": ("10.0.0.1", 20000)},
)
session2 = await backend.connect(
    "room-2", "caller-2", "voice",
    metadata={"remote_addr": ("10.0.0.2", 20000)},
)

If neither the constructor nor metadata provides a remote address, connect() raises ValueError.
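
The precedence described above (session metadata first, then the constructor default, otherwise an error) can be sketched as a standalone function. `resolve_remote_addr` is a hypothetical name used only for illustration; the backend's internal implementation may differ.

```python
from typing import Any, Optional

Addr = tuple[str, int]


def resolve_remote_addr(
    default: Optional[Addr],
    metadata: Optional[dict[str, Any]] = None,
) -> Addr:
    """Pick the per-session address if present, else the constructor default."""
    override = (metadata or {}).get("remote_addr")
    addr = override if override is not None else default
    if addr is None:
        # Mirrors the documented behaviour: no address anywhere is an error.
        raise ValueError("no remote_addr in constructor or session metadata")
    host, port = addr
    return (str(host), int(port))
```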

DTMF

The backend receives DTMF digits out-of-band via RFC 4733 (telephone-event payload type). Inbound DTMF events are delivered as DTMFEvent objects containing digit (str) and duration_ms (float).

DTMF integrates with the hook system via HookTrigger.ON_DTMF:

from roomkit import HookTrigger, HookExecution

# `kit` is your roomkit application instance, created elsewhere.
@kit.hook(HookTrigger.ON_DTMF, execution=HookExecution.ASYNC)
async def on_dtmf(event, ctx):
    print(f"DTMF digit: {event.digit}, duration: {event.duration_ms}ms")

You can also add an in-band MockDTMFDetector to the pipeline for testing:

from roomkit.voice.pipeline.dtmf import MockDTMFDetector

pipeline = AudioPipelineConfig(
    vad=vad,
    dtmf=MockDTMFDetector(),
)
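
For context on what arrives out-of-band, an RFC 4733 telephone-event payload is four bytes: an event code, a flags byte (end bit, reserved bit, 6-bit volume), and a 16-bit duration in RTP timestamp units. The sketch below decodes one such payload by hand. aiortp does this for you, so `TelephoneEvent` and `parse_telephone_event` are hypothetical names used only to illustrate the wire format and how a millisecond duration can be derived from it.

```python
import struct
from dataclasses import dataclass

# DTMF event codes 0-15 as assigned by RFC 4733.
DTMF_DIGITS = "0123456789*#ABCD"


@dataclass
class TelephoneEvent:  # hypothetical type, for illustration only
    digit: str
    end: bool           # True on the final packet(s) of a key press
    volume: int         # power level in -dBm0 (0-63)
    duration_ms: float


def parse_telephone_event(payload: bytes, clock_rate: int = 8000) -> TelephoneEvent:
    """Decode the 4-byte RFC 4733 payload (network byte order)."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_DIGITS):
        raise ValueError(f"not a DTMF event code: {event}")
    return TelephoneEvent(
        digit=DTMF_DIGITS[event],
        end=bool(flags & 0x80),
        volume=flags & 0x3F,
        # Duration is counted in timestamp units, so convert via the clock rate.
        duration_ms=duration * 1000 / clock_rate,
    )
```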

Capabilities

The RTP backend declares two capabilities:

Capability	Description
`DTMF_SIGNALING`	DTMF digits are received out-of-band via RFC 4733. The pipeline skips in-band DTMF detection when this is set.
`INTERRUPTION`	Outbound audio playback can be cancelled mid-stream (for barge-in).
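
As a rough illustration of how a pipeline could act on these capabilities, the sketch below models them as bit flags and skips in-band detection when `DTMF_SIGNALING` is present. The `Capability` enum and `needs_inband_dtmf` function are hypothetical stand-ins, not roomkit's actual types.

```python
from enum import Flag, auto


class Capability(Flag):  # hypothetical stand-in for the backend's capability flags
    NONE = 0
    DTMF_SIGNALING = auto()
    INTERRUPTION = auto()


# The two capabilities the RTP backend declares.
RTP_CAPABILITIES = Capability.DTMF_SIGNALING | Capability.INTERRUPTION


def needs_inband_dtmf(capabilities: Capability) -> bool:
    """In-band detection is only useful when digits do not arrive out-of-band."""
    return Capability.DTMF_SIGNALING not in capabilities
```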

Audio flow

Inbound

Remote endpoint → RTP packets → aiortp decode → PCM bytes
  → AudioFrame(sample_rate=clock_rate, channels=1, sample_width=2)
  → on_audio_received callback → AudioPipeline inbound chain

aiortp handles codec decoding (G.711/L16) and delivers raw PCM-16 LE. The backend wraps each decoded buffer in an AudioFrame and passes it to the pipeline.
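
To make the decode step concrete, here is a minimal sketch of standard G.711 mu-law expansion into PCM-16 LE bytes. aiortp performs this internally, so application code does not need it; `ulaw_to_linear` and `decode_pcmu` are hypothetical names used only to show what "decoded PCM" means for a PCMU payload.

```python
import struct


def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF               # mu-law bytes are transmitted bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    # Rebuild the magnitude, then remove the encoder's bias of 0x84.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_pcmu(payload: bytes) -> bytes:
    """Decode a PCMU payload into PCM-16 little-endian bytes."""
    samples = [ulaw_to_linear(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each mu-law byte becomes two PCM bytes, so a 20 ms PCMU packet at 8000 Hz (160 bytes) decodes to 320 bytes of PCM.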

Outbound

TTS → AudioChunk stream or bytes → PCM-16 LE
  → 20ms RTP frames (clock_rate / 50 samples each)
  → aiortp encode → RTP packets → remote endpoint

The backend accepts either a complete bytes buffer or an AsyncIterator[AudioChunk]. PCM data is split into 20ms frames and sent with incrementing RTP timestamps.
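
The framing arithmetic can be sketched as follows: at a given clock rate, a 20 ms frame holds `clock_rate / 50` samples (160 at 8000 Hz), and the RTP timestamp advances by that many units per frame. `frame_pcm` is a hypothetical helper for illustration. Padding a short final frame with silence is an assumption of this sketch, not documented backend behaviour.

```python
from typing import Iterator

SAMPLE_WIDTH = 2  # bytes per PCM-16 sample


def frame_pcm(
    pcm: bytes, clock_rate: int = 8000, start_timestamp: int = 0
) -> Iterator[tuple[int, bytes]]:
    """Yield (rtp_timestamp, frame_bytes) pairs of 20 ms each."""
    samples_per_frame = clock_rate // 50
    frame_bytes = samples_per_frame * SAMPLE_WIDTH
    timestamp = start_timestamp
    for offset in range(0, len(pcm), frame_bytes):
        frame = pcm[offset:offset + frame_bytes]
        # Assumption: pad a short final frame with silence (zero samples).
        frame = frame.ljust(frame_bytes, b"\x00")
        # RTP timestamps are 32-bit and wrap around.
        yield timestamp & 0xFFFFFFFF, frame
        timestamp += samples_per_frame
```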

SIP integration

The RTP backend handles only the media (RTP) plane. For full SIP signaling (INVITE, BYE, codec negotiation via SDP), use the SIP Voice Backend instead, which handles the complete call lifecycle automatically. The SIP backend uses aiosipua for signaling and creates RTP sessions from negotiated SDP.

API Reference

See the RTP Backend API Reference for auto-generated class documentation.

Example

See examples/voice_rtp.py for a complete runnable example with mock providers, DTMF handling, and session lifecycle.