Video Backends (RTP & SIP)

Transport backends for sending and receiving real-time video over RTP, with optional SIP signaling. These extend the voice backends with parallel video RTP sessions, so a single backend handles both audio and video for a call.

Architecture

Both video backends use multiple inheritance to combine voice and video transport:

RTPVideoBackend(RTPVoiceBackend, VideoBackend)
SIPVideoBackend(SIPVoiceBackend, VideoBackend)

A single connect() or incoming INVITE creates both audio and video RTP sessions. The video session shares the same session ID as the voice session, so you can look up either from the same ID.
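The composition described above can be sketched with stand-in classes. This is only an illustration of the pattern (cooperative multiple inheritance plus one shared session ID); `VoiceTransport`, `VideoTransport`, and `AVTransport` are hypothetical names, not RoomKit classes, and the real backends will differ in detail.

```python
from __future__ import annotations

import uuid


class VoiceTransport:
    """Hypothetical stand-in for a voice backend."""

    def __init__(self) -> None:
        super().__init__()
        self.voice_sessions: dict[str, str] = {}

    def connect(self, room_id: str) -> str:
        session_id = uuid.uuid4().hex
        self.voice_sessions[session_id] = f"audio:{room_id}"
        return session_id


class VideoTransport:
    """Hypothetical stand-in for the video side."""

    def __init__(self) -> None:
        super().__init__()
        self.video_sessions: dict[str, str] = {}

    def get_video_session(self, session_id: str) -> str | None:
        return self.video_sessions.get(session_id)


class AVTransport(VoiceTransport, VideoTransport):
    """Combines both bases; one connect() registers audio and video."""

    def connect(self, room_id: str) -> str:
        # Let the voice base allocate the ID, then reuse it for video.
        session_id = super().connect(room_id)
        self.video_sessions[session_id] = f"video:{room_id}"
        return session_id
```

Because both registries are keyed by the same ID, a caller holding the voice session can fetch the matching video session without any extra mapping.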

Inbound:
  RTP video packets → aiortp VideoRTPSession → H.264 depacketization
    → VideoFrame(codec="h264", data=nal_bytes, keyframe, timestamp_ms)
    → on_video_received callback

Outbound:
  send_video(session, bytes | AsyncIterator[VideoChunk])
    → H.264 packetization → RTP video packets → remote
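To make the packetization and depacketization steps concrete, here is a minimal sketch of FU-A fragmentation as defined in RFC 6184. It is not the code aiortp runs; the function names and the 1200-byte payload limit are assumptions chosen for illustration, and aggregation packets (STAP-A) are left out.

```python
FU_A_TYPE = 28  # NAL unit type reserved for FU-A fragments in RFC 6184


def packetize_h264(nal: bytes, max_payload: int = 1200) -> list[bytes]:
    """Split one NAL unit into RTP payloads (single NAL unit mode or FU-A)."""
    if not nal:
        raise ValueError("empty NAL unit")
    if max_payload < 3:
        raise ValueError("max_payload must leave room for the 2-byte FU header")
    if len(nal) <= max_payload:
        return [nal]  # fits in one packet: send the NAL unit as-is

    header = nal[0]
    fu_indicator = (header & 0xE0) | FU_A_TYPE  # keep F and NRI bits
    nal_type = header & 0x1F
    body = nal[1:]  # the original header byte is not repeated in fragments
    chunk = max_payload - 2

    payloads = []
    for offset in range(0, len(body), chunk):
        start_bit = 0x80 if offset == 0 else 0
        end_bit = 0x40 if offset + chunk >= len(body) else 0
        fu_header = start_bit | end_bit | nal_type
        payloads.append(bytes([fu_indicator, fu_header]) + body[offset:offset + chunk])
    return payloads


def depacketize_h264(payloads: list[bytes]) -> bytes:
    """Rebuild one NAL unit from in-order payloads produced above."""
    first = payloads[0]
    if first[0] & 0x1F != FU_A_TYPE:
        return first  # single NAL unit packet
    # Restore the original NAL header from the FU indicator and FU header.
    header = (first[0] & 0xE0) | (first[1] & 0x1F)
    return bytes([header]) + b"".join(p[2:] for p in payloads)
```

The sketch assumes payloads arrive complete and in order; a real receiver also has to handle RTP sequence gaps and discard partially received NAL units.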

RTPVideoBackend

Direct RTP video transport with pre-configured endpoints. No SIP signaling — you supply the addresses.

Quick start

from roomkit.video.backends.rtp import RTPVideoBackend
from roomkit import VoiceChannel
from roomkit.voice.pipeline import AudioPipelineConfig

backend = RTPVideoBackend(
    local_addr=("0.0.0.0", 10000),
    remote_addr=("192.168.1.100", 20000),
    video_local_addr=("0.0.0.0", 10002),
    video_remote_addr=("192.168.1.100", 20002),
)

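# stt, tts, pipeline (an AudioPipelineConfig), and kit are assumed to be
# created elsewhere in your application
voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)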

# Pull model: kit.join() creates the session, binds it, and wires recording
session = await kit.join("room-1", "voice", participant_id="user-1")
# Audio + video RTP sessions are now active

video_session = backend.get_video_session(session.id)

Install with:

pip install roomkit[rtp]

Constructor parameters

All RTPVoiceBackend parameters plus:

Parameter	Type	Default	Description
`video_local_addr`	`(str, int)`	`("0.0.0.0", 0)`	Host and port to bind video RTP.
`video_remote_addr`	`(str, int) | None`	`None`	Host and port to send video RTP to. Can be overridden per-session.
`video_payload_type`	`int`	`96`	RTP payload type for video. 96 is the first value in the dynamic range (96-127) and is commonly mapped to H.264 in SDP.
`video_clock_rate`	`int`	`90000`	Video RTP clock rate in Hz.
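The 90000 Hz clock rate determines how millisecond timestamps map onto RTP timestamps. The helper below is a small sketch of that arithmetic; `ms_to_rtp_timestamp` and its `offset` parameter are hypothetical names used for illustration, not part of the backend API.

```python
def ms_to_rtp_timestamp(ms: float, clock_rate: int = 90000, offset: int = 0) -> int:
    """Convert a millisecond timestamp to a 32-bit RTP timestamp.

    offset models the random initial RTP timestamp a sender normally picks.
    """
    ticks = round(ms * clock_rate / 1000)
    return (offset + ticks) % 2**32  # RTP timestamps wrap at 32 bits
```

At 90 kHz, one frame of 30 fps video advances the RTP timestamp by 3000 ticks, and a 33 ms step advances it by 2970.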

Per-session video remote address

As with the voice backend, you can omit video_remote_addr from the constructor and supply it per-session through the join metadata:

backend = RTPVideoBackend(
    local_addr=("0.0.0.0", 10000),
    remote_addr=("10.0.0.1", 20000),
    video_local_addr=("0.0.0.0", 10002),
)

session = await kit.join(
    "room-1", "voice",
    participant_id="user-1",
    metadata={"video_remote_addr": ("10.0.0.1", 20002)},
)

If neither the constructor nor metadata provides a video remote address, connect() raises ValueError.
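The lookup order described here can be sketched as a small helper. This is an assumption about the precedence (session metadata first, constructor default second) based on the parameter table; `resolve_video_remote_addr` is a hypothetical name, not a RoomKit function.

```python
from __future__ import annotations

from typing import Any, Mapping

Addr = tuple[str, int]


def resolve_video_remote_addr(
    default: Addr | None,
    metadata: Mapping[str, Any] | None,
) -> Addr:
    """Pick the per-session address if present, else the constructor default."""
    candidate = (metadata or {}).get("video_remote_addr") or default
    if candidate is None:
        raise ValueError("no video remote address in constructor or session metadata")
    host, port = candidate
    return str(host), int(port)
```

Normalizing to a `(str, int)` tuple also accepts a two-element list, which is what the address becomes if the metadata has passed through JSON.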

SIPVideoBackend

SIP-based video transport with full call signaling. Extends SIPVoiceBackend: when an incoming INVITE offers an m=video media section, the backend negotiates both audio and video codecs and creates parallel RTP sessions. INVITEs without m=video are handled as ordinary audio-only calls.

Quick start

from roomkit.video.backends.sip import SIPVideoBackend
from roomkit import VoiceChannel
from roomkit.voice.base import VoiceSession

backend = SIPVideoBackend(
    local_sip_addr=("0.0.0.0", 5060),
    local_rtp_ip="10.0.0.5",
    rtp_port_start=10000,
    supported_video_codecs=["H264", "VP8"],
)

# Video frame callback
def on_video(session, frame):
    print(f"Video: codec={frame.codec}, keyframe={frame.keyframe}, seq={frame.sequence}")

backend.on_video_received(on_video)

# Route incoming calls
async def on_call(session: VoiceSession):
    has_video = session.metadata.get("has_video", False)
    print(f"Call from {session.metadata['caller']}, video={has_video}")

backend.on_call(on_call)
await backend.start()

Install with:

pip install roomkit[sip]

Constructor parameters

All SIPVoiceBackend parameters plus:

Parameter	Type	Default	Description
`supported_video_codecs`	`list[str] | None`	`["H264"]`	Video codec names to accept in SDP negotiation.

How A/V calls work

PBX/SIP Trunk                    SIPVideoBackend
─────────────                    ───────────────
  INVITE ──────────────────────► receives call
  (SDP with m=audio + m=video)     │
                                   ├── Audio SDP negotiation (CallSession)
                                   ├── Video SDP negotiation (VideoCallSession)
                                   ├── Combined SDP answer (audio + video)
                                   │
  ◄──── 100 Trying                 │
  ◄──── 180 Ringing                │
  ◄──── 200 OK (combined SDP)     ├── on_call fires (has_video=True)
                                   │
  RTP audio ◄──────────────────►  audio pipeline
  RTP video ◄──────────────────►  on_video_received
                                   │
  BYE ─────────────────────────► cleanup (audio + video)

When the INVITE has no m=video, the call is handled as audio-only by the parent SIPVoiceBackend — no video session is created.
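The video half of the negotiation in the diagram can be approximated with a few lines of SDP parsing. This is a simplified sketch, not the parser aiosipua uses; `negotiate_video` is a hypothetical helper that only reads the `m=video` line and its `a=rtpmap` attributes.

```python
from __future__ import annotations


def negotiate_video(sdp: str, supported: list[str]) -> tuple[int, str] | None:
    """Return (payload_type, codec) for the first offered codec we support.

    Returns None when the offer has no usable m=video section or no
    codec in common, which corresponds to an audio-only call.
    """
    in_video = False
    offered: list[int] = []
    rtpmap: dict[int, str] = {}

    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # Format: m=<media> <port> <proto> <payload types...>
            parts = line.split()
            port = parts[1].split("/")[0] if len(parts) > 1 else "0"
            in_video = parts[0] == "m=video" and port != "0"  # port 0 = disabled
            if in_video:
                offered = [int(pt) for pt in parts[3:]]
        elif in_video and line.startswith("a=rtpmap:"):
            pt, _, encoding = line[len("a=rtpmap:"):].partition(" ")
            rtpmap[int(pt)] = encoding.split("/")[0].upper()

    wanted = {name.upper() for name in supported}
    for pt in offered:  # honor the offerer's preference order
        if rtpmap.get(pt) in wanted:
            return pt, rtpmap[pt]
    return None
```

A `None` result covers both cases described in this section: an offer without `m=video`, and an offer whose video codecs do not overlap with the supported list.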

Graceful fallback

If video SDP negotiation fails because the offer contains no codec listed in supported_video_codecs, the call proceeds as audio-only and the has_video metadata flag is False:

async def on_call(session: VoiceSession):
    if session.metadata.get("has_video"):
        video_session = backend.get_video_session(session.id)
        # wire up video processing
    else:
        # audio-only call
        pass

Receiving video

Register a callback to receive inbound video frames:

from roomkit.video.video_frame import VideoFrame

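def on_video(session, frame: VideoFrame):
    # timestamp_ms may be None, so format it defensively
    ts = f"{frame.timestamp_ms:.1f}ms" if frame.timestamp_ms is not None else "n/a"
    print(f"codec={frame.codec}, size={len(frame.data)}, "
          f"keyframe={frame.keyframe}, seq={frame.sequence}, ts={ts}")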

backend.on_video_received(on_video)

VideoFrame fields:

Field	Type	Description
`data`	`bytes`	Raw NAL unit data (H.264) or encoded frame.
`codec`	`str`	Codec identifier (e.g. `"h264"`).
`timestamp_ms`	`float | None`	Presentation timestamp in milliseconds.
`keyframe`	`bool`	Whether this is an IDR/keyframe.
`sequence`	`int`	Monotonically increasing frame counter per session.
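One common use of the `keyframe` and `sequence` fields is to hold frames back until a decoder can start cleanly. The sketch below uses a stand-in `Frame` dataclass that mirrors the documented fields; `KeyframeGate` is a hypothetical helper, and it assumes `sequence` increases by exactly 1 per frame so that a jump signals a lost frame. Check that assumption against the backend before relying on it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Frame:
    """Stand-in with the same fields as the documented VideoFrame."""

    data: bytes
    codec: str = "h264"
    timestamp_ms: float | None = None
    keyframe: bool = False
    sequence: int = 0


class KeyframeGate:
    """Reject frames until a keyframe arrives; resync after a sequence gap."""

    def __init__(self) -> None:
        self._last_seq: int | None = None
        self._synced = False

    def accept(self, frame: Frame) -> bool:
        gap = self._last_seq is not None and frame.sequence != self._last_seq + 1
        self._last_seq = frame.sequence
        if gap:
            self._synced = False  # a frame was lost; wait for the next keyframe
        if frame.keyframe:
            self._synced = True
        return self._synced
```

Inside an `on_video_received` callback, a frame would be forwarded to the decoder only when `accept()` returns `True`.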

Sending video

Send video frames to the remote party:

# Send raw bytes (single NAL unit)
await backend.send_video(video_session, nal_bytes)

# Send a stream of VideoChunks
from roomkit.video.base import VideoChunk

async def video_source():
    yield VideoChunk(data=nal_bytes, keyframe=True, timestamp_ms=0)
    yield VideoChunk(data=nal_bytes, keyframe=False, timestamp_ms=33)

await backend.send_video(video_session, video_source())
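For a real stream, the async iterator usually has to pace itself and compute timestamps. The following sketch shows one way to do that. `Chunk` is a stand-in with the same three fields used for `VideoChunk` above, `paced_chunks` is a hypothetical helper, and it makes the simplifying assumption that each frame is a single NAL unit.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Iterable


@dataclass
class Chunk:
    """Stand-in for VideoChunk (data, keyframe, timestamp_ms)."""

    data: bytes
    keyframe: bool
    timestamp_ms: float


async def paced_chunks(nals: Iterable[bytes], fps: float = 30.0) -> AsyncIterator[Chunk]:
    """Yield one chunk per NAL unit at roughly the given frame rate."""
    interval = 1.0 / fps
    for index, nal in enumerate(nals):
        # H.264 NAL unit type 5 is an IDR slice, i.e. a keyframe.
        is_idr = bool(nal) and (nal[0] & 0x1F) == 5
        yield Chunk(data=nal, keyframe=is_idr, timestamp_ms=index * 1000.0 / fps)
        await asyncio.sleep(interval)  # crude pacing; drifts under load
```

With the real `VideoChunk` substituted for `Chunk`, the generator could be handed to `send_video` in place of `video_source()`.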

Session queries

# Get video session by ID (same ID as voice session)
video_session = backend.get_video_session(session.id)

# List all video sessions in a room
sessions = backend.list_video_sessions("room-1")

Callbacks

Callback	Description
`on_video_received(cb)`	Inbound video frames (NAL units).
`on_video_session_ready(cb)`	Video session created and active.
`on_client_disconnected(cb)`	Video session ended (call teardown).

RTP vs SIP video backend

Feature	RTP video	SIP video
SIP signaling	Not included	Built-in (INVITE, BYE, SDP)
SDP negotiation	Manual codec config	Automatic codec selection
Session creation	Manual via `connect()`	Automatic on INVITE
Remote address	Must be configured	From SDP offer
Audio-only fallback	N/A (always creates video)	Automatic if no `m=video`
Dependencies	`aiortp`	`aiosipua[rtp]`
Use case	Direct RTP endpoints	PBX/trunk integration

Example

See examples/sip_video_call.py for a complete SIP A/V call handler with incoming call routing and video frame logging.