# Video Support Overview

RoomKit's video subsystem brings real-time video capture, AI-powered vision, recording, and avatar rendering into the same room-based architecture used for voice and text. Video flows through the same channels, hooks, and rooms as every other modality.


## Architecture at a Glance

```
VideoBackend (capture/transport)
VideoChannel (session lifecycle, hooks)
VideoPipeline (optional processing)
    ├─ [Decoder] → [Resizer] → [Transforms] → [Filters]
    ├─ VideoBridge (forward frames to other sessions)
    ├─ VisionProvider (frame → text description)
    │       │
    │       ▼
    │   ON_VISION_RESULT hook / inject into AI context
    └─ Recorder (frames → MP4)
```

A VideoBackend captures or receives frames. A VideoChannel wires the backend into a room — managing sessions, firing hooks, and optionally running a VideoPipeline for decoding, resizing, filtering, and vision analysis. Results flow into the conversation via hooks or direct AI context injection.
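The flow above can be modeled in a few lines of plain Python. This is a toy sketch of the architecture, not RoomKit code — `ToyBackend`, `ToyChannel`, and the stage callables are illustrative names:

```python
from typing import Callable, List

Frame = bytes  # stand-in for a video frame


class ToyBackend:
    """Capture source: delivers frames to whoever subscribes."""

    def __init__(self):
        self._on_frame = None

    def subscribe(self, cb: Callable[[Frame], None]) -> None:
        self._on_frame = cb

    def emit(self, frame: Frame) -> None:
        if self._on_frame:
            self._on_frame(frame)


class ToyChannel:
    """Wires a backend into a room and runs an optional stage chain."""

    def __init__(self, backend: ToyBackend, stages: List[Callable[[Frame], Frame]]):
        self.received: List[Frame] = []
        self.stages = stages
        backend.subscribe(self._handle)

    def _handle(self, frame: Frame) -> None:
        for stage in self.stages:       # decoder → resizer → transforms → filters
            frame = stage(frame)
        self.received.append(frame)     # taps / vision / recorder would hook in here


backend = ToyBackend()
channel = ToyChannel(backend, stages=[lambda f: f.upper()])
backend.emit(b"frame-1")
print(channel.received)  # [b'FRAME-1']
```

The real classes add session management and hook dispatch, but the shape is the same: the backend only produces frames; everything else is the channel's job.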


## Video Backends

Backends are pure transports — they capture or receive frames and deliver them via callbacks.

| Backend | Source | Install | Guide |
|---|---|---|---|
| LocalVideoBackend | Webcam (OpenCV) | roomkit[local-video] | Video & Vision |
| ScreenCaptureBackend | Screen / monitor / region | roomkit[screen-capture] | Screen Capture |
| RTPVideoBackend | Direct RTP endpoints | roomkit[rtp] | Video Backends |
| SIPVideoBackend | Full SIP signaling (INVITE/BYE/SDP) | roomkit[sip] | Video Backends |
| FastRTCVideoBackend | WebRTC via FastRTC | roomkit[fastrtc] | |
| WebSocketVideoBackend | Browser WebSocket | | |
| MockVideoBackend | Simulated frames | Built-in | |

RTP and SIP backends extend their voice counterparts — a single backend handles both audio and video for a call, with audio-only fallback when the remote party doesn't offer video.
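Conceptually, that fallback is just the answer side of media negotiation: accept the intersection of what the remote party offers and what you support. A toy sketch, not RoomKit's API:

```python
def answer_media(offered: set, local: frozenset = frozenset({"audio", "video"})) -> set:
    """Accept the intersection of offered and locally supported media."""
    return offered & local


# Remote offers audio only → the call proceeds audio-only:
print(answer_media({"audio"}))                    # {'audio'}
# Remote offers both → full A/V call:
print(sorted(answer_media({"audio", "video"})))   # ['audio', 'video']
```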

Screen capture supports multi-monitor selection, region cropping, resolution downscaling (saves vision API tokens), and diff-based frame skipping for static screens.
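The diff-based skipping amounts to comparing each frame against the previous one and dropping it when nothing changed. An illustrative sketch using an exact hash — not RoomKit's actual implementation, which may well use a tolerance-based diff instead:

```python
import hashlib


class FrameDiffGate:
    """Pass a frame through only when it differs from the previous one."""

    def __init__(self):
        self._last_digest = None

    def should_emit(self, frame: bytes) -> bool:
        digest = hashlib.sha256(frame).hexdigest()
        if digest == self._last_digest:
            return False          # static screen: skip, saving vision API tokens
        self._last_digest = digest
        return True


gate = FrameDiffGate()
frames = [b"desktop", b"desktop", b"desktop+cursor"]
emitted = [f for f in frames if gate.should_emit(f)]
print(emitted)  # [b'desktop', b'desktop+cursor']
```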


## Vision Providers

Vision providers analyze video frames and return text descriptions, OCR results, and face detections.

| Provider | API | Install |
|---|---|---|
| GeminiVisionProvider | Google Gemini | roomkit[gemini] |
| OpenAIVisionProvider | OpenAI / Ollama / vLLM | roomkit[openai] |
| MockVisionProvider | Testing | Built-in |

Vision results are delivered through the ON_VISION_RESULT hook and can be automatically injected into AI conversation context:

```python
# Continuous: wire vision results into an AI channel
setup_video_vision(kit, room_id="room", ai_channel_id="ai")

# Continuous: wire vision results into a realtime voice session
setup_realtime_vision(kit, room_id="room", voice_channel_id="voice")
```

For on-demand analysis (agent tools instead of continuous sampling), see Screen & Vision Tools.


## Agent Vision Tools

AI agents can capture and analyze frames on demand — no continuous streaming required:

| Tool | What it does |
|---|---|
| DescribeScreenTool | Capture screenshot → vision AI description |
| DescribeWebcamTool | Capture webcam frame → vision AI description |
| ListWebcamsTool | Enumerate available cameras |
| ScreenInputTools | Vision-assisted click, type, scroll, press keys |

These are Tool protocol objects — pass them directly to any channel via tools=[...]:

```python
from roomkit import Agent
from roomkit.video.vision.screen_tool import DescribeScreenTool
from roomkit.video.vision.screen_input import ScreenInputTools

screen_tool = DescribeScreenTool(vision=gemini_vision, monitor=1)
input_tools = ScreenInputTools(vision=gemini_vision, monitor=1)

agent = Agent("assistant", provider=ai, tools=[screen_tool, *input_tools.tools])
```

See Screen & Vision Tools for the full guide.


## Video Pipeline

Optional processing chain applied to inbound frames:

```
[Decoder] → [Resizer] → [Transforms] → [Filters] → taps / vision / recorder
```

| Stage | Purpose | Implementations |
|---|---|---|
| Decoder | Compressed → raw frames | PyAVVideoDecoder |
| Resizer | Resolution scaling | PyAVVideoResizer |
| Transform | Visual effects | Grayscale, blur, and more |
| Filter | Annotation / detection | WatermarkFilter, YOLOFilter, CensorFilter |
| Encoder | Raw → compressed (outbound) | PyAVVideoEncoder (H.264) |

```python
from roomkit.video.pipeline.config import VideoPipelineConfig
from roomkit.video.pipeline.filter.watermark import WatermarkFilter

video = VideoChannel(
    "video", backend=backend,
    pipeline=VideoPipelineConfig(
        filters=[WatermarkFilter("RoomKit | {timestamp}", position="bottom-right")],
    ),
)
```

## Recording

Three recording layers, from per-session to room-wide:

| Layer | Scope | Output | Guide |
|---|---|---|---|
| Audio pipeline recorder | Per voice session | WAV | WAV Recorder |
| Video pipeline recorder | Per video session | MP4 | PyAV Video Recorder |
| Room media recorder | All channels in a room | Single A/V MP4 | Room Media Recorder |

The room-level recorder muxes audio and video from multiple channels into a single file with A/V sync maintained via a shared monotonic clock:

```python
from roomkit.recorder import MediaRecordingConfig, RoomRecorderBinding
from roomkit.recorder.pyav import PyAVMediaRecorder

voice = VoiceChannel("voice", ...)
video = VideoChannel("video", ...)

room = await kit.create_room(
    room_id="my-room",
    recorders=[RoomRecorderBinding(
        recorder=PyAVMediaRecorder(),
        config=MediaRecordingConfig(storage="./recordings", video_codec="auto"),
    )],
)
```

video_codec="auto" selects NVIDIA NVENC when available and falls back to libx264.
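The shared-clock idea behind A/V sync can be sketched as follows: every track stamps its samples against one monotonic reference, so the muxer can interleave packets by presentation time even though audio and video streams tick at different rates. Names here are illustrative, not RoomKit APIs:

```python
import time


class SharedClock:
    """One monotonic time base shared by every track in the room."""

    def __init__(self):
        self._t0 = time.monotonic()

    def now(self) -> float:
        return time.monotonic() - self._t0   # seconds since recording start


def to_pts(t: float, time_base: tuple) -> int:
    """Convert a shared-clock timestamp to an integer PTS in a stream's time base."""
    num, den = time_base
    return round(t * den / num)


# Audio and video use different tick rates but the same wall time,
# so samples captured at t = 1.5 s line up in the muxed file:
print(to_pts(1.5, (1, 48000)))  # 72000 audio ticks
print(to_pts(1.5, (1, 90000)))  # 135000 video ticks
```

Per-track clocks would drift apart over a long recording; a single monotonic origin keeps every stream on the same timeline.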


## Avatars

Photorealistic talking-head avatars with lip-synced video output.

| Provider | Type | Guide |
|---|---|---|
| AnamRealtimeProvider | Cloud (WebRTC) | Anam AI Avatar |
| MuseTalkAvatarProvider | Local (GPU) | |
| WebSocketAvatarProvider | Custom WebSocket | |

RealtimeAVBridge wires any voice/video backend to an avatar provider, handling audio resampling, H.264 encoding, the video pipeline, and session lifecycle:

```python
from roomkit.providers.anam.config import AnamConfig
from roomkit.providers.anam.realtime import AnamRealtimeProvider
from roomkit.voice.realtime.bridge import RealtimeAVBridge

bridge = RealtimeAVBridge(
    AnamRealtimeProvider(AnamConfig(api_key="...", avatar_id="...")),
    sip_backend,
)
```

For room-based integration with hooks, use RealtimeAudioVideoChannel instead.
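The audio-resampling step the bridge performs can be illustrated with a naive nearest-sample converter. Real implementations use a proper polyphase resampler (e.g. via PyAV); this is illustration only:

```python
def resample_nearest(samples: list, src_rate: int, dst_rate: int) -> list:
    """Naive nearest-sample rate conversion (illustration, not production DSP)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    return [
        samples[min(int(i * src_rate / dst_rate), len(samples) - 1)]
        for i in range(n_out)
    ]


# Downsample a 48 kHz buffer to 16 kHz: keep every third sample.
print(resample_nearest([0, 1, 2, 3, 4, 5], 48000, 16000))  # [0, 3]
```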


## Hook Triggers

| Hook | Type | When |
|---|---|---|
| ON_VIDEO_SESSION_STARTED | Async | Video session connected |
| ON_VIDEO_SESSION_ENDED | Async | Video session disconnected |
| ON_VIDEO_TRACK_ADDED | Async | New video track in session |
| ON_VIDEO_TRACK_REMOVED | Async | Video track removed |
| ON_VISION_RESULT | Async | Vision analysis completed |
| ON_SCREEN_SHARE_STARTED | Async | Screen share began |
| ON_SCREEN_SHARE_STOPPED | Async | Screen share ended |
| BEFORE_BRIDGE_VIDEO | Sync | Video frame about to be forwarded via bridge |
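The Sync/Async distinction matters: async hooks are notifications that must not block the media path, while a sync hook like BEFORE_BRIDGE_VIDEO runs inline and completes before the frame is forwarded. A toy dispatcher showing the difference — illustrative only, not RoomKit internals, and it assumes sync hooks may transform the frame, which this table does not specify:

```python
import asyncio


class ToyHookBus:
    """Sync hooks run inline before the action; async hooks are merely scheduled."""

    def __init__(self):
        self.sync_hooks = []    # e.g. BEFORE_BRIDGE_VIDEO
        self.async_hooks = []   # e.g. ON_VIDEO_SESSION_STARTED

    async def bridge_frame(self, frame: bytes) -> bytes:
        for hook in self.sync_hooks:
            frame = hook(frame)                  # completes before forwarding
        for hook in self.async_hooks:
            asyncio.create_task(hook(frame))     # does not block the media path
        return frame                             # only now is the frame forwarded


async def demo():
    bus = ToyHookBus()
    notifications = []
    bus.sync_hooks.append(lambda f: f + b"|stamped")

    async def on_event(f):
        notifications.append(f)

    bus.async_hooks.append(on_event)
    out = await bus.bridge_frame(b"frame")
    await asyncio.sleep(0)      # let the scheduled async hooks run
    return out, notifications


out, notifications = asyncio.run(demo())
print(out)  # b'frame|stamped'
```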

## Quick Start by Use Case

| I want to... | Components | Guide |
|---|---|---|
| Analyze webcam frames with AI | LocalVideoBackend + GeminiVisionProvider | Video & Vision |
| Build a screen assistant | ScreenCaptureBackend + ScreenInputTools | Screen & Vision Tools |
| Add video to SIP calls | SIPVideoBackend | Video Backends |
| Record video sessions | Any backend + PyAVVideoRecorder | PyAV Video Recorder |
| Record full A/V rooms | Voice + video channels, PyAVMediaRecorder | Room Media Recorder |
| Show an AI avatar | SIPVideoBackend + AnamRealtimeProvider | Anam AI Avatar |
| Bridge two video calls | SIPVideoBackend + VideoBridge | Video Bridging |
| Stream via WebRTC | FastRTCVideoBackend | |

## Examples

| Example | What it demonstrates |
|---|---|
| webcam_vision.py | Webcam + Gemini vision analysis |
| webcam_assistant.py | Chat agent with on-demand webcam tool |
| webcam_recording.py | OpenCV video recording |
| webcam_censor.py | YOLO + censor filter pipeline |
| screen_describe.py | Screen capture + vision description |
| screen_assistant_ia.py | Voice + screen AI assistant |
| screen_agent_orchestrated.py | Multi-agent screen automation |
| sip_video_bridge.py | Bridge two SIP A/V calls (audio + video forwarding) |
| sip_video_call.py | Full SIP audio+video call |
| rtp_video_call.py | Direct RTP video |
| sip_anam_avatar.py | SIP → Anam avatar bridge |
| pyav_video_recorder.py | H.264/NVENC recording |
| room_media_recorder.py | Room-level A/V muxing |
| webrtc_video.py | WebRTC video backend |
| websocket_video.py | WebSocket video transport |