Video Support Overview¶
RoomKit's video subsystem brings real-time video capture, AI-powered vision, recording, and avatar rendering into the same room-based architecture used for voice and text. Video flows through the same channels, hooks, and rooms as every other modality.
Architecture at a Glance¶
VideoBackend (capture/transport)
│
▼
VideoChannel (session lifecycle, hooks)
│
▼
VideoPipeline (optional processing)
│
├─ [Decoder] → [Resizer] → [Transforms] → [Filters]
│
├─ VideoBridge (forward frames to other sessions)
│
├─ VisionProvider (frame → text description)
│ │
│ ▼
│ ON_VISION_RESULT hook / inject into AI context
│
└─ Recorder (frames → MP4)
A VideoBackend captures or receives frames. A VideoChannel wires the backend into a room — managing sessions, firing hooks, and optionally running a VideoPipeline for decoding, resizing, filtering, and vision analysis. Results flow into the conversation via hooks or direct AI context injection.
Video Backends¶
Backends are pure transports — they capture or receive frames and deliver them via callbacks.
| Backend | Source | Install | Guide |
|---|---|---|---|
LocalVideoBackend |
Webcam (OpenCV) | roomkit[local-video] |
Video & Vision |
ScreenCaptureBackend |
Screen / monitor / region | roomkit[screen-capture] |
Screen Capture |
RTPVideoBackend |
Direct RTP endpoints | roomkit[rtp] |
Video Backends |
SIPVideoBackend |
Full SIP signaling (INVITE/BYE/SDP) | roomkit[sip] |
Video Backends |
FastRTCVideoBackend |
WebRTC via FastRTC | roomkit[fastrtc] |
— |
WebSocketVideoBackend |
Browser WebSocket | — | — |
MockVideoBackend |
Simulated frames | Built-in | — |
RTP and SIP backends extend their voice counterparts — a single backend handles both audio and video for a call, with audio-only fallback when the remote party doesn't offer video.
Screen capture supports multi-monitor selection, region cropping, resolution downscaling (saves vision API tokens), and diff-based frame skipping for static screens.
Vision Providers¶
Vision providers analyze video frames and return text descriptions, OCR results, and face detections.
| Provider | API | Install |
|---|---|---|
GeminiVisionProvider |
Google Gemini | roomkit[gemini] |
OpenAIVisionProvider |
OpenAI / Ollama / vLLM | roomkit[openai] |
MockVisionProvider |
Testing | Built-in |
Vision results are delivered through the ON_VISION_RESULT hook and can be automatically injected into AI conversation context:
# Continuous: wire vision results into an AI channel
setup_video_vision(kit, room_id="room", ai_channel_id="ai")
# Continuous: wire vision results into a realtime voice session
setup_realtime_vision(kit, room_id="room", voice_channel_id="voice")
For on-demand analysis (agent tools instead of continuous sampling), see Screen & Vision Tools.
Agent Vision Tools¶
AI agents can capture and analyze frames on demand — no continuous streaming required:
| Tool | What it does |
|---|---|
DescribeScreenTool |
Capture screenshot → vision AI description |
DescribeWebcamTool |
Capture webcam frame → vision AI description |
ListWebcamsTool |
Enumerate available cameras |
ScreenInputTools |
Vision-assisted click, type, scroll, press keys |
These are Tool protocol objects — pass them directly to any channel via tools=[...]:
from roomkit import Agent
from roomkit.video.vision.screen_tool import DescribeScreenTool
from roomkit.video.vision.screen_input import ScreenInputTools
screen_tool = DescribeScreenTool(vision=gemini_vision, monitor=1)
input_tools = ScreenInputTools(vision=gemini_vision, monitor=1)
agent = Agent("assistant", provider=ai, tools=[screen_tool, *input_tools.tools])
See Screen & Vision Tools for the full guide.
Video Pipeline¶
Optional processing chain applied to inbound frames:
| Stage | Purpose | Implementations |
|---|---|---|
| Decoder | Compressed → raw frames | PyAVVideoDecoder |
| Resizer | Resolution scaling | PyAVVideoResizer |
| Transform | Visual effects | Grayscale, blur, and more |
| Filter | Annotation / detection | WatermarkFilter, YOLOFilter, CensorFilter |
| Encoder | Raw → compressed (outbound) | PyAVVideoEncoder (H.264) |
from roomkit.video.pipeline.config import VideoPipelineConfig
from roomkit.video.pipeline.filter.watermark import WatermarkFilter
video = VideoChannel(
"video", backend=backend,
pipeline=VideoPipelineConfig(
filters=[WatermarkFilter("RoomKit | {timestamp}", position="bottom-right")],
),
)
Recording¶
Three recording layers, from per-session to room-wide:
| Layer | Scope | Output | Guide |
|---|---|---|---|
| Audio pipeline recorder | Per voice session | WAV | WAV Recorder |
| Video pipeline recorder | Per video session | MP4 | PyAV Video Recorder |
| Room media recorder | All channels in a room | Single A/V MP4 | Room Media Recorder |
The room-level recorder muxes audio and video from multiple channels into a single file with A/V sync maintained via a shared monotonic clock:
from roomkit.recorder import MediaRecordingConfig, RoomRecorderBinding
from roomkit.recorder.pyav import PyAVMediaRecorder
voice = VoiceChannel("voice", ...)
video = VideoChannel("video", ...)
room = await kit.create_room(
room_id="my-room",
recorders=[RoomRecorderBinding(
recorder=PyAVMediaRecorder(),
config=MediaRecordingConfig(storage="./recordings", video_codec="auto"),
)],
)
codec="auto" uses NVIDIA NVENC when available, falls back to libx264.
Avatars¶
Photorealistic talking-head avatars with lip-synced video output.
| Provider | Type | Guide |
|---|---|---|
AnamRealtimeProvider |
Cloud (WebRTC) | Anam AI Avatar |
MuseTalkAvatarProvider |
Local (GPU) | — |
WebSocketAvatarProvider |
Custom WebSocket | — |
RealtimeAVBridge wires any voice/video backend to an avatar provider, handling audio resampling, H.264 encoding, video pipeline, and session lifecycle:
from roomkit.providers.anam.config import AnamConfig
from roomkit.providers.anam.realtime import AnamRealtimeProvider
from roomkit.voice.realtime.bridge import RealtimeAVBridge
bridge = RealtimeAVBridge(
AnamRealtimeProvider(AnamConfig(api_key="...", avatar_id="...")),
sip_backend,
)
For room-based integration with hooks, use RealtimeAudioVideoChannel instead.
Hook Triggers¶
| Hook | Type | When |
|---|---|---|
ON_VIDEO_SESSION_STARTED |
Async | Video session connected |
ON_VIDEO_SESSION_ENDED |
Async | Video session disconnected |
ON_VIDEO_TRACK_ADDED |
Async | New video track in session |
ON_VIDEO_TRACK_REMOVED |
Async | Video track removed |
ON_VISION_RESULT |
Async | Vision analysis completed |
ON_SCREEN_SHARE_STARTED |
Async | Screen share began |
ON_SCREEN_SHARE_STOPPED |
Async | Screen share ended |
BEFORE_BRIDGE_VIDEO |
Sync | Video frame about to be forwarded via bridge |
Quick Start by Use Case¶
| I want to... | Backend | + | Guide |
|---|---|---|---|
| Analyze webcam frames with AI | LocalVideoBackend |
GeminiVisionProvider |
Video & Vision |
| Build a screen assistant | ScreenCaptureBackend |
ScreenInputTools |
Screen & Vision Tools |
| Add video to SIP calls | SIPVideoBackend |
— | Video Backends |
| Record video sessions | Any backend | PyAVVideoRecorder |
PyAV Video Recorder |
| Record full A/V rooms | Voice + Video | PyAVMediaRecorder |
Room Media Recorder |
| Show an AI avatar | SIPVideoBackend |
AnamRealtimeProvider |
Anam AI Avatar |
| Bridge two video calls | SIPVideoBackend |
VideoBridge |
Video Bridging |
| Stream via WebRTC | FastRTCVideoBackend |
— | — |
Examples¶
| Example | What it demonstrates |
|---|---|
webcam_vision.py |
Webcam + Gemini vision analysis |
webcam_assistant.py |
Chat agent with on-demand webcam tool |
webcam_recording.py |
OpenCV video recording |
webcam_censor.py |
YOLO + censor filter pipeline |
screen_describe.py |
Screen capture + vision description |
screen_assistant_ia.py |
Voice + screen AI assistant |
screen_agent_orchestrated.py |
Multi-agent screen automation |
sip_video_bridge.py |
Bridge two SIP A/V calls (audio + video forwarding) |
sip_video_call.py |
Full SIP audio+video call |
rtp_video_call.py |
Direct RTP video |
sip_anam_avatar.py |
SIP → Anam avatar bridge |
pyav_video_recorder.py |
H.264/NVENC recording |
room_media_recorder.py |
Room-level A/V muxing |
webrtc_video.py |
WebRTC video backend |
websocket_video.py |
WebSocket video transport |