Video & Vision¶
RoomKit's video subsystem adds real-time video capture and AI-powered frame analysis to multi-channel rooms. A camera feed can be analyzed by vision AI (Gemini, Ollama, OpenAI) and the results injected into AI conversation context — enabling agents that can "see".
Architecture¶
```
Camera (LocalVideoBackend)
  → VideoChannel (session lifecycle, hooks, throttled sampling)
  → VisionProvider (frame → JPEG → AI API → VisionResult)
  → setup_video_vision() → AIChannel system prompt
  → AI responds with awareness of the visual scene
```
Installation¶
```bash
# Core (always needed)
pip install roomkit

# Local webcam capture
pip install roomkit[local-video]   # opencv-python-headless

# Vision providers (pick one)
pip install roomkit[gemini]        # Gemini 2.5 Flash (recommended)
pip install roomkit[openai]        # OpenAI GPT-4o / Ollama / vLLM
```
Quick Start¶
```python
import asyncio

from roomkit import RoomKit, VideoChannel, AIChannel, ChannelCategory
from roomkit.providers.ai.mock import MockAIProvider
from roomkit.video.ai_integration import setup_video_vision
from roomkit.video.backends.local import LocalVideoBackend
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider


async def main():
    kit = RoomKit()

    # Video: webcam + Gemini vision
    backend = LocalVideoBackend(device=0, fps=15)
    vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))
    video = VideoChannel("video", backend=backend, vision=vision)
    kit.register_channel(video)

    # AI agent
    ai = AIChannel("ai", provider=MockAIProvider(responses=["I see!"]))
    kit.register_channel(ai)

    # Room: wire video → AI
    await kit.create_room(room_id="demo")
    await kit.attach_channel("demo", "video")
    await kit.attach_channel("demo", "ai", category=ChannelCategory.INTELLIGENCE)
    setup_video_vision(kit, room_id="demo", ai_channel_id="ai")

    # Start capturing (previously connect_video(), now unified as join())
    session = await kit.join("demo", "video", participant_id="user-1")
    await backend.start_capture(session)

    await asyncio.sleep(30)  # Run for 30 seconds

    await backend.stop_capture(session)
    await kit.close()


asyncio.run(main())
```
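For a long-running agent you would typically replace the fixed `asyncio.sleep(30)` with a cancellation-friendly wait so cleanup still runs on Ctrl-C. A minimal, RoomKit-independent sketch (`run_until_cancelled` is a hypothetical helper, not part of RoomKit):

```python
import asyncio


async def run_until_cancelled(cleanup):
    """Wait forever; run async cleanup when the task is cancelled."""
    try:
        await asyncio.Event().wait()  # never set — blocks until cancellation
    finally:
        await cleanup()  # runs on cancellation (e.g. Ctrl-C) before exiting
```

In the quick start, `cleanup` would stop capture and close the kit.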
Components¶
VideoBackend¶
Transport abstraction for video sources. Three implementations:

| Backend | Use case | Install |
|---|---|---|
| `LocalVideoBackend` | Webcam capture for dev/testing | `roomkit[local-video]` |
| `ScreenCaptureBackend` | Screen/monitor capture | `roomkit[screen-capture]` |
| `MockVideoBackend` | Unit tests | Built-in |
```python
from roomkit.video.backends.local import LocalVideoBackend

backend = LocalVideoBackend(
    device=0,    # Camera index
    fps=15,      # Capture rate
    width=640,   # Resolution
    height=480,
)
```
VisionProvider¶
Analyzes video frames and returns structured descriptions. Three implementations:
| Provider | Backend | Install |
|---|---|---|
| `GeminiVisionProvider` | Google Gemini 2.5 Flash | `roomkit[gemini]` |
| `OpenAIVisionProvider` | OpenAI, Ollama, vLLM | `roomkit[openai]` |
| `MockVisionProvider` | Unit tests | Built-in |
Gemini (recommended for speed)¶
```python
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider

vision = GeminiVisionProvider(GeminiVisionConfig(
    api_key="AIza...",
    model="gemini-2.5-flash",
))
```
Ollama (local, free)¶
```python
from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider

vision = OpenAIVisionProvider(OpenAIVisionConfig(
    base_url="http://localhost:11434/v1",
    model="qwen3-vl:8b",
    api_key="ollama",
))
```
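Under the hood, providers speaking the OpenAI-compatible API send each frame as a base64 data URL inside a chat-completion request. A rough, RoomKit-independent sketch of the payload shape (`build_vision_request` is illustrative; the field names follow the OpenAI chat-completions format):

```python
import base64


def build_vision_request(jpeg_bytes: bytes, prompt: str, model: str) -> dict:
    """Build an OpenAI-compatible chat request carrying one JPEG frame."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }


req = build_vision_request(b"\xff\xd8\xff\xe0", "Describe the scene.", "qwen3-vl:8b")
```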
VideoChannel¶
Session-based channel that wires a VideoBackend to a room:
- Manages session lifecycle (connect, bind, disconnect)
- Fires `ON_VIDEO_SESSION_STARTED` / `ON_VIDEO_SESSION_ENDED` hooks
- Runs the `VisionProvider` at configurable intervals
- Emits `video_vision_result` framework events
```python
video = VideoChannel(
    "video-main",
    backend=backend,
    vision=vision,
    vision_interval_ms=3000,  # Analyze every 3 seconds
)
```
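The throttled sampling can be pictured as a monotonic-clock gate that lets at most one frame per interval through to the vision provider. A simplified sketch (`VisionThrottle` is illustrative, not a RoomKit class):

```python
import time


class VisionThrottle:
    """Pass at most one frame per interval to the vision provider."""

    def __init__(self, interval_ms: int, clock=time.monotonic):
        self.interval_s = interval_ms / 1000.0
        self.clock = clock  # injectable for testing
        self._last = float("-inf")

    def should_analyze(self) -> bool:
        now = self.clock()
        if now - self._last >= self.interval_s:
            self._last = now
            return True
        return False


# At fps=15 and vision_interval_ms=3000, roughly 1 in 45 frames is analyzed.
```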
AI Integration¶
setup_video_vision() wires vision results into an AIChannel's system prompt:
```python
from roomkit.video.ai_integration import setup_video_vision

setup_video_vision(
    kit,
    room_id="my-room",
    ai_channel_id="ai",
    context_prefix="You can see a live camera feed. Current view:",
)
```
The AI's system prompt is automatically updated with the latest vision description every time a frame is analyzed. The base system prompt is preserved.
Realtime Voice Integration¶
For RealtimeVoiceChannel (Gemini Live, OpenAI Realtime), use setup_realtime_vision() instead. It injects vision descriptions via inject_text(silent=True) on active voice sessions:
```python
from roomkit import setup_realtime_vision

setup_realtime_vision(
    kit,
    room_id="my-room",
    voice_channel_id="voice",
    context_prefix="You can see the screen. Current view:",
)
```
Key differences from setup_video_vision():
- Delivers via `inject_text(silent=True)` — adds context without triggering a response
- Dedup built in — unchanged descriptions are not re-injected
- Works with all active sessions in the room (multi-participant)
Video Recording¶
Two recorder implementations are available:
| Recorder | Codec | Compression | Install |
|---|---|---|---|
| `PyAVVideoRecorder` | H.264 / H.265 / NVENC | High (10-50x smaller) | `roomkit[video]` |
| `OpenCVVideoRecorder` | mp4v (MPEG-4 Part 2) | Low | `roomkit[local-video]` |
| `MockVideoRecorder` | None | In-memory | Built-in |
PyAV (recommended for production) — compressed H.264 MP4 via FFmpeg:
```python
from roomkit.video.pipeline.config import VideoPipelineConfig
from roomkit.video.recorder import VideoRecordingConfig
from roomkit.video.recorder.pyav import PyAVVideoRecorder

recorder = PyAVVideoRecorder()
config = VideoRecordingConfig(storage="./recordings", codec="auto")
video = VideoChannel(
    "video",
    backend=backend,
    pipeline=VideoPipelineConfig(recorder=recorder, recording_config=config),
)
```
OpenCV — quick path, larger files:
```python
from roomkit.video.pipeline.config import VideoPipelineConfig
from roomkit.video.recorder import VideoRecordingConfig
from roomkit.video.recorder.opencv import OpenCVVideoRecorder

recorder = OpenCVVideoRecorder()
config = VideoRecordingConfig(storage="./recordings")
video = VideoChannel(
    "video",
    backend=backend,
    pipeline=VideoPipelineConfig(recorder=recorder, recording_config=config),
)
```
Recording starts automatically when a session binds and stops when it unbinds. Every received frame is tapped to the recorder and the file is finalized on stop.
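The tap-and-finalize flow can be modeled as a recorder that accepts frames until stopped (a simplified model, not RoomKit's recorder interface):

```python
class TapRecorder:
    """Collect frames as they arrive; stop() finalizes the recording."""

    def __init__(self):
        self.frames: list[bytes] = []
        self.finalized = False

    def write(self, frame: bytes) -> None:
        if self.finalized:
            raise RuntimeError("recorder already finalized")
        self.frames.append(frame)  # every received frame is tapped in

    def stop(self) -> int:
        self.finalized = True  # in a real recorder: flush + close the file
        return len(self.frames)
```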
See the PyAV Video Recorder guide for codec selection, NVENC hardware encoding, and compression details.
```bash
# PyAV recorder (H.264)
uv run python examples/pyav_video_recorder.py
uv run python examples/pyav_video_recorder.py --codec h264_nvenc --duration 30

# OpenCV recorder (raw)
uv run python examples/webcam_recording.py
```
Hook Triggers¶
| Trigger | Fires when |
|---|---|
| `ON_VIDEO_SESSION_STARTED` | Video session becomes live |
| `ON_VIDEO_SESSION_ENDED` | Video session disconnects |
| `ON_VIDEO_TRACK_ADDED` | New video track added (future) |
| `ON_VIDEO_TRACK_REMOVED` | Video track removed (future) |
| `ON_SCREEN_SHARE_STARTED` | Screen share begins (future) |
| `ON_SCREEN_SHARE_STOPPED` | Screen share ends (future) |
```python
from roomkit import HookExecution, HookTrigger

@kit.hook(HookTrigger.ON_VIDEO_SESSION_STARTED, HookExecution.ASYNC)
async def on_video_start(event, ctx):
    print(f"Video session started: {event.session.id}")
```
Example¶
Run the webcam vision example:
```bash
# Mock mode (no API needed)
uv run python examples/webcam_vision.py

# Gemini (fast cloud API)
export GEMINI_API_KEY=AIza...
uv run python examples/webcam_vision.py --gemini

# Ollama (local)
uv run python examples/webcam_vision.py --ollama --model qwen3-vl:8b

# French output
uv run python examples/webcam_vision.py --gemini --lang fr
```
Use Cases¶
| Scenario | Configuration |
|---|---|
| Security camera | VideoChannel + VisionProvider → text alerts |
| SIP video interphone | VideoChannel + VoiceChannel → AI sees and talks |
| Quality inspection | VideoChannel + VisionProvider → defect detection |
| Accessibility | VideoChannel → AI describes visual scene for blind users |
| Screen assistant | ScreenCaptureBackend + VisionProvider → AI-guided software usage |