Screen Capture Backend

RoomKit's ScreenCaptureBackend captures frames from your screen (or a region of it) and delivers them as standard VideoFrame objects. Combined with a VisionProvider and AIChannel, it enables AI-powered screen assistants that can see, describe, and guide users through software.

Architecture

Screen (mss)
  → ScreenCaptureBackend (background thread, throttled FPS)
    → optional diff-based frame skipping
    → optional downscaling
  → VideoChannel (session lifecycle, hooks, throttled vision sampling)
  → VisionProvider (frame → JPEG → AI API → VisionResult with OCR)
  → setup_video_vision() → AIChannel system prompt
  → AI responds with awareness of what's on screen

Installation

# Screen capture (always needed)
pip install roomkit[screen-capture]    # mss

# Vision providers (pick one)
pip install roomkit[gemini]            # Gemini (recommended)
pip install roomkit[openai]            # OpenAI GPT-4o / Ollama / vLLM

Quick Start

import asyncio
from roomkit import RoomKit, VideoChannel, AIChannel, ChannelCategory
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider
from roomkit.providers.ai.mock import MockAIProvider
from roomkit.video.ai_integration import setup_video_vision
from roomkit.video.backends.screen import ScreenCaptureBackend

async def main():
    kit = RoomKit()

    # Screen capture at 2 FPS, half resolution
    backend = ScreenCaptureBackend(monitor=1, fps=2, scale=0.5)

    # Vision AI
    vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))

    # Video channel with 5-second analysis interval
    video = VideoChannel("video", backend=backend, vision=vision, vision_interval_ms=5000)
    kit.register_channel(video)

    # AI channel
    ai = AIChannel("ai", provider=MockAIProvider(responses=["I can see your screen!"]))
    kit.register_channel(ai)

    # Wire everything together
    await kit.create_room(room_id="screen-demo")
    await kit.attach_channel("screen-demo", "video")
    await kit.attach_channel("screen-demo", "ai", category=ChannelCategory.INTELLIGENCE)
    setup_video_vision(kit, room_id="screen-demo", ai_channel_id="ai")

    # Join the room and start capturing
    session = await kit.join("screen-demo", "video", participant_id="user-1")
    await backend.start_capture(session)

    await asyncio.sleep(60)
    await backend.stop_capture(session)
    await kit.close()

asyncio.run(main())

Constructor Parameters

ScreenCaptureBackend(
    monitor=1,
    region=None,
    fps=5,
    scale=1.0,
    diff_threshold=0.0,
)
Parameter Type Default Description
monitor int 1 Monitor index. 0 = all monitors combined, 1 = primary, 2+ = secondary.
region tuple[int, int, int, int] | None None Crop region as (left, top, width, height) in pixels. Overrides monitor.
fps int 5 Capture frame rate. Lower than webcam — screens change less often.
scale float 1.0 Downscale factor in (0.0, 1.0]. 0.5 = half resolution. Saves bandwidth and vision API tokens.
diff_threshold float 0.0 Skip frames where the fraction of changed pixels is below this value (0.0–1.0). 0.0 = disabled, 0.02 = skip when <2% changed.
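The parameters compose freely. As a sketch, a cropped, low-rate capture with downscaling and diff skipping might look like this (the specific numbers are illustrative, not recommendations):

```python
from roomkit.video.backends.screen import ScreenCaptureBackend

# Crop to a 1280x720 region at the top-left corner, capture at 3 FPS,
# halve the resolution, and drop frames where <2% of pixels changed.
backend = ScreenCaptureBackend(
    region=(0, 0, 1280, 720),
    fps=3,
    scale=0.5,
    diff_threshold=0.02,
)
```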

Features

Monitor Selection

mss enumerates all connected monitors. Index 0 is the union of all displays (virtual desktop), 1 is the primary, and higher indices are additional monitors.

# Primary monitor
backend = ScreenCaptureBackend(monitor=1)

# Secondary monitor
backend = ScreenCaptureBackend(monitor=2)

# All monitors as one image
backend = ScreenCaptureBackend(monitor=0)

Region Cropping

Capture a specific area of the screen instead of the full monitor:

# Capture a 800x600 region starting at (100, 200)
backend = ScreenCaptureBackend(region=(100, 200, 800, 600))

When region is set, the monitor parameter is ignored.
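If you want a region relative to a monitor's geometry rather than hard-coded pixels, you can derive the tuple yourself. The helper below is hypothetical (not part of RoomKit); with mss installed, the geometry dict could come from mss.mss().monitors[1]:

```python
def right_half(monitor_geometry):
    """Build a (left, top, width, height) region covering the right
    half of a monitor, given its geometry as a dict with left/top/
    width/height keys (the shape mss reports in its monitors list)."""
    half_width = monitor_geometry["width"] // 2
    return (
        monitor_geometry["left"] + half_width,  # start at the horizontal midpoint
        monitor_geometry["top"],
        half_width,
        monitor_geometry["height"],
    )

# Example: a 1920x1080 primary monitor at the origin
region = right_half({"left": 0, "top": 0, "width": 1920, "height": 1080})
# region == (960, 0, 960, 1080)
```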

Downscaling

Reduce frame resolution to save bandwidth and vision API tokens. Scaling uses numpy stride slicing (nearest-neighbor):

# Half resolution — a 1920x1080 screen becomes 960x540
backend = ScreenCaptureBackend(scale=0.5)

# Quarter resolution — good for vision analysis where detail isn't critical
backend = ScreenCaptureBackend(scale=0.25)

Tip

For vision analysis, scale=0.5 is usually sufficient. UI elements and text remain readable, and API costs are reduced ~4x.
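The stride-slicing idea can be illustrated directly in numpy. This standalone sketch shows the same nearest-neighbor mechanism, not RoomKit's actual code:

```python
import numpy as np

# A fake 1080p BGR frame
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

# scale=0.5 -> keep every 2nd row and column (nearest-neighbor),
# no interpolation and no pixel copies beyond the view/materialization
step = int(1 / 0.5)
small = frame[::step, ::step]

print(small.shape)  # (540, 960, 3)
```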

Diff-Based Frame Skipping

Screens are mostly static. Enable diff-based skipping to avoid sending identical frames:

# Skip frames where less than 2% of pixels changed
backend = ScreenCaptureBackend(fps=5, diff_threshold=0.02)

The diff check samples a sparse subset of pixels (every 300th byte), so the comparison touches well under 1% of the frame and stays cheap even at high resolutions. When the frame is too similar to the previous one, it's dropped before reaching the on_video_received callback.

diff_threshold Behavior
0.0 (default) All frames delivered
0.01 Skip if <1% changed (very sensitive)
0.02 Skip if <2% changed (recommended)
0.05 Skip if <5% changed (only major changes)
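The sparse diff can be sketched as follows. This is an illustration of the approach, not RoomKit's implementation:

```python
import numpy as np

def changed_fraction(prev, curr, stride=300):
    """Compare every `stride`-th byte of two equally-shaped frames and
    return the fraction of sampled bytes that differ."""
    a = prev.reshape(-1)[::stride]
    b = curr.reshape(-1)[::stride]
    return float(np.count_nonzero(a != b)) / a.size

prev = np.zeros((1080, 1920, 3), dtype=np.uint8)
curr = prev.copy()
curr[:11, :, :] = 255  # repaint ~1% of the rows

frac = changed_fraction(prev, curr)
should_skip = frac < 0.02  # with diff_threshold=0.02 this frame is dropped
```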

Vision Integration

Wire the screen capture to a VisionProvider for AI-powered screen understanding:

from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider

# Ollama with a vision model
vision = OpenAIVisionProvider(OpenAIVisionConfig(
    base_url="http://localhost:11434/v1",
    model="qwen3-vl:8b",
    api_key="ollama",
    prompt=(
        "Describe what software is shown on this screen. "
        "Include visible UI elements, menus, buttons, and text."
    ),
))

video = VideoChannel(
    "video-screen",
    backend=backend,
    vision=vision,
    vision_interval_ms=5000,  # Analyze every 5 seconds
)

The VisionResult includes a text field with OCR output — useful for reading UI labels, error messages, and form fields from the screen.

Recording Screen Capture

Combine with a video recorder to save screen captures to MP4:

from roomkit.video.pipeline.config import VideoPipelineConfig
from roomkit.video.recorder.pyav import PyAVVideoRecorder
from roomkit.video.recorder import VideoRecordingConfig

recorder = PyAVVideoRecorder()
config = VideoRecordingConfig(storage="./recordings", codec="auto")

video = VideoChannel(
    "video-screen",
    backend=ScreenCaptureBackend(monitor=1, fps=10),
    pipeline=VideoPipelineConfig(recorder=recorder, recording_config=config),
)

See the PyAV Video Recorder guide for codec options and hardware encoding.

Hook Triggers

Trigger Fires when
ON_VIDEO_SESSION_STARTED Screen capture session becomes live
ON_VIDEO_SESSION_ENDED Screen capture session disconnects

@kit.hook(HookTrigger.ON_VIDEO_SESSION_STARTED, execution=HookExecution.ASYNC)
async def on_screen_started(event, ctx):
    print(f"Screen capture started: {event.session.id}")

Capability

ScreenCaptureBackend declares VideoCapability.SCREEN_SHARE. Use this to distinguish screen capture from webcam sessions:

from roomkit.video.base import VideoCapability

if VideoCapability.SCREEN_SHARE in backend.capabilities:
    print("This is a screen share session")

Backend Comparison

Backend Source Install Typical FPS Direction
ScreenCaptureBackend Screen / monitor / region roomkit[screen-capture] 1–10 Inbound only
LocalVideoBackend Webcam (OpenCV) roomkit[local-video] 15–30 Inbound only
RTPVideoBackend RTP network stream roomkit[rtp] 30 Bidirectional
SIPVideoBackend SIP A/V call roomkit[sip] 30 Bidirectional
MockVideoBackend Simulated frames Built-in N/A Testing

Example

Run the screen description example:

# Mock mode (no API needed)
uv run python examples/screen_describe.py

# Gemini (fast cloud API)
export GEMINI_API_KEY=AIza...
uv run python examples/screen_describe.py --gemini

# Ollama (local)
uv run python examples/screen_describe.py --ollama --model qwen3-vl:8b

# Secondary monitor, quarter resolution
uv run python examples/screen_describe.py --gemini --monitor 2 --scale 0.25

# High frequency with diff skipping
uv run python examples/screen_describe.py --gemini --fps 10 --diff 0.02

Use Cases

Scenario Configuration
AI screen assistant ScreenCapture + VisionProvider + AIChannel → guided software usage
Screen recording ScreenCapture + PyAVVideoRecorder → MP4 screencasts
Accessibility ScreenCapture + VisionProvider → AI describes screen for visually impaired users
Remote support ScreenCapture + VisionProvider + VoiceChannel → AI sees screen and talks to user
Training & onboarding ScreenCapture + VisionProvider → step-by-step software tutorials
Quality assurance ScreenCapture + VisionProvider → automated UI state verification