Screen & Vision Tools

RoomKit provides ready-to-use AI agent tools that let voice or chat agents see the user's screen and webcam on demand. Instead of continuous video streaming, these tools capture a single frame when the agent needs to look — ideal for document scanning, visual Q&A, and screen assistance.

Architecture

Agent decides to look
  → describe_screen / describe_webcam tool call
    → capture_screen_frame() or capture_webcam_frame()
      → VisionProvider (frame → JPEG → AI API → description)
    → result returned to agent as tool output
  → Agent answers user's question with visual context
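The flow above can be sketched with stand-in components. Note that `Frame`, `capture_frame`, and `MockVision` here are illustrative placeholders, not RoomKit APIs:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    width: int
    height: int
    data: bytes

def capture_frame() -> Frame:
    # Stand-in for capture_screen_frame() / capture_webcam_frame():
    # grabs exactly one frame, no continuous streaming.
    return Frame(2, 2, b"\x00" * 12)

class MockVision:
    def analyze(self, frame: Frame, query: str) -> str:
        # Stand-in for a VisionProvider (frame -> JPEG -> AI API -> text).
        return f"[{frame.width}x{frame.height} frame] answer to: {query}"

def describe_screen(query: str) -> str:
    # What a describe_* tool handler does per call: capture one frame,
    # hand it to the vision provider, return the text as tool output.
    frame = capture_frame()
    return MockVision().analyze(frame, query)

print(describe_screen("What app is focused?"))
# → [2x2 frame] answer to: What app is focused?
```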

Installation

# Screen capture tools
pip install roomkit[screen-capture]    # mss (screenshot)

# Webcam capture tools
pip install roomkit[local-video]       # opencv-python-headless

# Vision providers (pick one)
pip install roomkit[openai]            # OpenAI GPT-4o
pip install roomkit[gemini]            # Gemini Vision

Available Tools

| Tool | Purpose | Dependency |
| --- | --- | --- |
| DescribeScreenTool | Capture screenshot and analyze with vision AI | roomkit[screen-capture] |
| DescribeWebcamTool | Capture webcam frame and analyze with vision AI | roomkit[local-video] |
| ListWebcamsTool | Discover available camera devices | roomkit[local-video] |
| ScreenInputTools | Keyboard/mouse control (click, type, scroll) | roomkit[screen-input] |

Quick Start — Webcam Tool

A chat agent that can look through the user's webcam when asked. Just pass tool objects directly — definitions and handlers are extracted automatically:

import asyncio
import os

from roomkit import ChannelCategory, InboundMessage, RoomKit, TextContent, WebSocketChannel
from roomkit.providers.anthropic.ai import AnthropicAIProvider
from roomkit.providers.anthropic.config import AnthropicConfig
from roomkit.video.vision.webcam_tool import DescribeWebcamTool, ListWebcamsTool
from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider
from roomkit.channels.ai import AIChannel

async def main():
    vision = OpenAIVisionProvider(OpenAIVisionConfig(
        api_key=os.environ["OPENAI_API_KEY"],
        base_url="https://api.openai.com/v1",
        model="gpt-4o",
    ))

    kit = RoomKit()
    ws = WebSocketChannel("ws-user")
    ai = AIChannel(
        "ai-assistant",
        provider=AnthropicAIProvider(
            AnthropicConfig(api_key=os.environ["ANTHROPIC_API_KEY"]),
        ),
        system_prompt="You can see through the user's webcam. Use describe_webcam when asked to look.",
        tools=[DescribeWebcamTool(vision, device=0), ListWebcamsTool()],
    )
    kit.register_channel(ws)
    kit.register_channel(ai)

    await kit.create_room(room_id="demo")
    await kit.attach_channel("demo", "ws-user")
    await kit.attach_channel("demo", "ai-assistant", category=ChannelCategory.INTELLIGENCE)

    # Send a message — the AI will call describe_webcam
    await kit.process_inbound(InboundMessage(
        channel_id="ws-user", sender_id="user",
        content=TextContent(body="What do you see on my webcam?"),
    ))

asyncio.run(main())

DescribeScreenTool

Captures a full-resolution screenshot and analyzes it with a VisionProvider.

from roomkit.video.vision.screen_tool import DescribeScreenTool
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider

vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))
screen_tool = DescribeScreenTool(vision, monitor=1)

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vision | VisionProvider | required | Vision provider for frame analysis |
| monitor | int | 1 | Monitor index (1 = primary, 2+ = secondary) |

Tool Schema

The tool exposes one parameter to the agent:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | yes | Visual question about the screen content |
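As a concrete illustration, this schema could be expressed as a JSON Schema tool definition. The exact wire format depends on the agent provider, and the name and description strings below are assumptions, not RoomKit's actual output:

```python
import json

# Hypothetical tool definition for describe_screen, roughly what an
# agent provider would receive (exact shape varies by provider).
describe_screen_schema = {
    "name": "describe_screen",
    "description": "Capture a screenshot and answer a visual question about it.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Visual question about the screen content",
            },
        },
        "required": ["query"],
    },
}

print(json.dumps(describe_screen_schema, indent=2))
```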

Usage with RealtimeVoiceChannel

voice = RealtimeVoiceChannel(
    "voice",
    provider=realtime_provider,
    transport=backend,
    tools=[screen_tool],
)

Programmatic Usage

# Capture and analyze
description = await screen_tool.analyze("What browser tabs are open?")

# Low-level: capture only
from roomkit.video.vision.screen_tool import capture_screen_frame
frame = capture_screen_frame(monitor=1)  # returns VideoFrame or None

DescribeWebcamTool

Captures a single frame from the local webcam and analyzes it. The camera opens, discards a few warmup frames (for auto-exposure), grabs one frame, and releases immediately.

from roomkit.video.vision.webcam_tool import DescribeWebcamTool
from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider

vision = OpenAIVisionProvider(OpenAIVisionConfig(
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
    model="gpt-4o",
))
webcam_tool = DescribeWebcamTool(vision, device=0)

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vision | VisionProvider | required | Vision provider for frame analysis |
| device | int | 0 | Default camera device index |

Tool Schema

The tool exposes three parameters to the agent:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | yes | Visual question about the webcam image |
| device | integer | no | Camera device index override |
| save_path | string | no | File path to save the captured frame as JPEG |

Programmatic Usage

# Analyze with default device
description = await webcam_tool.analyze("Read the text on this document")

# Analyze with specific device and save
description = await webcam_tool.analyze(
    "What is the user holding?",
    device=1,
    save_path="/tmp/capture.jpg",
)

Warmup Frames

Webcams need a few frames for auto-exposure to settle. Without warmup, the first frame is often dark or over-exposed. capture_webcam_frame discards 5 frames by default:

from roomkit.video.vision.webcam_tool import capture_webcam_frame

# Default: 5 warmup frames
frame = capture_webcam_frame(device=0)

# Skip warmup (virtual cameras that don't need it)
frame = capture_webcam_frame(device=0, warmup_frames=0)

# More warmup for slow cameras
frame = capture_webcam_frame(device=0, warmup_frames=15)

ListWebcamsTool

Probes device indices and returns the available cameras with their resolutions.

from roomkit.video.vision.webcam_tool import ListWebcamsTool

list_tool = ListWebcamsTool(max_devices=10)

# Programmatic
print(list_tool.list())
# Found 2 camera(s):
#   Device 0: Camera 0 (AVFoundation) — 1920x1080
#   Device 1: Camera 1 (AVFoundation) — 1280x720

Tool Schema

No parameters — the agent calls it to discover cameras before choosing one.

Low-level Function

from roomkit.video.vision.webcam_tool import list_webcams, WebcamInfo

cameras: list[WebcamInfo] = list_webcams(max_devices=10)
for cam in cameras:
    print(f"Device {cam.device}: {cam.name} ({cam.width}x{cam.height})")

Saving Frames

Both DescribeWebcamTool and the low-level save_frame function save frames as JPEG:

from roomkit.video.vision.webcam_tool import capture_webcam_frame, save_frame

frame = capture_webcam_frame(device=0)
if frame:
    path = save_frame(frame, "/tmp/document_scan.jpg")
    print(f"Saved to {path}")
  • Parent directories are created automatically
  • JPEG encoding uses OpenCV (quality 85) with Pillow fallback
  • If the save fails (bad path, read-only filesystem), the error is returned in the tool result so the agent can inform the user
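The error-as-data behavior in the last bullet can be sketched as follows. `save_frame_sketch` is an illustrative stand-in, not RoomKit's `save_frame`, and the JPEG encoding step is omitted:

```python
from pathlib import Path

def save_frame_sketch(jpeg_bytes: bytes, path: str) -> dict:
    """Create parent directories, write the JPEG bytes, and return
    failures as data instead of raising, so the agent can relay the
    error to the user in its tool result."""
    try:
        p = Path(path)
        p.parent.mkdir(parents=True, exist_ok=True)  # auto-create dirs
        p.write_bytes(jpeg_bytes)
        return {"ok": True, "path": str(p)}
    except OSError as e:
        return {"ok": False, "error": str(e)}
```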

ScreenInputTools

For screen assistants that need to interact with the UI — click buttons, type text, scroll, and press keys:

from roomkit.video.vision.screen_input import ScreenInputTools

input_tools = ScreenInputTools(vision=vision, monitor=1)

Requires pip install roomkit[screen-input] (pyautogui).

Keyboard layout support

type_text uses clipboard paste instead of raw keystrokes, so it works with all keyboard layouts (AZERTY, QWERTZ, etc.). On macOS it uses pbcopy + Cmd+V, on Linux xclip + Ctrl+V, on Windows clip + Ctrl+V. Falls back to pyautogui.typewrite() if clipboard tools aren't available.
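The platform dispatch can be sketched like this. `clipboard_paste_command` and `type_via_clipboard` are illustrative helpers, not the ScreenInputTools API, and the actual paste keystroke is commented out because it requires a display:

```python
import platform
import subprocess

def clipboard_paste_command() -> tuple[list[str], str]:
    # Map the current OS to (clipboard copy command, paste modifier key),
    # matching the commands named above: pbcopy / xclip / clip.
    system = platform.system()
    if system == "Darwin":
        return ["pbcopy"], "command"
    if system == "Linux":
        return ["xclip", "-selection", "clipboard"], "ctrl"
    return ["clip"], "ctrl"  # Windows

def type_via_clipboard(text: str) -> None:
    # Copy the text to the system clipboard, then paste it, which is
    # layout-independent (AZERTY, QWERTZ, ...) unlike raw keystrokes.
    cmd, modifier = clipboard_paste_command()
    subprocess.run(cmd, input=text.encode(), check=True)
    # pyautogui.hotkey(modifier, "v")  # paste; needs a display session
```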

Available Actions

| Tool | Description |
| --- | --- |
| click_element | Vision-assisted click — describes a UI element, vision locates it, clicks the coordinates |
| type_text | Type a string with optional hotkey support |
| press_key | Press a single key or key combination |
| scroll | Scroll up/down by a given amount |

Usage with Voice Agent

voice = RealtimeVoiceChannel(
    "voice",
    provider=realtime_provider,
    transport=backend,
    tools=[screen_tool, input_tools],
)

Combining Screen + Webcam + Input

A full screen assistant with all vision tools:

from roomkit.video.vision.screen_tool import DescribeScreenTool
from roomkit.video.vision.webcam_tool import DescribeWebcamTool, ListWebcamsTool
from roomkit.video.vision.screen_input import ScreenInputTools
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider

vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))

screen = DescribeScreenTool(vision, monitor=1)
webcam = DescribeWebcamTool(vision, device=0)
lister = ListWebcamsTool()
inputs = ScreenInputTools(vision=vision, monitor=1)

# Just pass all tools — definitions and handlers are extracted automatically
all_tools = [screen, webcam, lister, inputs]

Vision Providers

All screen and webcam tools accept any VisionProvider. Pick one based on your needs:

| Provider | Install | Best for |
| --- | --- | --- |
| GeminiVisionProvider | roomkit[gemini] | Fast cloud analysis, good OCR |
| OpenAIVisionProvider | roomkit[openai] | GPT-4o quality, broad compatibility |
| OpenAIVisionProvider (Ollama) | roomkit[openai] | Local/private, no API key needed |
| MockVisionProvider | built-in | Testing without API calls |

# Gemini
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider
vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))

# OpenAI GPT-4o
from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider
vision = OpenAIVisionProvider(OpenAIVisionConfig(
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
    model="gpt-4o",
))

# Ollama (local)
vision = OpenAIVisionProvider(OpenAIVisionConfig(
    base_url="http://localhost:11434/v1",
    model="qwen3-vl:8b",
))

# Mock (testing)
from roomkit.video.vision.mock import MockVisionProvider
vision = MockVisionProvider(descriptions=["A document with text"])

Examples

# Webcam assistant with Anthropic chat + OpenAI vision
ANTHROPIC_API_KEY=sk-... OPENAI_API_KEY=sk-... \
    uv run python examples/webcam_assistant.py

# Screen assistant with Gemini voice + vision
GOOGLE_API_KEY=AIza... \
    uv run python examples/screen_assistant_ia.py

Use Cases

| Scenario | Tools | Description |
| --- | --- | --- |
| Document scanning | DescribeWebcamTool | User holds a document to the camera, agent reads and extracts text |
| Screen assistance | DescribeScreenTool + ScreenInputTools | Agent sees the screen, guides the user, clicks buttons |
| Object identification | DescribeWebcamTool | Agent identifies physical objects shown via webcam |
| Multi-camera monitoring | ListWebcamsTool + DescribeWebcamTool | Agent switches between cameras as needed |
| Visual Q&A | DescribeScreenTool or DescribeWebcamTool | User asks questions about what's visible |
| Accessibility | DescribeScreenTool | Agent describes screen content for visually impaired users |
| Remote support | DescribeScreenTool + ScreenInputTools | Agent sees and controls the user's desktop |