Screen & Vision Tools¶
RoomKit provides ready-to-use AI agent tools that let voice or chat agents see the user's screen and webcam on demand. Instead of continuous video streaming, these tools capture a single frame when the agent needs to look — ideal for document scanning, visual Q&A, and screen assistance.
Architecture¶
```text
Agent decides to look
  → describe_screen / describe_webcam tool call
  → capture_screen_frame() or capture_webcam_frame()
  → VisionProvider (frame → JPEG → AI API → description)
  → result returned to agent as tool output
  → Agent answers the user's question with visual context
```
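The flow above can be sketched as a minimal, self-contained tool whose handler performs one capture and one vision round-trip per call. All names here (`Frame`, `StubVisionProvider`, `DescribeTool`, `capture_frame`) are illustrative stand-ins, not RoomKit's actual classes:

```python
import asyncio
from dataclasses import dataclass


# Hypothetical stand-ins for illustration; RoomKit's real types differ.
@dataclass
class Frame:
    data: bytes


class StubVisionProvider:
    """Pretends to send a JPEG to a vision API and return a description."""

    async def describe(self, frame: Frame, query: str) -> str:
        return f"(analysis of {len(frame.data)} bytes for: {query})"


def capture_frame() -> Frame:
    """Stand-in for capture_screen_frame() / capture_webcam_frame()."""
    return Frame(data=b"\xff\xd8fake-jpeg")


class DescribeTool:
    """One tool call = one capture + one vision round-trip, no streaming."""

    definition = {
        "name": "describe_screen",
        "parameters": {"query": {"type": "string", "required": True}},
    }

    def __init__(self, vision: StubVisionProvider) -> None:
        self.vision = vision

    async def handle(self, query: str) -> str:
        frame = capture_frame()  # single frame, captured on demand
        return await self.vision.describe(frame, query)


result = asyncio.run(DescribeTool(StubVisionProvider()).handle("What is open?"))
print(result)
```

The key design point is that no video pipeline stays open: each tool call captures exactly one frame, so cost and privacy exposure are bounded by how often the agent decides to look.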
Installation¶
```bash
# Screen capture tools
pip install "roomkit[screen-capture]"   # mss (screenshot)

# Webcam capture tools
pip install "roomkit[local-video]"      # opencv-python-headless

# Vision providers (pick one)
pip install "roomkit[openai]"           # OpenAI GPT-4o
pip install "roomkit[gemini]"           # Gemini Vision
```
Available Tools¶
| Tool | Purpose | Dependency |
|---|---|---|
| `DescribeScreenTool` | Capture a screenshot and analyze it with vision AI | `roomkit[screen-capture]` |
| `DescribeWebcamTool` | Capture a webcam frame and analyze it with vision AI | `roomkit[local-video]` |
| `ListWebcamsTool` | Discover available camera devices | `roomkit[local-video]` |
| `ScreenInputTools` | Keyboard/mouse control (click, type, scroll) | `roomkit[screen-input]` |
Quick Start — Webcam Tool¶
A chat agent that can look through the user's webcam when asked. Just pass tool objects directly — definitions and handlers are extracted automatically:
```python
import asyncio
import os

from roomkit import ChannelCategory, InboundMessage, RoomKit, TextContent, WebSocketChannel
from roomkit.channels.ai import AIChannel
from roomkit.providers.anthropic.ai import AnthropicAIProvider
from roomkit.providers.anthropic.config import AnthropicConfig
from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider
from roomkit.video.vision.webcam_tool import DescribeWebcamTool, ListWebcamsTool


async def main():
    vision = OpenAIVisionProvider(OpenAIVisionConfig(
        api_key=os.environ["OPENAI_API_KEY"],
        base_url="https://api.openai.com/v1",
        model="gpt-4o",
    ))

    kit = RoomKit()
    ws = WebSocketChannel("ws-user")
    ai = AIChannel(
        "ai-assistant",
        provider=AnthropicAIProvider(
            AnthropicConfig(api_key=os.environ["ANTHROPIC_API_KEY"]),
        ),
        system_prompt="You can see through the user's webcam. Use describe_webcam when asked to look.",
        tools=[DescribeWebcamTool(vision, device=0), ListWebcamsTool()],
    )

    kit.register_channel(ws)
    kit.register_channel(ai)
    await kit.create_room(room_id="demo")
    await kit.attach_channel("demo", "ws-user")
    await kit.attach_channel("demo", "ai-assistant", category=ChannelCategory.INTELLIGENCE)

    # Send a message — the AI will call describe_webcam
    await kit.process_inbound(InboundMessage(
        channel_id="ws-user", sender_id="user",
        content=TextContent(body="What do you see on my webcam?"),
    ))


asyncio.run(main())
```
DescribeScreenTool¶
Captures a full-resolution screenshot and analyzes it with a `VisionProvider`.

```python
from roomkit.video.vision.screen_tool import DescribeScreenTool
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider

vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))
screen_tool = DescribeScreenTool(vision, monitor=1)
```
Constructor Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vision` | `VisionProvider` | required | Vision provider for frame analysis |
| `monitor` | `int` | `1` | Monitor index (1 = primary, 2+ = secondary) |
Tool Schema¶
The tool exposes one parameter to the agent:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | `string` | yes | Visual question about the screen content |
Usage with RealtimeVoiceChannel¶
```python
voice = RealtimeVoiceChannel(
    "voice",
    provider=realtime_provider,
    transport=backend,
    tools=[screen_tool],
)
```
Programmatic Usage¶
```python
# Capture and analyze
description = await screen_tool.analyze("What browser tabs are open?")

# Low-level: capture only
from roomkit.video.vision.screen_tool import capture_screen_frame

frame = capture_screen_frame(monitor=1)  # returns VideoFrame or None
```
DescribeWebcamTool¶
Captures a single frame from the local webcam and analyzes it. The camera opens, discards a few warmup frames (for auto-exposure), grabs one frame, and releases immediately.
```python
from roomkit.video.vision.webcam_tool import DescribeWebcamTool
from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider

vision = OpenAIVisionProvider(OpenAIVisionConfig(
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
    model="gpt-4o",
))
webcam_tool = DescribeWebcamTool(vision, device=0)
```
Constructor Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vision` | `VisionProvider` | required | Vision provider for frame analysis |
| `device` | `int` | `0` | Default camera device index |
Tool Schema¶
The tool exposes three parameters to the agent:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | `string` | yes | Visual question about the webcam image |
| `device` | `integer` | no | Camera device index override |
| `save_path` | `string` | no | File path to save the captured frame as JPEG |
Programmatic Usage¶
```python
# Analyze with the default device
description = await webcam_tool.analyze("Read the text on this document")

# Analyze with a specific device and save the frame
description = await webcam_tool.analyze(
    "What is the user holding?",
    device=1,
    save_path="/tmp/capture.jpg",
)
```
Warmup Frames¶
Webcams need a few frames for auto-exposure to settle. Without warmup, the first frame is often dark or over-exposed. `capture_webcam_frame` discards 5 frames by default:

```python
from roomkit.video.vision.webcam_tool import capture_webcam_frame

# Default: 5 warmup frames
frame = capture_webcam_frame(device=0)

# Skip warmup (virtual cameras that don't need it)
frame = capture_webcam_frame(device=0, warmup_frames=0)

# More warmup for slow cameras
frame = capture_webcam_frame(device=0, warmup_frames=15)
```
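The open, warm up, grab, release cycle described above can be sketched generically. This is not the actual implementation; it works against any object with OpenCV-style `read()`/`release()` methods (the real `capture_webcam_frame` builds on OpenCV), which also makes it testable without camera hardware:

```python
def capture_one_frame(camera, warmup_frames: int = 5):
    """Grab a single frame after discarding warmup frames, then release.

    `camera` is any object with OpenCV-style read() -> (ok, frame)
    and release() methods.
    """
    try:
        for _ in range(warmup_frames):   # let auto-exposure settle
            camera.read()
        ok, frame = camera.read()        # the one frame we keep
        return frame if ok else None
    finally:
        camera.release()                 # never hold the device open


# Fake camera: the first frames are "dark", later ones "bright".
class FakeCamera:
    def __init__(self):
        self.count = 0
        self.released = False

    def read(self):
        self.count += 1
        return True, ("dark" if self.count <= 5 else "bright")

    def release(self):
        self.released = True


cam = FakeCamera()
print(capture_one_frame(cam))  # warmup frames 1-5 are discarded
```

The `try`/`finally` matters: releasing the device even on failure is what lets the tool be called repeatedly without locking the camera.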
ListWebcamsTool¶
Probes device indices and returns available cameras with their resolution.
```python
from roomkit.video.vision.webcam_tool import ListWebcamsTool

list_tool = ListWebcamsTool(max_devices=10)

# Programmatic
print(list_tool.list())
# Found 2 camera(s):
#   Device 0: Camera 0 (AVFoundation) — 1920x1080
#   Device 1: Camera 1 (AVFoundation) — 1280x720
```
Tool Schema¶
No parameters — the agent calls it to discover cameras before choosing one.
Low-level Function¶
```python
from roomkit.video.vision.webcam_tool import list_webcams, WebcamInfo

cameras: list[WebcamInfo] = list_webcams(max_devices=10)
for cam in cameras:
    print(f"Device {cam.device}: {cam.name} ({cam.width}x{cam.height})")
```
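Device discovery of this kind is essentially a probe loop over indices. A sketch with an injected `open_device` callable (hypothetical, so it runs without hardware; the `WebcamInfo` dataclass here is illustrative, not RoomKit's):

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WebcamInfo:
    device: int
    name: str
    width: int
    height: int


def probe_webcams(
    open_device: Callable[[int], Optional[tuple]],
    max_devices: int = 10,
) -> list:
    """Try each index; keep the ones that open and report a resolution."""
    found = []
    for idx in range(max_devices):
        res = open_device(idx)          # (width, height) or None
        if res is not None:
            found.append(WebcamInfo(idx, f"Camera {idx}", *res))
    return found


# Fake opener: only indices 0 and 1 exist on this machine.
def fake_open(idx: int):
    return {0: (1920, 1080), 1: (1280, 720)}.get(idx)


for cam in probe_webcams(fake_open):
    print(f"Device {cam.device}: {cam.name} ({cam.width}x{cam.height})")
```

Probing every index up to `max_devices` is why discovery can be slow on some systems: each miss may involve a failed device open.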
Saving Frames¶
Both `DescribeWebcamTool` (via its `save_path` parameter) and the low-level `save_frame` function write captured frames as JPEG:
```python
from roomkit.video.vision.webcam_tool import capture_webcam_frame, save_frame

frame = capture_webcam_frame(device=0)
if frame:
    path = save_frame(frame, "/tmp/document_scan.jpg")
    print(f"Saved to {path}")
```
- Parent directories are created automatically
- JPEG encoding uses OpenCV (quality 85) with Pillow fallback
- If the save fails (bad path, read-only filesystem), the error is returned in the tool result so the agent can inform the user
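The create-dirs, encoder-fallback, and report-errors behavior can be sketched generically. The encoder chain here is illustrative (zero-arg callables standing in for the OpenCV and Pillow encoders); the point is the shape of the logic, not the real implementation:

```python
import tempfile
from pathlib import Path


def save_jpeg(encoders, path: str) -> str:
    """Try encoders in order; return the saved path, or an error string
    the agent can relay to the user instead of raising."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create parent dirs
    last_error = None
    for encode in encoders:
        try:
            target.write_bytes(encode())              # encode, then write
            return str(target)
        except Exception as exc:                      # report, don't crash
            last_error = exc
    return f"error: could not save frame ({last_error})"


def broken_encoder() -> bytes:
    raise RuntimeError("cv2 not installed")


def pillow_like_encoder() -> bytes:
    return b"\xff\xd8fake-jpeg"


out = save_jpeg([broken_encoder, pillow_like_encoder],
                f"{tempfile.mkdtemp()}/scan/doc.jpg")
print(out)
```

Returning the error as a string rather than raising is deliberate: a tool result always reaches the agent, so the agent can tell the user "the save failed because ..." instead of the call silently dying.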
ScreenInputTools¶
For screen assistants that need to interact with the UI — click buttons, type text, scroll, and press keys:
```python
from roomkit.video.vision.screen_input import ScreenInputTools

input_tools = ScreenInputTools(vision=vision, monitor=1)
```

Requires `pip install "roomkit[screen-input]"` (pyautogui).
Keyboard layout support¶
type_text uses clipboard paste instead of raw keystrokes, so it works with all keyboard layouts (AZERTY, QWERTZ, etc.). On macOS it uses pbcopy + Cmd+V, on Linux xclip + Ctrl+V, on Windows clip + Ctrl+V. Falls back to pyautogui.typewrite() if clipboard tools aren't available.
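The per-platform dispatch amounts to choosing a copy command and a paste chord. A sketch of that selection logic (the commands mirror the ones listed above; the function name and return shape are illustrative):

```python
import sys


def clipboard_strategy(platform: str = sys.platform):
    """Return (copy_command, paste_hotkey) for the given platform,
    or None when the caller should fall back to pyautogui.typewrite().

    Mirrors the scheme described above: pbcopy + Cmd+V on macOS,
    xclip + Ctrl+V on Linux, clip + Ctrl+V on Windows.
    """
    if platform == "darwin":
        return (["pbcopy"], ("command", "v"))
    if platform.startswith("linux"):
        return (["xclip", "-selection", "clipboard"], ("ctrl", "v"))
    if platform.startswith("win"):
        return (["clip"], ("ctrl", "v"))
    return None


print(clipboard_strategy("darwin"))
```

Pasting instead of synthesizing keystrokes sidesteps layout entirely: the text travels as data through the clipboard, so a `z` stays a `z` whether the active layout is QWERTY or AZERTY.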
Available Actions¶
| Tool | Description |
|---|---|
| `click_element` | Vision-assisted click: the agent describes a UI element, vision locates it, the tool clicks the coordinates |
| `type_text` | Type a string with optional hotkey support |
| `press_key` | Press a single key or key combination |
| `scroll` | Scroll up/down by a given amount |
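The `click_element` flow reduces to: capture the screen, ask the vision provider for the element's coordinates, then click them. A sketch with injected `capture`/`locate`/`click` callables (all names here are illustrative; in practice `click` would be `pyautogui.click` and `locate` a vision-provider call):

```python
import asyncio


async def click_element(description: str, capture, locate, click) -> str:
    """Vision-assisted click: capture a frame, ask vision where the
    described element is, then click at those coordinates.

    capture() -> frame
    locate(frame, description) -> (x, y) or None
    click(x, y) performs the actual click
    """
    frame = capture()
    point = await locate(frame, description)
    if point is None:
        return f"could not find: {description}"   # agent can retry or rephrase
    click(*point)
    return f"clicked {description} at {point}"


# Stubs standing in for the screenshot backend, vision provider, and pyautogui.
clicks = []


async def fake_locate(frame, description):
    return (640, 360) if "OK" in description else None


result = asyncio.run(click_element(
    "the OK button",
    capture=lambda: b"fake-frame",
    locate=fake_locate,
    click=lambda x, y: clicks.append((x, y)),
))
print(result)
```

Note the failure path returns a message rather than raising: a "not found" result lets the agent rephrase the element description and try again.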
Usage with Voice Agent¶
```python
voice = RealtimeVoiceChannel(
    "voice",
    provider=realtime_provider,
    transport=backend,
    tools=[screen_tool, input_tools],
)
```
Combining Screen + Webcam + Input¶
A full screen assistant with all vision tools:
```python
from roomkit.video.vision.screen_tool import DescribeScreenTool
from roomkit.video.vision.webcam_tool import DescribeWebcamTool, ListWebcamsTool
from roomkit.video.vision.screen_input import ScreenInputTools
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider

vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))

screen = DescribeScreenTool(vision, monitor=1)
webcam = DescribeWebcamTool(vision, device=0)
lister = ListWebcamsTool()
inputs = ScreenInputTools(vision=vision, monitor=1)

# Just pass all tools — definitions and handlers are extracted automatically
all_tools = [screen, webcam, lister, inputs]
```
Vision Providers¶
All screen and webcam tools accept any VisionProvider. Pick one based on your needs:
| Provider | Install | Best for |
|---|---|---|
| `GeminiVisionProvider` | `roomkit[gemini]` | Fast cloud analysis, good OCR |
| `OpenAIVisionProvider` | `roomkit[openai]` | GPT-4o quality, broad compatibility |
| `OpenAIVisionProvider` (Ollama) | `roomkit[openai]` | Local/private, no API key needed |
| `MockVisionProvider` | built-in | Testing without API calls |
```python
# Gemini
from roomkit.video.vision.gemini import GeminiVisionConfig, GeminiVisionProvider
vision = GeminiVisionProvider(GeminiVisionConfig(api_key="AIza..."))

# OpenAI GPT-4o
from roomkit.video.vision.openai import OpenAIVisionConfig, OpenAIVisionProvider
vision = OpenAIVisionProvider(OpenAIVisionConfig(
    api_key="sk-...",
    base_url="https://api.openai.com/v1",
    model="gpt-4o",
))

# Ollama (local)
vision = OpenAIVisionProvider(OpenAIVisionConfig(
    base_url="http://localhost:11434/v1",
    model="qwen3-vl:8b",
))

# Mock (testing)
from roomkit.video.vision.mock import MockVisionProvider
vision = MockVisionProvider(descriptions=["A document with text"])
```
Examples¶
```bash
# Webcam assistant with Anthropic chat + OpenAI vision
ANTHROPIC_API_KEY=sk-... OPENAI_API_KEY=sk-... \
  uv run python examples/webcam_assistant.py

# Screen assistant with Gemini voice + vision
GOOGLE_API_KEY=AIza... \
  uv run python examples/screen_assistant_ia.py
```
Use Cases¶
| Scenario | Tools | Description |
|---|---|---|
| Document scanning | `DescribeWebcamTool` | User holds a document to the camera; agent reads and extracts text |
| Screen assistance | `DescribeScreenTool` + `ScreenInputTools` | Agent sees the screen, guides the user, clicks buttons |
| Object identification | `DescribeWebcamTool` | Agent identifies physical objects shown via webcam |
| Multi-camera monitoring | `ListWebcamsTool` + `DescribeWebcamTool` | Agent switches between cameras as needed |
| Visual Q&A | `DescribeScreenTool` or `DescribeWebcamTool` | User asks questions about what's visible |
| Accessibility | `DescribeScreenTool` | Agent describes screen content for visually impaired users |
| Remote support | `DescribeScreenTool` + `ScreenInputTools` | Agent sees and controls the user's desktop |