Anam AI Avatar (Realtime Audio+Video)

RoomKit integrates with Anam AI to deliver photorealistic talking-head avatars in real time. Anam handles STT, LLM reasoning, TTS, and face animation in the cloud, streaming synchronized audio and video back over WebRTC.

Architecture

Two integration patterns are available:

RealtimeAudioVideoChannel (room-based)

For applications that need RoomKit's full room model (hooks, multi-channel, events):

User audio → Transport (WS) → RealtimeAudioVideoChannel → hooks/events
                                AnamRealtimeProvider
                                       ↕  (WebRTC)
                                   Anam Cloud
                                STT → LLM → TTS → Avatar
                              audio + video frames

RealtimeAVBridge (direct backend wiring)

For bridging any voice/video backend (SIP, RTP, local mic) directly to Anam — no room model needed:

SIP phone → RTP audio → RealtimeAVBridge → provider.send_audio() → Anam STT
                       AnamRealtimeProvider (WebRTC)
                         Anam Cloud
                    STT → LLM → TTS → Avatar
              audio → resample → SIP RTP → phone speaker
              video → pipeline → H.264 encode → SIP RTP → phone screen

The bridge handles audio resampling (48kHz → SIP codec rate), video encoding (raw RGB → H.264), session lifecycle, and graceful cleanup.
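The resampling step (48 kHz provider audio down to the negotiated SIP codec rate) can be illustrated with a minimal linear-interpolation resampler for mono int16 PCM. This is an illustrative sketch of the idea, not the bridge's actual implementation:

```python
import struct

def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Naive linear-interpolation resampler for mono int16 PCM."""
    samples = struct.unpack(f"<{len(data) // 2}h", data)
    if not samples:
        return b""
    ratio = src_rate / dst_rate
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return struct.pack(f"<{len(out)}h", *out)

# 20 ms of 48 kHz mono audio (960 samples) -> 160 samples at the 8 kHz G.711 rate
chunk = b"\x00\x00" * 960
downsampled = resample_pcm16(chunk, 48_000, 8_000)
print(len(downsampled) // 2)  # 160
```

A production bridge would use a filtered resampler (polyphase or similar) to avoid aliasing; linear interpolation is shown here only to make the rate arithmetic concrete.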


Installation

pip install roomkit[anam]
# With SIP transport + video encoding:
pip install roomkit[anam,sip,video]

Quick Start

from roomkit.providers.anam.config import AnamConfig
from roomkit.providers.anam.realtime import AnamRealtimeProvider
from roomkit.voice.realtime.bridge import RealtimeAVBridge
from roomkit.video.backends.sip import SIPVideoBackend
from roomkit.video.pipeline.encoder.pyav import PyAVVideoEncoder

sip = SIPVideoBackend(
    local_sip_addr=("0.0.0.0", 5060),
    local_rtp_ip="0.0.0.0",
    supported_video_codecs=["H264"],
)

provider = AnamRealtimeProvider(AnamConfig(
    api_key="your-api-key",
    avatar_id="your-avatar-id",
    voice_id="your-voice-id",
    llm_id="your-llm-id",
))

bridge = RealtimeAVBridge(
    provider, sip,
    encoder=PyAVVideoEncoder(fps=25),
)

await sip.start()
# Incoming SIP calls are now connected to Anam automatically.

AnamConfig

Two persona modes are supported:

Pre-defined Persona (from Anam Lab)

config = AnamConfig(
    api_key="your-api-key",
    persona_id="your-persona-id",
)

Ephemeral Persona (full control)

Configure individual components from lab.anam.ai:

config = AnamConfig(
    api_key="your-api-key",
    avatar_id="30fa96d0-...",         # from lab.anam.ai/avatars
    avatar_model="cara-3",            # optional model variant
    voice_id="6bfbe25a-...",          # from lab.anam.ai/voices
    llm_id="0934d97d-...",            # from lab.anam.ai/llms
    system_prompt="You are a concise assistant.",
    language_code="fr",               # BCP-47 language code
    enable_audio_passthrough=False,    # False = Anam manages TTS context
    timeout=30.0,
)
Parameter Description
api_key Anam API key (required)
persona_id Pre-defined persona from Anam Lab
avatar_id Avatar face model ID
avatar_model Video frame model (e.g. "cara-3")
voice_id TTS voice ID
llm_id LLM model ID
system_prompt System instructions for the LLM
language_code BCP-47 language code (default: "en")
enable_audio_passthrough Bypass Anam's TTS context tracking
timeout Connection timeout in seconds (default: 30.0)

RealtimeAVBridge

Generic bridge that wires any VoiceBackend to any RealtimeAudioVideoProvider. Handles:

  • Audio in: Backend → provider.send_audio() (user speech → AI)
  • Audio out: Provider → resample (48kHz → codec rate) → backend.send_audio() (AI speech → user)
  • Video out: Provider → pipeline → encode → backend RTP (avatar → user)
  • Session lifecycle: Auto-connect on SIP INVITE, auto-disconnect on BYE

from roomkit.voice.realtime.bridge import RealtimeAVBridge
from roomkit.video.pipeline.encoder.pyav import PyAVVideoEncoder
from roomkit.video.utils import make_text_frame

bridge = RealtimeAVBridge(
    provider,                            # Any RealtimeAudioVideoProvider
    backend,                             # Any VoiceBackend (SIP, RTP, local)
    encoder=PyAVVideoEncoder(fps=25),    # None = passthrough (taps only)
    video_pipeline=pipeline_config,      # Optional VideoPipelineConfig
    connecting_frame=make_text_frame(    # Shown during provider negotiation
        "Connecting to avatar...\nPlease wait",
    ),
    provider_sample_rate=48000,          # Anam outputs 48kHz
    on_transcription=my_handler,         # Optional callbacks
)

Connecting Placeholder

Anam's WebRTC negotiation takes 3-5 seconds. During this time, the caller sees a black screen by default. Use connecting_frame to show a placeholder instead:

from roomkit.video.utils import make_text_frame

bridge = RealtimeAVBridge(
    provider, sip,
    encoder=PyAVVideoEncoder(fps=25),
    connecting_frame=make_text_frame("Connecting to avatar...\nPlease wait"),
    connecting_fps=5,  # Low fps is fine for a static image
)

make_text_frame() generates a VideoFrame with centered multi-line text on a dark background. The placeholder is sent at connecting_fps (default 5) until the provider connects, then real avatar frames take over automatically.

You can customize the frame appearance:

make_text_frame(
    "Bienvenue\nConnexion en cours...",
    width=720, height=480,
    bg_color=(20, 20, 50),         # Dark blue background
    text_color=(255, 255, 255),    # White title
    sub_color=(180, 180, 200),     # Light grey subtitle
    font_scale=1.0,
)

Video Pipeline

Insert processing stages between provider frames and the encoder:

from roomkit.video.pipeline.config import VideoPipelineConfig
from roomkit.video.pipeline.filter.watermark import WatermarkFilter

bridge = RealtimeAVBridge(
    provider, sip,
    video_pipeline=VideoPipelineConfig(
        filters=[
            WatermarkFilter(
                "RoomKit | {timestamp}",
                position="bottom-left",
                color=(255, 255, 255),
                bg_color=(0, 0, 0),
            ),
        ],
    ),
    encoder=PyAVVideoEncoder(fps=25, bitrate=3_000_000),
)

The pipeline supports all standard stages: resizer, filters (watermark, YOLO, censor), transforms, and vision analysis.
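Conceptually, each pipeline stage is a per-frame transform over raw RGB24 bytes. The standalone function below (`darken_bottom_band` is a hypothetical name, not part of the RoomKit API) sketches the kind of work a filter such as WatermarkFilter performs before drawing its label:

```python
def darken_bottom_band(rgb: bytes, width: int, height: int, band: int = 40) -> bytes:
    """Halve the brightness of the bottom `band` rows of a raw RGB24 frame --
    roughly the region a watermark filter draws its label onto."""
    row_bytes = width * 3
    split = (height - band) * row_bytes
    keep = rgb[:split]
    dimmed = bytes(b // 2 for b in rgb[split:])
    return keep + dimmed

frame = b"\xff" * (640 * 480 * 3)      # solid white 640x480 frame
out = darken_bottom_band(frame, 640, 480)
print(out[:3], out[-3:])  # b'\xff\xff\xff' b'\x7f\x7f\x7f'
```

Real filters plug into VideoPipelineConfig as shown above; this sketch only illustrates the per-frame byte-level contract.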

Video Encoder

The encoder converts raw RGB frames to H.264 for SIP/RTP:

from roomkit.video.pipeline.encoder.pyav import PyAVVideoEncoder

# Default: 2.5 Mbps, veryfast preset
encoder = PyAVVideoEncoder(fps=25)

# Higher quality for LAN
encoder = PyAVVideoEncoder(fps=25, bitrate=3_000_000, preset="medium")

Without an encoder, video goes to taps only (passthrough mode for local display).

Video Taps

Register callbacks that receive every video frame:

bridge.add_video_tap(lambda session, frame: print(f"{frame.width}x{frame.height}"))
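A tap is just a callable taking `(session, frame)`, so a stateful object works too. A minimal sketch (assuming, as in the lambda above, that sessions expose an `id`) that measures per-session frame throughput:

```python
import time

class FpsTap:
    """Stateful video tap: counts frames and reports average fps per session."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}
        self.started: dict[str, float] = {}

    def __call__(self, session, frame) -> None:
        sid = getattr(session, "id", str(session))
        self.started.setdefault(sid, time.monotonic())
        self.counts[sid] = self.counts.get(sid, 0) + 1

    def fps(self, sid: str) -> float:
        elapsed = time.monotonic() - self.started[sid]
        return self.counts[sid] / elapsed if elapsed > 0 else 0.0

tap = FpsTap()
# bridge.add_video_tap(tap)   # wire it in exactly like the lambda above
```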

Manual Connect (local mic/webcam)

For backends that don't fire on_call automatically:

session = await bridge.connect(backend_session, participant_id="user-1")
# ...
await bridge.disconnect(session.id)

SIP Integration

Bridge SIP phone calls to Anam — the caller talks to a photorealistic AI avatar:

from roomkit.providers.anam.config import AnamConfig
from roomkit.providers.anam.realtime import AnamRealtimeProvider
from roomkit.video.pipeline.config import VideoPipelineConfig
from roomkit.video.backends.sip import SIPVideoBackend
from roomkit.video.pipeline.encoder.pyav import PyAVVideoEncoder
from roomkit.video.pipeline.filter.watermark import WatermarkFilter
from roomkit.video.utils import make_text_frame
from roomkit.voice.realtime.bridge import RealtimeAVBridge

sip = SIPVideoBackend(
    local_sip_addr=("0.0.0.0", 5060),
    local_rtp_ip="0.0.0.0",
    supported_video_codecs=["H264"],
)

provider = AnamRealtimeProvider(AnamConfig(
    api_key="...",
    avatar_id="...",
    voice_id="...",
    llm_id="...",
    language_code="fr",
    system_prompt="Tu es un avatar IA serviable.",
))

bridge = RealtimeAVBridge(
    provider, sip,
    video_pipeline=VideoPipelineConfig(
        filters=[WatermarkFilter("RoomKit | {timestamp}")],
    ),
    connecting_frame=make_text_frame("Connexion en cours...\nVeuillez patienter"),
    encoder=PyAVVideoEncoder(fps=25, bitrate=3_000_000),
    on_transcription=lambda role, text, _: print(f"[{role}] {text}"),
)

await sip.start()

See examples/sip_anam_avatar.py for a complete runnable example with signal handling and graceful shutdown.

Known Characteristics

  • Startup delay (~4-5s): Anam's WebRTC negotiation (ICE gathering, TURN allocation). The SIP phone answers immediately but the avatar needs time to connect. This is inherent to the Anam SDK.
  • Video re-encoding: The Anam SDK decodes WebRTC video to raw pixels before exposing frames. RoomKit re-encodes to H.264 for SIP. This is a limitation of the SDK (no raw H.264 access).
  • Audio resampling: Anam outputs 48kHz stereo PCM (WebRTC OPUS decoded). The bridge resamples to the SIP codec rate (8kHz G.711, 16kHz G.722) automatically.

RealtimeAudioVideoChannel (Room-Based)

For applications that need RoomKit's room model:

from roomkit import RealtimeAudioVideoChannel, RoomKit
from roomkit.providers.anam.config import AnamConfig
from roomkit.providers.anam.realtime import AnamRealtimeProvider
from roomkit.voice.realtime.mock import MockRealtimeTransport

provider = AnamRealtimeProvider(AnamConfig(
    api_key="...", avatar_id="...", voice_id="...", llm_id="...",
))

channel = RealtimeAudioVideoChannel(
    "avatar",
    provider=provider,
    transport=MockRealtimeTransport(),
    vision=my_vision_provider,        # Optional
    vision_interval_ms=3000,
)

kit = RoomKit()
kit.register_channel(channel)

session = await channel.start_session("room-1", "user-1", connection)

Hooks

from roomkit import HookTrigger

@kit.hook(HookTrigger.ON_VIDEO_SESSION_STARTED)
async def on_video_started(event, ctx):
    print(f"Avatar video started: {event.session.id}")

@kit.hook(HookTrigger.ON_VIDEO_SESSION_ENDED)
async def on_video_ended(event, ctx):
    print(f"Avatar video ended: {event.session.id}")

Provider API

AnamRealtimeProvider wraps the Anam Python SDK:

Method Description
send_audio(session, bytes) Forward raw PCM audio to Anam (send_user_audio)
inject_text(session, text) Send text through the LLM (send_message)
interrupt(session) Cancel current avatar response
on_audio(callback) Register audio output callback
on_video(callback) Register video frame callback
on_transcription(callback) Register transcription callback

Frame Format

  • Video: VideoFrame(codec="raw_rgb24") — raw RGB pixels (720x480 typical)
  • Audio: PCM int16 mono at 48kHz (downmixed from stereo by the provider)
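Given these formats, buffer sizes are predictable, which is useful for sanity checks in taps and tests. A small sketch of the arithmetic, plus an illustrative stereo-to-mono downmix of the kind the provider applies internally (the helper names are ours, not RoomKit API):

```python
import struct

def rgb24_size(width: int, height: int) -> int:
    """Expected byte length of a raw_rgb24 frame: 3 bytes per pixel."""
    return width * height * 3

def downmix_stereo(pcm: bytes) -> bytes:
    """Average interleaved stereo int16 PCM down to mono."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    return struct.pack(f"<{len(mono)}h", *mono)

print(rgb24_size(720, 480))                    # 1036800 bytes per typical frame
print(len(downmix_stereo(b"\x00\x00" * 8)))    # 8  (4 stereo pairs -> 4 mono samples)
```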

Testing

Use MockRealtimeAudioVideoProvider for testing without Anam credentials:

from roomkit import RealtimeAudioVideoChannel
from roomkit.voice.realtime.mock import (
    MockRealtimeAudioVideoProvider,
    MockRealtimeTransport,
)
from roomkit.video.video_frame import VideoFrame

provider = MockRealtimeAudioVideoProvider()
transport = MockRealtimeTransport()

channel = RealtimeAudioVideoChannel(
    "test-avatar",
    provider=provider,
    transport=transport,
)

frame = VideoFrame(
    data=b"\x00" * (640 * 480 * 3),
    codec="raw_rgb24",
    width=640, height=480,
)
# `session` comes from channel.start_session(), as in the room-based example above
await provider.simulate_video(session, frame)

Examples

Example Description
examples/realtime_av_anam.py Basic Anam avatar with mock transport
examples/sip_anam_avatar.py SIP-to-Anam bridge with video pipeline and H.264 encoding