Gemini 2.5 + Anchor: Building a Production Vision Agent in Python

The Claude 4 + Anchor guide showed how to build a production computer-use agent (CUA) with Anthropic's vision API. But the CUA pattern is model-agnostic. This post wires Gemini 2.5 Pro — Google's leading multimodal model — into Anchor Browser for a fully capable, production-ready CUA in Python.

The core insight is always the same: take a screenshot, describe what you see, decide what to do, act, repeat. What changes is which model you call in the middle of that loop. Anchor handles the browser infrastructure; you choose the brain.

Why Gemini 2.5 Pro for Computer Use?

Gemini 2.5 Pro ships with a few characteristics that make it compelling for browser agents:

1M-token context window: room for long action histories and many screenshots in a single conversation before pruning becomes necessary
Native function calling: structured JSON output without prompt-engineering hacks
Strong visual reasoning: accurate element localization from screenshots, even on dense UIs
Competitive pricing: per-token cost that is often lower than similarly capable models (check current rates for your workload)

Setup

pip install google-generativeai anchor-browser playwright pydantic

export ANCHOR_API_KEY="your_anchor_key"
export GEMINI_API_KEY="your_gemini_key"

Defining Structured Actions

Before writing the agent loop, define exactly what actions Gemini can take. Pydantic models give you type-safe parsing and catch hallucinated action types before they reach the browser:

from __future__ import annotations
from pydantic import BaseModel, Field
from typing import Literal, Union
from typing_extensions import Annotated

class Click(BaseModel):
    action: Literal["click"]
    x: int
    y: int
    description: str

class Type(BaseModel):
    action: Literal["type"]
    text: str

class Scroll(BaseModel):
    action: Literal["scroll"]
    x: int
    y: int
    direction: Literal["up", "down"]
    amount: int = 300

class Navigate(BaseModel):
    action: Literal["navigate"]
    url: str

class Done(BaseModel):
    action: Literal["done"]
    result: str

BrowserAction = Annotated[
    Union[Click, Type, Scroll, Navigate, Done],
    Field(discriminator="action")
]
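
To see what the discriminated union buys you, the sketch below validates two hypothetical model replies with Pydantic v2's TypeAdapter: one well-formed click and one with an action type the schema does not define. The parse_action helper and the sample payloads are illustrative and not part of the agent code; two of the models are repeated, trimmed down, so the snippet runs on its own.

```python
from typing import Annotated, Literal, Optional, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class Click(BaseModel):
    action: Literal["click"]
    x: int
    y: int
    description: str


class Done(BaseModel):
    action: Literal["done"]
    result: str


# Same discriminated-union pattern as above, trimmed to two actions for brevity.
Action = Annotated[Union[Click, Done], Field(discriminator="action")]
adapter = TypeAdapter(Action)


def parse_action(raw: str) -> Optional[BaseModel]:
    """Return a validated action, or None if the reply does not match the schema."""
    try:
        return adapter.validate_json(raw)
    except ValidationError:
        # Covers malformed JSON, unknown action tags, and missing or mistyped fields.
        return None


good = parse_action('{"action": "click", "x": 120, "y": 48, "description": "Pricing link"}')
bad = parse_action('{"action": "hover", "x": 120, "y": 48}')  # "hover" is not a defined action

print(type(good).__name__)
print(bad)
```

Returning None is only one possible policy. In a real loop you might instead feed the validation error back to the model and ask it to try again.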

Connecting to Anchor

Anchor exposes a standard CDP endpoint, so once a session exists you drive it with Playwright’s connect_over_cdp. The Anchor client is only used to create the session; no proprietary automation API is required:

import os
import asyncio
from playwright.async_api import async_playwright
from anchor_browser import AnchorClient

anchor = AnchorClient(api_key=os.environ["ANCHOR_API_KEY"])

async def create_browser():
    session = anchor.sessions.create(
        proxy_country="us",
        options={"adblock": True, "blockTrackers": True}
    )
    pw = await async_playwright().start()
    browser = await pw.chromium.connect_over_cdp(session.ws_endpoint)
    context = browser.contexts[0]
    page = context.pages[0]
    return pw, browser, page, session

The Gemini CUA Loop

The loop takes a screenshot, asks Gemini what to do, executes the action, and repeats until Gemini signals it’s done or we hit the iteration cap:

import base64
import json
from pydantic import TypeAdapter
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")

ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["click","type","scroll","navigate","done"]},
        "x": {"type": "integer"},
        "y": {"type": "integer"},
        "text": {"type": "string"},
        "direction": {"type": "string", "enum": ["up","down"]},
        "amount": {"type": "integer"},
        "url": {"type": "string"},
        "result": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["action"],
}

action_adapter = TypeAdapter(BrowserAction)

async def run_agent(goal: str, start_url: str, max_steps: int = 20) -> str:
    pw, browser, page, session = await create_browser()
    history = []

    try:
        await page.goto(start_url)

        for step in range(max_steps):
            screenshot_bytes = await page.screenshot()
            b64_img = base64.b64encode(screenshot_bytes).decode()
            current_url = page.url

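            # Show only the last five actions, keeping their real step numbers.
            recent = history[-5:]
            offset = len(history) - len(recent)
            history_text = "\n".join(
                f"Step {offset + i + 1}: {h}" for i, h in enumerate(recent)
            )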
            prompt = (
                f"You are a browser agent. Your goal: {goal}\n\n"
                f"Current URL: {current_url}\n"
                f"Recent actions:\n{history_text or 'None yet'}\n\n"
                "Look at the screenshot and decide the single best next action.\n"
                f"Respond with valid JSON matching: {json.dumps(ACTION_SCHEMA)}\n\n"
                "Use \"done\" only when the goal is fully achieved."
            )

            # Use the async variant so the Gemini call does not block the event loop.
            response = await model.generate_content_async([
                prompt,
                {"mime_type": "image/png", "data": b64_img},
            ])

            raw = response.text.strip()
            if raw.startswith('```'):
                raw = raw.split(chr(10), 1)[1].rsplit('```', 1)[0].strip()

            action = action_adapter.validate_json(raw)
            history.append(f'{action.action}: {raw[:80]}')
            print(f'[step {step+1}] {action.action}')

            if isinstance(action, Click):
                await page.mouse.click(action.x, action.y)
                await page.wait_for_load_state('networkidle', timeout=5000)

            elif isinstance(action, Type):
                await page.keyboard.type(action.text, delay=30)

            elif isinstance(action, Scroll):
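                delta = action.amount if action.direction == 'down' else -action.amount
                # mouse.wheel() takes only (delta_x, delta_y), so move the pointer
                # first to make the scroll land on the element under (x, y).
                await page.mouse.move(action.x, action.y)
                await page.mouse.wheel(0, delta)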

            elif isinstance(action, Navigate):
                await page.goto(action.url)
                await page.wait_for_load_state('networkidle', timeout=10000)

            elif isinstance(action, Done):
                return action.result

        return f'Reached max steps ({max_steps}). Last URL: {page.url}'

    finally:
        await browser.close()
        await pw.stop()
        session.close()
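
The fence-stripping step in the loop assumes the reply is either bare JSON or a fenced block whose opening fence is followed by a newline. A more forgiving variant, sketched below with only the standard library, pulls the first JSON object out of the reply whether or not it is fenced or surrounded by a sentence of prose. The extract_json name and the sample replies are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a literal triple backtick, built this way so the snippet embeds cleanly
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(reply: str) -> dict:
    """Return the first JSON object found in a model reply.

    Handles bare JSON, JSON wrapped in a Markdown code fence, and JSON
    preceded or followed by short explanatory text.
    """
    text = reply.strip()
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost braces if there is still surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(text[start : end + 1])


bare = extract_json('{"action": "done", "result": "ok"}')
fenced = extract_json(FENCE + 'json\n{"action": "click", "x": 1, "y": 2}\n' + FENCE)
chatty = extract_json('Here is the next step: {"action": "type", "text": "hi"}')
print(bare["action"], fenced["action"], chatty["action"])
```

The resulting dict can then be handed to action_adapter.validate_python(...) in place of validate_json(raw), so schema validation still happens in one place.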

Running It

async def main():
    result = await run_agent(
        goal="Find the cheapest monthly plan and return the price",
        start_url="https://example-saas.com/pricing",
        max_steps=15,
    )
    print("Agent result:", result)

asyncio.run(main())

Production Additions

The loop above is the core. For production workloads, add these three layers:

Retry on network errors: wrap model.generate_content() in a tenacity retry with exponential backoff — transient Gemini API errors shouldn’t abort long-running sessions
Screenshot diffing: if two consecutive screenshots are identical and the action was a click, the element likely didn’t respond — detect this and either retry or report failure
Session warm-up: for agents that run the same workflow repeatedly, keep the Anchor session alive between runs using anchor.sessions.get(session_id) instead of creating a new one each time
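
The first two layers can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not Anchor's or Google's API: with_backoff is a hand-rolled stand-in for the tenacity retry mentioned above, screenshot_digest compares frames by SHA-256 (so it only catches byte-identical screenshots), and flaky_call is a fake model call used to exercise the retry. In a real agent, retry_on would need to list the exception types your Gemini client actually raises.

```python
import asyncio
import hashlib
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Await call(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time (base, 2x, 4x, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


def screenshot_digest(png: bytes) -> str:
    """Fingerprint a screenshot so consecutive frames can be compared cheaply."""
    return hashlib.sha256(png).hexdigest()


async def demo() -> Tuple[str, int]:
    calls = {"n": 0}

    async def flaky_call() -> str:
        # Fails twice with a transient error, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    result = await with_backoff(flaky_call, base_delay=0.01)
    return result, calls["n"]


result, n_calls = asyncio.run(demo())
print(result, n_calls)

# Identical frames produce identical digests; any pixel change produces a new one.
before, after = b"fake-png-bytes", b"fake-png-bytes"
print(screenshot_digest(before) == screenshot_digest(after))
```

In the loop, you would store the digest taken before a click and compare it with the digest of the next screenshot. Pages with animations or blinking cursors rarely produce byte-identical frames, so a perceptual comparison may be needed there.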

Choosing Between Models

Gemini 2.5 Pro and Claude excel at different things in the CUA loop. Gemini’s longer context window helps when you’re passing full action histories; Claude tends to produce more conservative, cautious actions on ambiguous screens. The cleanest production approach is to benchmark both against your specific workflow — Anchor’s session replay lets you record a run and compare outputs side-by-side.

The infrastructure doesn’t change. The Playwright connection, screenshot loop, and session management are identical regardless of which model drives the agent. That’s the point: Anchor abstracts the browser so you can focus on the intelligence layer.
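
One way to keep the loop independent of the model is to hide each provider behind a small interface. The sketch below uses a hypothetical ActionModel protocol and two fake backends in place of real Gemini and Claude clients; the class names and canned replies are placeholders for illustration only.

```python
import json
from typing import Dict, List, Protocol


class ActionModel(Protocol):
    """Anything that can turn a prompt plus a screenshot into a JSON action string."""

    name: str

    def decide(self, prompt: str, screenshot_png: bytes) -> str: ...


class FakeGemini:
    name = "fake-gemini"

    def decide(self, prompt: str, screenshot_png: bytes) -> str:
        # A real backend would call the Gemini API here.
        return '{"action": "done", "result": "$9/month"}'


class FakeClaude:
    name = "fake-claude"

    def decide(self, prompt: str, screenshot_png: bytes) -> str:
        # A real backend would call the Anthropic API here.
        return '{"action": "scroll", "x": 0, "y": 0, "direction": "down"}'


def compare(models: List[ActionModel], prompt: str, screenshot_png: bytes) -> Dict[str, str]:
    """Ask every backend for its next action on the same recorded step."""
    return {
        m.name: json.loads(m.decide(prompt, screenshot_png))["action"]
        for m in models
    }


choices = compare([FakeGemini(), FakeClaude()], "Find the cheapest plan", b"fake-png")
print(choices)
```

Because the agent loop would only depend on decide(), swapping models becomes a one-line change, and the same recorded screenshots can be replayed against each backend when benchmarking.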

Try Anchor free and run your first Gemini vision agent in minutes →
