The Claude 4 + Anchor guide showed how to build a production computer-use agent (CUA) with Anthropic's vision API. The CUA pattern itself is model-agnostic, though. This post wires Gemini 2.5 Pro, Google's multimodal model, into Anchor Browser to build the same kind of agent in Python.
The core insight is always the same: take a screenshot, describe what you see, decide what to do, act, repeat. What changes is which model you call in the middle of that loop. Anchor handles the browser infrastructure; you choose the brain.
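The loop can be sketched without committing to any model or browser. The following is a minimal illustration, not Anchor's or Gemini's API: `Brain`, `FakeBrowser`, `run_loop`, and `scripted_brain` are hypothetical names introduced here, and the "brain" is a plain callable that a real model call would replace.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical type: a "brain" maps (goal, screenshot bytes, history) to an action dict.
Brain = Callable[[str, bytes, list], dict]


@dataclass
class FakeBrowser:
    """Stand-in for a real browser session; it only records the clicks it receives."""
    clicks: list = field(default_factory=list)

    def screenshot(self) -> bytes:
        # A real browser would return PNG bytes; this returns a changing placeholder.
        return f"frame-{len(self.clicks)}".encode()

    def click(self, x: int, y: int) -> None:
        self.clicks.append((x, y))


def run_loop(goal: str, browser: FakeBrowser, brain: Brain, max_steps: int = 10) -> Optional[str]:
    """Screenshot, decide, act, repeat until the brain says 'done' or steps run out."""
    history: list = []
    for _ in range(max_steps):
        action = brain(goal, browser.screenshot(), history)
        history.append(action)
        if action["action"] == "done":
            return action["result"]
        if action["action"] == "click":
            browser.click(action["x"], action["y"])
    return None  # step cap reached without a 'done' action


def scripted_brain(goal: str, screenshot: bytes, history: list) -> dict:
    """Toy brain: click twice, then declare the goal achieved."""
    if len(history) < 2:
        return {"action": "click", "x": 10 * (len(history) + 1), "y": 20}
    return {"action": "done", "result": "finished"}
```

Swapping models then means swapping the callable passed as `brain`; the loop and the browser object stay the same.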
Why Gemini 2.5 Pro for Computer Use?
Gemini 2.5 Pro ships with a few characteristics that make it compelling for browser agents:
- 1M-token context window — fit hundreds of screenshots and action history in a single conversation without pruning
- Native function calling — structured JSON output without prompt engineering hacks
- Strong visual reasoning — accurate element localization from screenshots even on dense UIs
- Competitive pricing: per-token cost is generally lower than comparably capable models, though rates change, so check current pricing before committing
Setup
pip install google-generativeai anchor-browser playwright pydantic
export ANCHOR_API_KEY="your_anchor_key"
export GEMINI_API_KEY="your_gemini_key"
Defining Structured Actions
Before writing the agent loop, define exactly what actions Gemini can take. Pydantic models give you type-safe parsing and catch hallucinated action types before they reach the browser:
from __future__ import annotations
from pydantic import BaseModel, Field
from typing import Literal, Union
from typing_extensions import Annotated


class Click(BaseModel):
    action: Literal["click"]
    x: int
    y: int
    description: str


class Type(BaseModel):
    action: Literal["type"]
    text: str


class Scroll(BaseModel):
    action: Literal["scroll"]
    x: int
    y: int
    direction: Literal["up", "down"]
    amount: int = 300


class Navigate(BaseModel):
    action: Literal["navigate"]
    url: str


class Done(BaseModel):
    action: Literal["done"]
    result: str


BrowserAction = Annotated[
    Union[Click, Type, Scroll, Navigate, Done],
    Field(discriminator="action")
]
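To see what the discriminated union buys you, here is a small self-contained sketch (assuming Pydantic v2) that uses a trimmed-down version of the models above. The `try_parse` helper is a hypothetical name introduced for illustration; it returns `None` instead of raising so the agent loop could decide how to recover.

```python
from typing import Annotated, Literal, Optional, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class Click(BaseModel):
    action: Literal["click"]
    x: int
    y: int
    description: str


class Done(BaseModel):
    action: Literal["done"]
    result: str


# Trimmed-down union for the example; the full version lists all five actions.
Action = Annotated[Union[Click, Done], Field(discriminator="action")]
adapter = TypeAdapter(Action)


def try_parse(raw: str) -> Optional[BaseModel]:
    """Return a validated action, or None if the model output is unusable."""
    try:
        return adapter.validate_json(raw)
    except ValidationError:
        # Covers malformed JSON, unknown action types, and missing fields.
        return None


valid = try_parse('{"action": "click", "x": 120, "y": 48, "description": "Sign in"}')
unknown_action = try_parse('{"action": "hover", "x": 120, "y": 48}')
missing_field = try_parse('{"action": "click", "x": 120}')
```

Here `valid` is a `Click` instance, while the hallucinated `hover` action and the incomplete click are both rejected before any browser call is made.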
Connecting to Anchor
Anchor exposes a standard CDP endpoint, so once a session exists you drive it with Playwright’s connect_over_cdp. The Anchor client is only used to create and close the session:
import os
import asyncio
from playwright.async_api import async_playwright
from anchor_browser import AnchorClient

anchor = AnchorClient(api_key=os.environ["ANCHOR_API_KEY"])


async def create_browser():
    session = anchor.sessions.create(
        proxy_country="us",
        options={"adblock": True, "blockTrackers": True}
    )
    pw = await async_playwright().start()
    browser = await pw.chromium.connect_over_cdp(session.ws_endpoint)
    context = browser.contexts[0]
    page = context.pages[0]
    return pw, browser, page, session
The Gemini CUA Loop
The loop takes a screenshot, asks Gemini what to do, executes the action, and repeats until Gemini signals it’s done or we hit the iteration cap:
import json
from pydantic import TypeAdapter
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")

# Flat schema shown to the model in the prompt. The Pydantic union
# (BrowserAction) enforces which fields each action type actually requires.
ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["click","type","scroll","navigate","done"]},
        "x": {"type": "integer"},
        "y": {"type": "integer"},
        "text": {"type": "string"},
        "direction": {"type": "string", "enum": ["up","down"]},
        "amount": {"type": "integer"},
        "url": {"type": "string"},
        "result": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["action"],
}

action_adapter = TypeAdapter(BrowserAction)


async def settle(page, timeout_ms: int) -> None:
    """Wait for the network to go idle, but don't abort the run if it never does."""
    try:
        await page.wait_for_load_state('networkidle', timeout=timeout_ms)
    except PlaywrightTimeoutError:
        pass  # pages with polling or streaming requests may never go idle


async def run_agent(goal: str, start_url: str, max_steps: int = 20) -> str:
    pw, browser, page, session = await create_browser()
    history = []
    try:
        await page.goto(start_url)
        for step in range(max_steps):
            screenshot_bytes = await page.screenshot()
            current_url = page.url
            # Only the last five actions are sent, to keep the prompt short
            history_text = "\n".join(
                f'Step {i+1}: {h}' for i, h in enumerate(history[-5:])
            )
            prompt = (
                f"You are a browser agent. Your goal: {goal}\n\n"
                f"Current URL: {current_url}\n"
                f"Recent actions:\n{history_text or 'None yet'}\n\n"
                "Look at the screenshot and decide the single best next action.\n"
                f"Respond with valid JSON matching: {json.dumps(ACTION_SCHEMA)}\n\n"
                "Use \"done\" only when the goal is fully achieved."
            )
            # Use the async variant so the model call doesn't block the event loop;
            # the SDK accepts raw image bytes in an inline blob
            response = await model.generate_content_async([
                prompt,
                {"mime_type": "image/png", "data": screenshot_bytes},
            ])
            raw = response.text.strip()
            # Strip a Markdown code fence if the model wrapped its JSON in one
            if raw.startswith('```'):
                raw = raw.split("\n", 1)[1].rsplit('```', 1)[0].strip()
            action = action_adapter.validate_json(raw)
            history.append(f'{action.action}: {raw[:80]}')
            print(f'[step {step+1}] {action.action}')
            if isinstance(action, Click):
                await page.mouse.click(action.x, action.y)
                await settle(page, 5000)
            elif isinstance(action, Type):
                await page.keyboard.type(action.text, delay=30)
            elif isinstance(action, Scroll):
                delta = action.amount if action.direction == 'down' else -action.amount
                # mouse.wheel() takes only (delta_x, delta_y), so position the pointer first
                await page.mouse.move(action.x, action.y)
                await page.mouse.wheel(0, delta)
            elif isinstance(action, Navigate):
                await page.goto(action.url)
                await settle(page, 10000)
            elif isinstance(action, Done):
                return action.result
        return f'Reached max steps ({max_steps}). Last URL: {page.url}'
    finally:
        await browser.close()
        await pw.stop()
        session.close()
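The fence-stripping step in the loop only handles replies that start with a code fence. A slightly more tolerant parser is sketched below using only the standard library; `extract_json` is a hypothetical helper name, and it assumes the reply contains a single JSON object, possibly surrounded by prose.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to avoid nesting fences here


def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a model reply.

    Handles a bare object, an object wrapped in a code fence, and an object
    with prose before or after it. Raises ValueError if none is found.
    """
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag) ...
        _, _, text = text.partition("\n")
        # ... and everything from the closing fence onward.
        text = text.rsplit(FENCE, 1)[0]
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model reply")
    # raw_decode parses one value starting at `start` and ignores trailing text.
    # json.JSONDecodeError is a subclass of ValueError, so callers catch one type.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj
```

The result can be re-serialized with `json.dumps` and passed to `action_adapter.validate_json`, so schema validation still happens in one place.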
Running It
async def main():
    result = await run_agent(
        goal="Find the cheapest monthly plan and return the price",
        start_url="https://example-saas.com/pricing",
        max_steps=15,
    )
    print("Agent result:", result)


asyncio.run(main())
Production Additions
The loop above is the core. For production workloads, add these three layers:
- Retry on network errors: wrap the model.generate_content() call (or its async variant) in a tenacity retry with exponential backoff; transient Gemini API errors shouldn’t abort long-running sessions
- Screenshot diffing: if two consecutive screenshots are identical and the action was a click, the element likely didn’t respond; detect this and either retry or report failure
- Session warm-up: for agents that run the same workflow repeatedly, keep the Anchor session alive between runs using anchor.sessions.get(session_id) instead of creating a new one each time
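The first two items can be sketched with the standard library alone. Note the swap: this uses a hand-rolled backoff loop rather than tenacity, purely to keep the example dependency-free. `with_backoff`, `frame_digest`, and `click_had_no_effect` are hypothetical names, and the exception types worth retrying depend on the SDK you call, so they are passed in rather than assumed.

```python
import asyncio
import hashlib
import random


async def with_backoff(call, attempts=4, base_delay=0.5,
                       retry_on=(ConnectionError, TimeoutError)):
    """Await call() and retry on the given exceptions with exponential backoff.

    `call` is a zero-argument function returning an awaitable. The delay doubles
    after each failure, plus some jitter. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return await call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))


def frame_digest(png_bytes: bytes) -> str:
    """Hash a screenshot so consecutive frames can be compared cheaply."""
    return hashlib.sha256(png_bytes).hexdigest()


def click_had_no_effect(before: bytes, after: bytes) -> bool:
    """True when the screenshots taken before and after a click are byte-identical."""
    return frame_digest(before) == frame_digest(after)
```

In the agent loop this might look like `await with_backoff(lambda: model.generate_content_async(parts))`, with `retry_on` set to whatever transient errors the Gemini SDK raises. Byte-level comparison is deliberately strict: a blinking cursor or an animation makes two frames differ, so an identical pair is a strong signal, but a differing pair does not prove the click worked.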
Choosing Between Models
Gemini 2.5 Pro and Claude excel at different things in the CUA loop. Gemini’s longer context window helps when you’re passing full action histories; Claude tends to produce more conservative, cautious actions on ambiguous screens. The cleanest production approach is to benchmark both against your specific workflow — Anchor’s session replay lets you record a run and compare outputs side-by-side.
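A benchmark of this kind does not need much machinery. The sketch below is one possible shape, assuming each agent is an async callable that takes a task description and returns an answer string. `RunResult`, `benchmark`, and `success_rate` are hypothetical names, and the agents you pass in would wrap your own `run_agent` variants.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str
    task: str
    success: bool
    seconds: float


async def benchmark(agents: dict, tasks: list) -> list:
    """Run every agent on every task and record success and wall-clock time.

    `agents` maps a label to an async callable; `tasks` is a list of
    (task description, check function) pairs, where check(answer) returns a bool.
    """
    results = []
    for name, agent in agents.items():
        for task, check in tasks:
            started = time.perf_counter()
            try:
                success = bool(check(await agent(task)))
            except Exception:
                success = False  # a crashed run counts as a failure
            results.append(RunResult(name, task, success, time.perf_counter() - started))
    return results


def success_rate(results: list, model: str) -> float:
    """Fraction of successful runs for one model (0.0 if it has no runs)."""
    runs = [r for r in results if r.model == model]
    return sum(r.success for r in runs) / len(runs) if runs else 0.0
```

Running each task several times per model gives a more honest picture, since agent runs are not deterministic.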
The infrastructure doesn’t change. The Playwright connection, screenshot loop, and session management are identical regardless of which model drives the agent. That’s the point: Anchor abstracts the browser so you can focus on the intelligence layer.



