Browser Agents + Pydantic: Type-Safe Structured Data Extraction

Technical Dive
Jun 5
by Idan Raman

Web pages aren't APIs. A product price lives inside a <span> with three nested divs and a class name like price--highlight-v2. An agent reading that page might return "$99", "99.0", or "USD 99/mo" depending on how the DOM happened to render at extraction time.

The fix isn't better prompts. It's treating structured output as a hard contract rather than a polite request. In this post we'll pair Anchor browser sessions with Pydantic and the Claude API to extract clean, validated data from any website—reliably.

Why Unstructured Extraction Fails at Scale

Free-form extraction breaks in three predictable ways:

  • Inconsistent formats — Prices without currency symbols, dates in mixed locales, nullable fields returned as empty strings instead of null.
  • Silent failures — The agent returns a structurally valid result that is semantically wrong. Your downstream code accepts it without complaint.
  • No schema enforcement — Every prompt tweak risks reshaping output in ways the rest of your pipeline can't handle.

Pydantic solves all three. Define the shape you expect, let the LLM fill it in, and let Pydantic reject anything that doesn't conform.
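As a minimal sketch of that rejection step, an integer field accepts a clean number and refuses a free-form string. The PricePoint model, the try_parse helper, and the sample payloads are invented for illustration and do not come from a real page.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class PricePoint(BaseModel):
    # Hypothetical model, used only for this illustration
    amount: int
    currency: str = "USD"


def try_parse(raw: dict) -> Optional[PricePoint]:
    """Return a validated PricePoint, or None if the payload breaks the contract."""
    try:
        return PricePoint.model_validate(raw)
    except ValidationError:
        return None


ok = try_parse({"amount": 99})              # conforms: amount is an integer
bad = try_parse({"amount": "USD 99/mo"})    # rejected: not parseable as an integer
missing = try_parse({"currency": "EUR"})    # rejected: required field is absent
```

The point of the design is that a malformed value surfaces as an explicit failure at the boundary instead of flowing silently into downstream code.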

Setup

You'll need an Anchor API key, an Anthropic API key, and a few packages:

pip install anthropic pydantic anchorpy requests

Anchor provisions each session as an isolated, full Chromium browser with a consistent fingerprint. Sessions are ephemeral by default—there is no state leakage or cookie bleed between jobs.

Define Your Schema

Let's extract job listings from a public job board. Here's the Pydantic model:

from pydantic import BaseModel
from typing import Optional

class JobListing(BaseModel):
    title: str
    company: str
    location: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    currency: str = "USD"
    remote: bool = False
    posted_days_ago: Optional[int] = None

Key decisions: monetary values are integers (not strings), booleans are typed, and optional fields carry sensible defaults. The schema will travel with the request to the LLM, so the model has an explicit output contract rather than an implicit one.
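To see what that contract looks like in practice, the sketch below repeats the model so it runs on its own, then inspects the JSON Schema that Pydantic v2 generates with model_json_schema(). The sample listing values are invented.

```python
from typing import Optional

from pydantic import BaseModel


class JobListing(BaseModel):
    title: str
    company: str
    location: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    currency: str = "USD"
    remote: bool = False
    posted_days_ago: Optional[int] = None


schema = JobListing.model_json_schema()

# Only fields without a default are listed as required in the schema
required = set(schema["required"])

# Defaults are part of the schema, so the LLM sees them too
currency_default = schema["properties"]["currency"]["default"]

# A listing that supplies only the required fields picks up the defaults
minimal = JobListing(title="Data Engineer", company="Acme", location="Berlin")
```

Because the defaults live in the schema, a listing that omits salary or remote status still validates, while one that omits the title does not.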

Extract with Anchor + Claude

Here is the full extraction function:

import anthropic
import json
from anchorpy import AnchorClient
from pydantic import ValidationError

anchor = AnchorClient(api_key="your-anchor-api-key")
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_job_listings(url: str) -> list[JobListing]:
    # Isolated browser session — no state from previous runs
    with anchor.session() as session:
        page = session.navigate(url)
        text = page.get_text()   # clean semantic text, not raw HTML

    response = claude.messages.create(
        model="claude-opus-4-8",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Extract all job listings from this page.\n"
                "Return a JSON array matching this schema:\n"
                f"{json.dumps(JobListing.model_json_schema(), indent=2)}\n\n"
                f"Page content:\n{text[:8000]}"
            )
        }]
    )

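    listings = []
    # Assumes the reply is a bare JSON array; json.loads raises otherwise
    for item in json.loads(response.content[0].text):
        try:
            listings.append(JobListing.model_validate(item))
        except ValidationError as e:
            # Drop only the bad item and keep the rest of the batch
            print(f"Skipping invalid listing: {e}")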

    return listings

Three things make this reliable at scale:

  • page.get_text() strips boilerplate HTML and returns clean semantic content, cutting token usage by 60–80% versus raw DOM.
  • The Pydantic schema JSON is embedded in the prompt, giving the model an explicit output contract.
  • Validation runs per-item — one malformed listing doesn't abort the whole batch.
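The per-item validation can be sketched offline, without a browser session or an API call. The parse_listings helper below is hypothetical: it assumes the model may wrap the array in a sentence of prose, slices to the outermost brackets, and collects errors instead of printing them. The sample response text is made up.

```python
import json
from typing import Optional

from pydantic import BaseModel, ValidationError


class JobListing(BaseModel):
    title: str
    company: str
    location: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    currency: str = "USD"
    remote: bool = False
    posted_days_ago: Optional[int] = None


def parse_listings(raw: str) -> tuple[list[JobListing], list[str]]:
    """Validate each item of the JSON array found in raw model output."""
    # Tolerate surrounding prose by slicing to the outermost brackets
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return [], ["no JSON array found in response"]
    try:
        items = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc}"]

    valid: list[JobListing] = []
    errors: list[str] = []
    for index, item in enumerate(items):
        try:
            valid.append(JobListing.model_validate(item))
        except ValidationError as exc:
            # Record the failure and move on to the next item
            errors.append(f"item {index}: {exc.error_count()} validation error(s)")
    return valid, errors


sample = """Here are the listings:
[
  {"title": "Backend Engineer", "company": "Acme", "location": "Remote",
   "remote": true, "salary_min": 120000},
  {"title": "Designer", "company": "Acme", "location": "Paris",
   "salary_min": "competitive"},
  {"company": "Globex", "location": "Austin"}
]"""

listings, problems = parse_listings(sample)
```

Here the first item validates, the second fails on a non-numeric salary, and the third fails on a missing title, so one clean listing survives alongside two recorded problems. The bracket slicing is a simple heuristic and would misfire if the surrounding prose itself contained square brackets.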

Handling Dynamic Pages

Some pages load listings via JavaScript after the initial render. Anchor waits for network idle before returning content, so you don't need sleep() calls or DOM-polling loops.

For pages behind authentication, OmniConnect attaches persistent sessions with stored credentials so agents never need to re-login between runs.

Scaling Up

Need to scrape hundreds of pages concurrently? Anchor sessions are designed to be parallelized. Each session gets its own isolated browser context, so you can run dozens in parallel without fingerprint or cookie collisions:

import asyncio
from anchorpy import AsyncAnchorClient

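async def scrape_all(urls: list[str]) -> list[list[JobListing]]:
    async with AsyncAnchorClient(api_key="your-anchor-api-key") as anchor:
        # extract_job_listings_async: an async port of extract_job_listings
        # above that takes the shared client (not defined in this post)
        tasks = [extract_job_listings_async(anchor, url) for url in urls]
        # gather() returns results in the same order as urls
        return await asyncio.gather(*tasks)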

Pydantic validation still runs per-item in every task, so your data contract holds regardless of how many sessions run in parallel.
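When the URL list is long, it can help to cap how many extractions are in flight and to keep one failed page from discarding the other results. The sketch below shows that pattern with asyncio.Semaphore and gather(return_exceptions=True). The fake_extract coroutine is a hypothetical stand-in for a real extraction call, and the URLs are placeholders, so it runs without network access.

```python
import asyncio


async def fake_extract(url: str) -> list[str]:
    # Hypothetical stand-in for an async extraction call; no network access
    await asyncio.sleep(0.01)
    if "broken" in url:
        raise RuntimeError(f"could not load {url}")
    return [f"listing from {url}"]


async def scrape_bounded(urls: list[str], limit: int = 5):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` extractions in flight

    async def guarded(url: str):
        async with semaphore:
            return await fake_extract(url)

    # return_exceptions=True returns failures as values instead of raising
    # the first one, so successful pages are not lost
    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
    succeeded = {u: r for u, r in zip(urls, results) if not isinstance(r, BaseException)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, BaseException)}
    return succeeded, failed


urls = [
    "https://example.com/jobs?page=1",
    "https://example.com/broken",
    "https://example.com/jobs?page=2",
]
succeeded, failed = asyncio.run(scrape_bounded(urls, limit=2))
```

Splitting the results into succeeded and failed makes it easy to retry only the pages that raised.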

Taking It Further

Once you have typed models, the downstream steps become straightforward:

  • Dump to PostgreSQL with listing.model_dump() into a typed SQLAlchemy table.
  • Compare sequential runs for diff-based monitoring — which listings disappeared since yesterday?
  • Feed clean structured records into a downstream reasoning step that operates on data, not raw HTML.
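The diff-based monitoring idea can be sketched with plain dictionaries. This example uses a trimmed JobListing model and assumes that title, company, and location together identify a listing; a real job board may expose a stable ID that would be a better key. All sample data is invented.

```python
from typing import Optional

from pydantic import BaseModel


class JobListing(BaseModel):
    # Trimmed version of the model, enough for the comparison
    title: str
    company: str
    location: str
    salary_min: Optional[int] = None


def listing_key(listing: JobListing) -> tuple[str, str, str]:
    # Assumed identity for a listing; swap in a stable ID if one exists
    return (listing.title, listing.company, listing.location)


def diff_runs(previous: list[JobListing], current: list[JobListing]):
    """Return (added, removed) listings between two extraction runs."""
    prev = {listing_key(item): item for item in previous}
    curr = {listing_key(item): item for item in current}
    added = [curr[key] for key in curr if key not in prev]
    removed = [prev[key] for key in prev if key not in curr]
    return added, removed


yesterday = [
    JobListing(title="Backend Engineer", company="Acme", location="Remote"),
    JobListing(title="Designer", company="Acme", location="Paris"),
]
today = [
    JobListing(title="Backend Engineer", company="Acme", location="Remote"),
    JobListing(title="Data Analyst", company="Globex", location="Austin", salary_min=90000),
]

added, removed = diff_runs(yesterday, today)

# model_dump() gives a plain dict that can be handed to a database layer
new_rows = [item.model_dump() for item in added]
```

Because both runs produce the same typed records, the comparison needs no format cleanup before it can run.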

The browser agent handles the messy web. Pydantic enforces the contract. You write the logic that actually matters.

Start a free Anchor session and run this pipeline in under five minutes — session management, browser isolation, and clean text extraction are all handled for you.
