Web pages aren't APIs. A product price lives inside a <span> with three nested divs and a class name like price--highlight-v2. An agent reading that page might return "$99", "99.0", or "USD 99/mo" depending on how the DOM happened to render at extraction time.
The fix isn't better prompts. It's treating structured output as a hard contract rather than a polite request. In this post we'll pair Anchor browser sessions with Pydantic and the Claude API to extract clean, validated data from any website—reliably.
Why Unstructured Extraction Fails at Scale
Free-form extraction breaks in three predictable ways:
- Inconsistent formats: prices without currency symbols, dates in mixed locales, nullable fields returned as empty strings instead of null.
- Silent failures: the agent returns a structurally valid result that is semantically wrong, and your downstream code accepts it without complaint.
- No schema enforcement: every prompt tweak risks reshaping output in ways the rest of your pipeline can't handle.
Pydantic solves all three. Define the shape you expect, let the LLM fill it in, and let Pydantic reject anything that doesn't conform.
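As a rough illustration of that rejection step, the sketch below feeds a trimmed-down model two of the malformed values mentioned above. The `Price` model, its field names, and the `rejected_fields` helper are hypothetical, chosen only for this example, and the code assumes Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Price(BaseModel):
    # Hypothetical two-field model, used only for this illustration
    amount: int
    discount_percent: Optional[int] = None


def rejected_fields(raw: dict) -> list[str]:
    """Return the names of the fields Pydantic refused, or [] if the record is valid."""
    try:
        Price(**raw)
    except ValidationError as exc:
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


# A free-form string where an integer is expected is rejected
print(rejected_fields({"amount": "USD 99/mo"}))
# An empty string standing in for null is rejected as well
print(rejected_fields({"amount": 99, "discount_percent": ""}))
# A clean record passes
print(rejected_fields({"amount": 99, "discount_percent": None}))
```

Returning the failing field names instead of swallowing the exception makes it easy to log which parts of a page the model tends to get wrong.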
Setup
You'll need an Anchor API key and a few packages:
pip install anthropic pydantic anchorpy requests
Anchor provisions each session as an isolated, full Chromium browser with a consistent fingerprint. Sessions are ephemeral by default—there is no state leakage or cookie bleed between jobs.
Define Your Schema
Let's extract job listings from a public job board. Here's the Pydantic model:
from typing import Optional

from pydantic import BaseModel


class JobListing(BaseModel):
    title: str
    company: str
    location: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    currency: str = "USD"
    remote: bool = False
    posted_days_ago: Optional[int] = None
Key decisions: monetary values are integers (not strings), booleans are typed, and optional fields carry sensible defaults. The schema will travel with the request to the LLM, so the model has an explicit output contract rather than an implicit one.
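Type checks alone won't catch a listing whose minimum salary exceeds its maximum. One possible extension, sketched here assuming Pydantic v2, adds a cross-field check. The `CheckedJobListing` name, the validator, and the `is_valid` helper are illustrative additions, not part of the model above.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, model_validator


class CheckedJobListing(BaseModel):
    # Same fields as JobListing, repeated so the sketch is self-contained
    title: str
    company: str
    location: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None
    currency: str = "USD"
    remote: bool = False
    posted_days_ago: Optional[int] = None

    @model_validator(mode="after")
    def salary_range_is_ordered(self) -> "CheckedJobListing":
        # Reject listings where the salary range is inverted
        if (
            self.salary_min is not None
            and self.salary_max is not None
            and self.salary_min > self.salary_max
        ):
            raise ValueError("salary_min must not exceed salary_max")
        return self


def is_valid(raw: dict) -> bool:
    """True if the raw record passes both type and cross-field validation."""
    try:
        CheckedJobListing(**raw)
    except ValidationError:
        return False
    return True
```

A check like this turns one class of silent failure, a structurally valid but semantically wrong record, into an explicit validation error.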
Extract with Anchor + Claude
Here is the full extraction function:
import json

import anthropic
from anchorpy import AnchorClient
from pydantic import ValidationError

anchor = AnchorClient(api_key="your-anchor-api-key")
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_job_listings(url: str) -> list[JobListing]:
    # Isolated browser session: no state from previous runs
    with anchor.session() as session:
        page = session.navigate(url)
        text = page.get_text()  # clean semantic text, not raw HTML

    response = claude.messages.create(
        model="claude-opus-4-8",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Extract all job listings from this page.\n"
                "Return a JSON array matching this schema:\n"
                f"{json.dumps(JobListing.model_json_schema(), indent=2)}\n"
                f"Page content:\n{text[:8000]}"
            ),
        }],
    )

    listings = []
    # Validate each item separately so one bad listing doesn't abort the batch
    for item in json.loads(response.content[0].text):
        try:
            listings.append(JobListing(**item))
        except ValidationError as e:
            print(f"Skipping invalid listing: {e}")
    return listings
Three things make this reliable at scale:
- page.get_text() strips boilerplate HTML and returns clean semantic content, cutting token usage by 60–80% versus raw DOM.
- The Pydantic schema JSON is embedded in the prompt, giving the model an explicit output contract.
- Validation runs per-item: one malformed listing doesn't abort the whole batch.
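One fragile spot in the function above is the direct json.loads call on the model's reply, which raises if the model wraps the array in prose or a code fence. A minimal, stdlib-only sketch of a more tolerant parser might look like this; `parse_json_array` is a hypothetical helper, not part of any SDK.

```python
import json


def parse_json_array(reply: str) -> list:
    """Extract the outermost JSON array from a model reply.

    Tolerates text before and after the array. Raises ValueError
    (json.JSONDecodeError is a subclass) if no parseable array is found.
    """
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in reply")
    return json.loads(reply[start : end + 1])
```

This is deliberately simple: a stray bracket in the surrounding prose will still make it fail, but it fails loudly with a ValueError rather than returning partial data.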
Handling Dynamic Pages
Some pages load listings via JavaScript after the initial render. Anchor waits for network idle before returning content, so you don't need sleep() calls or DOM-polling loops.
For pages behind authentication, OmniConnect attaches persistent sessions with stored credentials so agents never need to re-login between runs.
Scaling Up
Need to scrape hundreds of pages concurrently? Anchor sessions are designed to be parallelized. Each session gets its own isolated browser context, so you can run dozens in parallel without fingerprint or cookie collisions:
import asyncio

from anchorpy import AsyncAnchorClient


async def scrape_all(urls: list[str]) -> list[list[JobListing]]:
    async with AsyncAnchorClient(api_key="your-anchor-api-key") as anchor:
        # extract_job_listings_async is the async counterpart of
        # extract_job_listings above (not shown here)
        tasks = [extract_job_listings_async(anchor, url) for url in urls]
        return await asyncio.gather(*tasks)
Pydantic validation still runs per-item in every task, so your data contract holds regardless of how many sessions run in parallel.
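With hundreds of URLs you will probably want to cap how many sessions are open at once and keep one failed page from discarding the other results. The sketch below shows one way to do that with only the standard library; `gather_bounded`, its `worker` parameter, and the default limit of 10 are assumptions made for illustration, and `worker` stands in for whatever per-URL coroutine you use.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    urls: list[str],
    worker: Callable[[str], Awaitable[list]],
    limit: int = 10,
) -> dict[str, list]:
    """Run worker(url) for every URL with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(url: str) -> list:
        async with semaphore:
            return await worker(url)

    # return_exceptions=True returns failures as values, so one failed
    # page does not discard the results of the others
    outcomes = await asyncio.gather(
        *(run_one(url) for url in urls), return_exceptions=True
    )

    results: dict[str, list] = {}
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, BaseException):
            print(f"Failed to scrape {url}: {outcome}")
        else:
            results[url] = outcome
    return results
```

Keying the results by URL, rather than returning a bare list, makes it obvious afterwards which pages produced no data.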
Taking It Further
Once you have typed models, the downstream pipeline writes itself:
- Dump to PostgreSQL with listing.model_dump() into a typed SQLAlchemy table.
- Compare sequential runs for diff-based monitoring: which listings disappeared since yesterday?
- Feed clean structured records into a downstream reasoning step that operates on data, not raw HTML.
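The diff-based monitoring idea can be sketched in a few lines, operating on the plain dicts that listing.model_dump() produces. Treating title, company, and location together as a listing's identity is an assumption made for this example, and `listing_key` and `diff_runs` are hypothetical helper names.

```python
def listing_key(record: dict) -> tuple:
    # Assumption: title + company + location identifies a listing well enough
    return (record["title"], record["company"], record["location"])


def diff_runs(previous: list[dict], current: list[dict]) -> dict[str, list[dict]]:
    """Report which listings were added or removed between two runs."""
    prev_by_key = {listing_key(r): r for r in previous}
    curr_by_key = {listing_key(r): r for r in current}
    return {
        "removed": [r for k, r in prev_by_key.items() if k not in curr_by_key],
        "added": [r for k, r in curr_by_key.items() if k not in prev_by_key],
    }
```

Because every record has already passed validation, the key fields are guaranteed to be present, so the comparison needs no defensive checks.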
The browser agent handles the messy web. Pydantic enforces the contract. You write the logic that actually matters.
Start a free Anchor session and run this pipeline in under five minutes — session management, browser isolation, and clean text extraction are all handled for you.



