Technical Dive

September 11, 2025

Integrate Deepgram and Anchor Browser

The next evolution of web browsing isn't faster internet or better search engines. It's eliminating the need to manually browse at all. Instead of typing your intent and clicking through a maze of links, you'll speak conversationally and AI agents will navigate, analyze, and return exactly what you need from any website. Imagine deploying several agents simultaneously when you need to compare information across sources, such as when doing research online. Instead of opening multiple tabs and manually searching individual sites, you could simply speak your request and have AI agents handle the tedious browsing work for you.

This shift represents more than convenience. Browsing with voice-driven commands is a fundamental change in how we interact with the internet. While we’ve grown accustomed to voice assistants like Alexa or Siri handling simple tasks like setting timers or checking weather, complex web search still requires opening browsers, typing queries, and manually navigating through multiple web pages. Voice-driven agentic browsing eliminates this friction entirely.

In the following tutorial, we'll build a voice-controlled system where you can speak queries like, "What are the latest headlines in the NY Post?" and receive immediate results without opening a browser or typing anything. Deepgram will transcribe your speech, and the system will pass that intent as a request to Anchor Browser's web task API. An AI agent complete with full browsing and computer vision capabilities will launch, complete a thorough search, and return results. All in seconds and without you lifting a finger.

For the purposes of this example, we provide a pre-recorded voice command to Deepgram's speech-to-text API, but you may extend the script to process live audio instead. Either way, the transcribed text serves as the prompt for the Anchor Browser agent. This combination creates a powerful pipeline that transforms spoken requests into actionable web automation.

What is Deepgram?

Deepgram provides a real-time speech-to-text conversion API that processes both live microphone input and uploaded audio files. Unlike basic speech recognition tools, Deepgram handles various accents, speaking speeds, and audio quality levels with enterprise-grade accuracy. The API readily integrates with dozens of audio-based tools such as Twilio, Amazon Connect, and Zoom. For our example, we’ll pass along a pre-recorded WAV audio file that Deepgram will transcribe and format appropriately.

What is Anchor Browser?

Anchor Browser deploys AI agents that navigate websites using computer vision and natural language commands. Rather than relying on brittle CSS selectors or pre-determined page structures, these agents analyze websites visually and understand content contextually as it is rendered on a screen, much like a human would. They can adapt to layout changes, handle dynamic content, and extract information from complex page structures. The agents take screenshots of web pages and use advanced AI models to interpret what they see, then perform actions based on natural language instructions.

The Voice to Web Browser Pipeline

Our example shows a seamless flow from speech to transcription to browser navigation to results extraction and, finally, a coherent response. Your voice (or any audio file) gets converted to text by Deepgram. This text becomes the natural language command for Anchor Browser's AI agent. Once invoked, the agent navigates to the appropriate websites and extracts the most relevant information. This pipeline makes speech the natural starting point for completing any web query or task, creating an entirely new interaction model for the web. The benefit is immediate access to web content, without manually chasing down dozens of rabbit holes through typed queries and clicks.


Project Setup

Prerequisites

Before we dive into the code, you’ll need:

  • Node.js (version 16 or higher)

  • Deepgram API key - sign up at deepgram.com for their free tier

  • Anchor Browser API key - get your API key from the Anchor Browser dashboard

Dependencies Installation

Initialize a new Node.js project and install the required packages:

npm init -y
npm install @deepgram/sdk anchorbrowser dotenv

Package.json Note

Make sure your package.json includes "type": "module" to enable modern import syntax:

{
  "name": "voice-web-automation",
  "version": "1.0.0",
  "type": "module",
  // ... rest of your package.json
}

Environment Configuration

Create a .env file in your project root with the following variables:

DEEPGRAM_API_KEY=your_deepgram_api_key_here
ANCHOR_BROWSER_API_KEY=sk-your_anchor_browser_api_key_here

Make sure to add .env to your .gitignore file to avoid accidentally uploading your keys to git:

echo ".env" >> .gitignore

Core Implementation of Deepgram and Anchor Browser

Let’s build the complete voice to web browsing automation system. We’ll create a script that takes a pre-recorded audio file stored locally, converts it to text with Deepgram, and then uses that text to control an Anchor Browser session.

First, create a new file named voice-web-automation.js:

import { createClient } from '@deepgram/sdk';
import Anchorbrowser from 'anchorbrowser';
import fs from 'fs';
import dotenv from 'dotenv';

// Load environment variables
dotenv.config();

// Initialize clients
const deepgramClient = createClient(process.env.DEEPGRAM_API_KEY);
const anchorClient = new Anchorbrowser({
  apiKey: process.env.ANCHOR_BROWSER_API_KEY
});

async function processVoiceCommand(audioFilePath) {
  try {
    // Step 1: Convert speech to text with Deepgram
    console.log('Transcribing audio...');
    
    const { result, error } = await deepgramClient.listen.prerecorded.transcribeFile(
      fs.createReadStream(audioFilePath),
      {
        model: "nova-2",
        language: "en-US",
        punctuate: true
      }
    );

    if (error) {
      throw new Error(`Deepgram transcription error: ${error}`);
    }
    
    const transcript = result.results.channels[0].alternatives[0].transcript;
    console.log(`Transcribed: "${transcript}"`);
    
    // Step 2: Process the transcript to determine web action
    const webCommand = extractWebCommand(transcript);
    console.log(`Web command: ${webCommand.action}`);
    
    // Step 3: Execute web automation with Anchor Browser
    const webResult = await executeWebAutomation(webCommand);
    
    return {
      transcript,
      result: webResult
    };
    
  } catch (error) {
    console.error('Error processing voice command:', error);
    throw error;
  }
}

function extractWebCommand(transcript) {
  // Simply pass the transcript as the action - let Anchor Browser's AI figure out what to do
  return {
    action: transcript
  };
}

async function executeWebAutomation({ action }) {
  let session;
  
  try {
    console.log('Creating browser session...');
    session = await anchorClient.sessions.create();
    
    console.log('Executing web automation...');
    const response = await anchorClient.tools.performWebTask({
      sessionId: session.data.id,
      prompt: action  // Just pass the raw transcript
    });
    
    const result = response.data.result?.result || response.data.result || response.data;
    return result;
    
  } finally {
    if (session?.data?.id) {
      try {
        await anchorClient.sessions.delete(session.data.id);
        console.log('Browser session cleaned up');
      } catch (cleanupError) {
        console.warn('Failed to cleanup session:', cleanupError);
      }
    }
  }
}

// Example usage
async function main() {
  try {
    const result = await processVoiceCommand('./sample-audio.wav');
    console.log('\n=== FINAL RESULT ===');
    console.log('You said:', result.transcript);
    console.log('Web result:', result.result);
  } catch (error) {
    console.error('Failed to process voice command:', error);
  }
}

if (import.meta.url === `file://${process.argv[1]}`) {
  main();
}

export { processVoiceCommand };

GitHub gist: voice-web-automation.js

Next, download our sample audio file: sample-audio.wav

Deepgram's nova-2 model provides high-accuracy transcription with punctuation and handles varying audio quality. The SDK accepts standard audio formats, including the WAV file used here. Next, the extractWebCommand function is intentionally minimal: it passes the raw transcript straight through as the agent's prompt, letting Anchor Browser's AI interpret the intent. You could extend it to pre-process or route commands yourself, as sketched below.
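
If you'd rather pre-process commands before handing them to the agent, a lightweight keyword router is one option. The sketch below is purely illustrative: the SITE_MAP table and the site field are hypothetical additions, not part of the Anchor Browser API.

// Hypothetical extension of extractWebCommand: map known site names
// to explicit URLs so your own code can route or validate commands.
// The SITE_MAP entries and the `site` field are illustrative only.
const SITE_MAP = {
  'new york post': 'https://nypost.com',
  'bbc news': 'https://www.bbc.com/news'
};

function extractWebCommand(transcript) {
  const lower = transcript.toLowerCase();
  for (const [name, url] of Object.entries(SITE_MAP)) {
    if (lower.includes(name)) {
      return { action: transcript, site: url };
    }
  }
  // Fall back to the pass-through behavior from the main script
  return { action: transcript };
}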

Once the web command is set from the audio transcription, we move to the Anchor Browser request. Anchor receives a natural language command (the transcription from Deepgram) and navigates to the user’s specified website. Using computer vision, the AI agent from Anchor analyzes the content of the page and extracts the requested information. Once our agent completes its task, we close the browser session from Anchor.

To test this script with your own audio file, save a new audio snippet (like sample-audio.wav) in your project directory and try a new voice command, such as “What is the latest headline story from BBC News?”

Run the following command to execute the script:

node voice-web-automation.js

Once successful, the script will produce output from Deepgram like the following:

Transcribing audio...
Transcribed: "Visit the news website, The New York Post, and tell me the top three headlines from the home page."
Web command: Visit the news website, The New York Post, and tell me the top three headlines from the home page.
Creating browser session...
Executing web automation...
Browser session cleaned up

=== FINAL RESULT ===
You said: Visit the news website, The New York Post, and tell me the top three headlines from the home page.
Web result: The top three headlines from The New York Post home page are:

...Top three headlines from NY Post

By now you've experienced the power of creating a voice-to-web pipeline and you’re ready to take on even more advanced audio integrations or web automation tasks.


Processing Advanced Voice Commands

Beyond basic transcription, Deepgram offers several features that can improve voice-controlled web automation.

Live Streaming Transcription

Enables real-time voice commands without file uploads.
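
As a minimal sketch, Deepgram's SDK exposes live transcription through listen.live. The audio source is left abstract here, since capturing microphone input varies by platform:

import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';
import dotenv from 'dotenv';

dotenv.config();

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// Open a live transcription connection
const connection = deepgram.listen.live({
  model: 'nova-2',
  language: 'en-US',
  punctuate: true
});

connection.on(LiveTranscriptionEvents.Open, () => {
  console.log('Connection open; stream raw audio chunks with connection.send(chunk)');
});

connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const transcript = data.channel.alternatives[0].transcript;
  if (transcript) {
    console.log('Heard:', transcript);
    // Hand finished transcripts to the web automation step from the main script
  }
});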

Speaker Detection

Lets you distinguish between multiple users in the same audio file.
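
Enabling this is a one-parameter change to the transcribeFile call from the main script. A short sketch, with multi-speaker.wav as a placeholder file name:

// Same client setup as the main script, with diarization enabled
const { result, error } = await deepgramClient.listen.prerecorded.transcribeFile(
  fs.createReadStream('./multi-speaker.wav'),
  { model: 'nova-2', punctuate: true, diarize: true }
);

// With diarize: true, each word carries a speaker index (0, 1, ...)
const words = result.results.channels[0].alternatives[0].words;
for (const word of words) {
  console.log(`Speaker ${word.speaker}: ${word.word}`);
}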

Custom Vocabulary Cues

Improves accuracy for domain-specific terms like company names or technical jargon.
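
With the nova-2 model, boosted terms are passed as "word:boost" strings via the keywords parameter. A brief sketch reusing the client from the main script; the terms and boost values here are just examples:

// Boost recognition of domain-specific terms; higher values make the
// word more likely to appear in the transcript.
const { result, error } = await deepgramClient.listen.prerecorded.transcribeFile(
  fs.createReadStream('./sample-audio.wav'),
  {
    model: 'nova-2',
    punctuate: true,
    keywords: ['Deepgram:2', 'Anchor:2', 'agentic:1.5']
  }
);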


Troubleshooting Deepgram

Audio Quality

Deepgram handles most audio formats well, but clear recordings produce better transcription accuracy. Background noise, multiple speakers talking simultaneously, or very low volume may impact the quality of the audio transcript. Test with different microphone setups to find what works best for your use case.

Transcription Accuracy

While Deepgram's models handle various accents and speaking styles effectively, technical terms, proper nouns, or domain-specific jargon may benefit from the custom vocabulary feature shown in the previous section. If transcription consistently misses certain words, consider adding them to the keywords parameter with higher boost values.


Conclusion

Voice-controlled web browsing represents a fundamental shift in how we access information online. The example script shown here demonstrates the foundation for building personal AI research assistants that can navigate any website through natural speech commands. Instead of manually opening browsers, typing queries, and clicking through multiple pages, you can delegate information gathering to AI agents while focusing on analyzing and acting on the results they provide.

The combination of Deepgram's speech recognition and Anchor Browser's visual web navigation creates a robust pipeline that adapts to website changes automatically. Unlike traditional web scraping, which breaks when page structures change, this approach relies on computer vision and natural language understanding, so it keeps working as layouts evolve.

The real potential emerges when extending this foundation to coordinate multiple browsing tasks simultaneously. Rather than browsing sites sequentially, a set of specialized agents could be deployed in parallel, all triggered by a single voice command.
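
As a rough illustration, assuming the executeWebAutomation helper from the main script, fanning out in parallel could be as simple as a Promise.all over several prompts (the prompts below are arbitrary examples):

// Hypothetical fan-out: each call creates its own browser session,
// so the tasks run concurrently and independently.
const prompts = [
  'Get the top headline from nypost.com',
  'Get the top headline from bbc.com/news',
  'Get the top headline from reuters.com'
];

const results = await Promise.all(
  prompts.map((action) => executeWebAutomation({ action }))
);

results.forEach((r, i) => console.log(`${prompts[i]} =>`, r));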

This voice-first approach to web automation bridges the gap between thinking of a task you want to accomplish and actually completing it, transforming web browsing from an active task requiring manual navigation into a conversational interface where you simply describe what you need and receive structured results.

Have ideas or questions on what to build next? Get in contact with us. We'd be happy to help support your project.
