Web Browsing (Browserbase)

Automate the extraction of financial data from investor relations websites by combining Stagehand for AI-powered browser automation with Reducto for PDF processing.

Overview

This cookbook demonstrates how to:

Navigate to Apple.com’s investor relations section using Stagehand AI actions
Automatically download PDFs when links are clicked
Poll the Browserbase Downloads API until the file is ready
Extract the PDF from the ZIP archive downloaded from Browserbase
Upload the PDF to Reducto and extract structured iPhone net sales data
Output the extracted financial data as formatted JSON

Prerequisites

Before starting, you need:

A Browserbase account with API key and project ID
A Reducto account with API key
A Google API key for Gemini (used by Stagehand)
Python 3.9+

Install the required packages:

pip install browserbase stagehand reductoai python-dotenv

Set your environment variables:

export BROWSERBASE_API_KEY="your-browserbase-api-key"
export BROWSERBASE_PROJECT_ID="your-browserbase-project-id"
export REDUCTO_API_KEY="your-reducto-api-key"
export GOOGLE_API_KEY="your-google-api-key"
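
A missing variable otherwise surfaces later as a `KeyError` in the middle of the workflow, so it can help to check all four up front. The following is a minimal sketch using only the standard library; the helper names `missing_env_vars` and `require_env_vars` are hypothetical and not part of any SDK. If you keep the values in a `.env` file instead, python-dotenv's `load_dotenv()` can load them before this check runs.

```python
import os

# The four variables this cookbook reads.
REQUIRED_VARS = (
    "BROWSERBASE_API_KEY",
    "BROWSERBASE_PROJECT_ID",
    "REDUCTO_API_KEY",
    "GOOGLE_API_KEY",
)


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def require_env_vars(env=None):
    """Raise a single, readable error listing every missing variable."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env_vars()` at the top of your script fails fast, before any browser session is created.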

Step-by-step breakdown

Step 1: Initialize Stagehand and create a session

Stagehand provides AI-powered browser automation on top of Browserbase. Initialize the clients and start a session:

import os
from browserbase import Browserbase
from stagehand import AsyncStagehand

bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])

client = AsyncStagehand(
    browserbase_api_key=os.environ["BROWSERBASE_API_KEY"],
    browserbase_project_id=os.environ["BROWSERBASE_PROJECT_ID"],
    model_api_key=os.environ["GOOGLE_API_KEY"],
)

# `await` must run inside an async function (for example, one started with asyncio.run()).
start_response = await client.sessions.start(model_name="google/gemini-2.5-pro")
session_id = start_response.data.session_id
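
The snippets in this cookbook use `await` directly, which only works inside a coroutine. One way to structure a script is sketched below: wrap the whole workflow in a coroutine, run it with `asyncio.run()`, and use `try`/`finally` so cleanup always happens. The `fake_*` coroutines are stand-ins for illustration only and do not call Stagehand; the sketch assumes your client exposes some call for ending a session, which you would pass as `end_session`.

```python
import asyncio

ENDED = []  # Records which sessions the stand-in cleanup has closed.


async def run_with_session(start_session, work, end_session):
    """Start a session, run `work(session_id)`, and always end the session."""
    session_id = await start_session()
    try:
        return await work(session_id)
    finally:
        # Runs on success and on error, so the remote browser is not left open.
        await end_session(session_id)


# Stand-in coroutines (hypothetical); replace them with real client calls.
async def fake_start():
    return "session-123"


async def fake_work(session_id):
    return f"worked in {session_id}"


async def fake_end(session_id):
    ENDED.append(session_id)


def demo():
    """Run the whole workflow from synchronous code."""
    return asyncio.run(run_with_session(fake_start, fake_work, fake_end))
```

In a real script, `work` would contain the navigation, download, and extraction steps that follow.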

Step 2: Navigate and trigger download with AI actions

Use Stagehand’s AI-powered actions to navigate the page with plain-English instructions:

await client.sessions.navigate(id=session_id, url="https://www.apple.com/")

await client.sessions.act(
    id=session_id, input="Click the 'Investors' button at the bottom of the page"
)
await client.sessions.act(
    id=session_id, input="Scroll down to the Financial Data section of the page"
)
await client.sessions.act(
    id=session_id, input="Under Quarterly Earnings Reports, click on '2025'"
)
await client.sessions.act(
    id=session_id, input="Click the 'Financial Statements' link under Q4"
)

Browserbase automatically captures file downloads when links are clicked.

Step 3: Poll the Downloads API

Browserbase stores downloads and makes them available through the Downloads API. Poll until the download completes:

import asyncio

async def save_downloads_with_retry(bb, session_id, retry_for_seconds=30):
    """Poll until the download archive is ready, save it, and return its size in bytes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + retry_for_seconds

    while loop.time() < deadline:
        # The Browserbase client is synchronous, so run its calls in a worker thread.
        response = await asyncio.to_thread(bb.sessions.downloads.list, session_id)
        download_buffer = await asyncio.to_thread(response.read)

        # An archive with no files is only a few bytes, so treat a larger one as ready.
        if len(download_buffer) > 100:
            with open("downloaded_files.zip", "wb") as f:
                f.write(download_buffer)
            return len(download_buffer)

        await asyncio.sleep(2)

    raise TimeoutError("Download timeout exceeded")
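
The polling loop above treats any response longer than 100 bytes as a finished download. A stricter readiness check, sketched here as one possible alternative, is to confirm that the buffer is a readable ZIP archive containing at least one entry. The helper name `archive_has_files` is hypothetical; it relies only on the standard-library `zipfile` module.

```python
import io
import zipfile


def archive_has_files(buffer: bytes) -> bool:
    """Return True if `buffer` is a readable ZIP archive with at least one entry."""
    stream = io.BytesIO(buffer)
    if not zipfile.is_zipfile(stream):
        return False
    with zipfile.ZipFile(stream) as zf:
        return len(zf.namelist()) > 0
```

You could swap this in for the length comparison, for example `if archive_has_files(download_buffer):`.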

Step 4: Extract PDF from ZIP

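Browserbase returns downloads as a ZIP archive. Pass the archive bytes (for example, the contents of the downloaded_files.zip file saved in Step 3) to a helper that extracts the PDF:

import io
import zipfile

def extract_pdf_from_zip(zip_content):
    """Return the bytes of the first PDF found in the ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_content)) as zf:
        for filename in zf.namelist():
            if filename.lower().endswith(".pdf"):
                return zf.read(filename)
    raise ValueError("No PDF found in the downloaded archive")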
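
If a session triggers more than one download, returning only the first PDF may not be enough. The sketch below is a hypothetical variant, `extract_all_pdfs`, that returns every PDF in the archive keyed by its entry name. It is an illustration built on the standard library, not part of the Browserbase SDK.

```python
import io
import zipfile


def extract_all_pdfs(zip_content: bytes) -> dict[str, bytes]:
    """Map each PDF entry name in the archive to its raw bytes."""
    pdfs = {}
    with zipfile.ZipFile(io.BytesIO(zip_content)) as zf:
        for info in zf.infolist():
            # Skip directory entries and anything that is not a PDF.
            if info.is_dir() or not info.filename.lower().endswith(".pdf"):
                continue
            pdfs[info.filename] = zf.read(info)
    return pdfs
```

To use it with the file saved in Step 3, read `downloaded_files.zip` in binary mode and pass the bytes to the function.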

Step 5: Extract data with Reducto

Upload the PDF to Reducto and extract structured financial data using a schema:

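import os
from reducto import Reducto

reducto = Reducto(api_key=os.environ["REDUCTO_API_KEY"])

# pdf_content holds the PDF bytes returned by extract_pdf_from_zip in Step 4
upload = reducto.upload(file=("report.pdf", pdf_content))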

# Define schema for iPhone net sales extraction
schema = {
    "type": "object",
    "properties": {
        "report_period": {
            "type": "string",
            "description": "The fiscal quarter and year of the report"
        },
        "iphone_net_sales": {
            "type": "object",
            "description": "iPhone net sales figures",
            "properties": {
                "current_quarter": {
                    "type": "number",
                    "description": "iPhone net sales for the current quarter in millions"
                },
                "prior_year_quarter": {
                    "type": "number",
                    "description": "iPhone net sales for the same quarter last year in millions"
                },
                "year_over_year_change": {
                    "type": "number",
                    "description": "Percentage change year over year"
                }
            }
        },
        "total_net_sales": {
            "type": "object",
            "description": "Total company net sales",
            "properties": {
                "current_quarter": {
                    "type": "number",
                    "description": "Total net sales for the current quarter in millions"
                },
                "prior_year_quarter": {
                    "type": "number",
                    "description": "Total net sales for the same quarter last year in millions"
                }
            }
        },
        "iphone_percentage_of_total": {
            "type": "number",
            "description": "iPhone sales as a percentage of total net sales"
        }
    }
}

result = reducto.extract.run(
    input=upload.file_id,
    instructions={"schema": schema}
)

Step 6: Output results

The extracted data is returned as structured JSON matching your schema. The values below are illustrative; the actual output depends on the report you download:

{
  "report_period": "Q1 FY2024",
  "iphone_net_sales": {
    "current_quarter": 69702,
    "prior_year_quarter": 65775,
    "year_over_year_change": 5.97
  },
  "total_net_sales": {
    "current_quarter": 119575,
    "prior_year_quarter": 117154
  },
  "iphone_percentage_of_total": 58.3
}
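
Because the percentage fields can be derived from the raw sales figures, one way to sanity-check an extraction is to recompute them and compare. The sketch below assumes the extracted data has already been converted to a plain Python dict with the same keys as the schema; the function names and the 0.1 percentage-point tolerance are arbitrary choices for illustration.

```python
def derived_metrics(data: dict) -> dict:
    """Recompute the percentage fields from the raw net sales figures."""
    iphone = data["iphone_net_sales"]
    total = data["total_net_sales"]
    change = iphone["current_quarter"] - iphone["prior_year_quarter"]
    return {
        "year_over_year_change": change / iphone["prior_year_quarter"] * 100,
        "iphone_percentage_of_total": iphone["current_quarter"]
        / total["current_quarter"]
        * 100,
    }


def inconsistent_fields(data: dict, tolerance: float = 0.1) -> list:
    """Return the percentage fields that differ from the recomputed values."""
    expected = derived_metrics(data)
    reported = {
        "year_over_year_change": data["iphone_net_sales"]["year_over_year_change"],
        "iphone_percentage_of_total": data["iphone_percentage_of_total"],
    }
    return [
        name
        for name, value in reported.items()
        if abs(value - expected[name]) > tolerance
    ]
```

An empty list means the reported percentages agree with the raw figures within the tolerance; any names in the list are worth checking against the source PDF.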

Full implementation

For the complete implementation with all helper functions, error handling, and Reducto extraction schema, see the full source code on GitHub.

Resources

Browserbase Template

View the original Browserbase template

Source Code

Full source code on GitHub

Extract API

Learn more about Reducto’s Extract API

Array Extraction

Extract multiple records from documents
