Automate the extraction of financial data from investor relations websites by combining Stagehand for AI-powered browser automation with Reducto for PDF processing.

Overview

This cookbook demonstrates how to:
  1. Navigate to Apple.com’s investor relations section using Stagehand AI actions
  2. Automatically download PDFs when links are clicked
  3. Poll the Browserbase Downloads API until the file is ready
  4. Extract the PDF from the ZIP archive downloaded from Browserbase
  5. Upload the PDF to Reducto and extract structured iPhone net sales data
  6. Output the extracted financial data as formatted JSON

Prerequisites

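Before starting, you need API keys for Browserbase, Reducto, and Google (for the Gemini model), plus a Browserbase project ID. Install the required packages: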
pip install browserbase stagehand reductoai python-dotenv
Set your environment variables:
export BROWSERBASE_API_KEY="your-browserbase-api-key"
export BROWSERBASE_PROJECT_ID="your-browserbase-project-id"
export REDUCTO_API_KEY="your-reducto-api-key"
export GOOGLE_API_KEY="your-google-api-key"
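
Checking the environment up front makes a missing key fail fast with a clear message. The sketch below is an illustration rather than part of the cookbook's source: `missing_env_vars` is a hypothetical helper, and it uses only the standard library. If you keep the keys in a `.env` file, call `load_dotenv()` from python-dotenv before running the check.

```python
import os

# The four variables this cookbook reads.
REQUIRED_VARS = [
    "BROWSERBASE_API_KEY",
    "BROWSERBASE_PROJECT_ID",
    "REDUCTO_API_KEY",
    "GOOGLE_API_KEY",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Demonstration with an explicit mapping instead of the real environment.
print(missing_env_vars(REQUIRED_VARS, env={"BROWSERBASE_API_KEY": "abc"}))
# ['BROWSERBASE_PROJECT_ID', 'REDUCTO_API_KEY', 'GOOGLE_API_KEY']
```

In a real script, call `missing_env_vars(REQUIRED_VARS)` with no `env` argument and stop if the returned list is not empty.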

Step-by-step breakdown

Step 1: Initialize Stagehand and create a session

Stagehand provides AI-powered browser automation on top of Browserbase. Initialize the clients and start a session:
import os
from browserbase import Browserbase
from stagehand import AsyncStagehand

bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])

client = AsyncStagehand(
    browserbase_api_key=os.environ["BROWSERBASE_API_KEY"],
    browserbase_project_id=os.environ["BROWSERBASE_PROJECT_ID"],
    model_api_key=os.environ["GOOGLE_API_KEY"],
)

start_response = await client.sessions.start(model_name="google/gemini-2.5-pro")
session_id = start_response.data.session_id

Step 2: Navigate and trigger download with AI actions

Use Stagehand’s AI-powered actions to navigate the site with plain-English instructions:
await client.sessions.navigate(id=session_id, url="https://www.apple.com/")

await client.sessions.act(
    id=session_id, input="Click the 'Investors' button at the bottom of the page"
)
await client.sessions.act(
    id=session_id, input="Scroll down to the Financial Data section of the page"
)
await client.sessions.act(
    id=session_id, input="Under Quarterly Earnings Reports, click on '2025'"
)
await client.sessions.act(
    id=session_id, input="Click the 'Financial Statements' link under Q4"
)
Browserbase automatically captures file downloads when links are clicked.

Step 3: Poll the Downloads API

Browserbase stores downloads and makes them available through the Downloads API. Poll until the download completes:
import asyncio

async def save_downloads_with_retry(bb, session_id, retry_for_seconds=30):
    """Poll the Downloads API until the ZIP contains data, then save it to disk."""
    loop = asyncio.get_running_loop()
    start_time = loop.time()

    while True:
        elapsed = loop.time() - start_time
        if elapsed >= retry_for_seconds:
            raise TimeoutError("Download timeout exceeded")

        # The Browserbase client is synchronous, so run its calls in a worker thread.
        response = await asyncio.to_thread(bb.sessions.downloads.list, session_id)
        download_buffer = await asyncio.to_thread(response.read)

        # A near-empty archive means the download has not landed yet.
        if len(download_buffer) > 100:
            with open("downloaded_files.zip", "wb") as f:
                f.write(download_buffer)
            return len(download_buffer)

        await asyncio.sleep(2)
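
The size threshold in the polling loop is plausible because a ZIP archive with no entries is tiny: it consists of a single 22-byte end-of-central-directory record. Whether Browserbase returns exactly such an empty archive before the download finishes is an assumption here. A stricter alternative, sketched below with the standard library, is to open the buffer and check that it lists at least one file; `zip_has_files` is a hypothetical helper name.

```python
import io
import zipfile

def zip_has_files(data: bytes) -> bool:
    """Return True if data is a readable ZIP archive with at least one entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return len(zf.namelist()) > 0
    except zipfile.BadZipFile:
        return False

# Build an empty archive and a one-file archive in memory to compare them.
empty = io.BytesIO()
with zipfile.ZipFile(empty, "w"):
    pass

filled = io.BytesIO()
with zipfile.ZipFile(filled, "w") as zf:
    zf.writestr("report.pdf", b"%PDF-1.7 placeholder")

print(len(empty.getvalue()))             # 22
print(zip_has_files(empty.getvalue()))   # False
print(zip_has_files(filled.getvalue()))  # True
```

Checking the entry list avoids relying on a magic byte count, at the cost of parsing the archive on every poll.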

Step 4: Extract PDF from ZIP

Browserbase returns downloads as a ZIP archive. Extract the PDF:
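import io
import zipfile

def extract_pdf_from_zip(zip_content):
    """Return the bytes of the first PDF found in the ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_content)) as zf:
        for filename in zf.namelist():
            if filename.lower().endswith(".pdf"):
                return zf.read(filename)
    raise ValueError("No PDF found in the downloaded archive")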
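
Before sending the bytes to Reducto, a quick sanity check can confirm that the extracted entry really is a PDF. PDF files begin with the marker `%PDF-`, so a prefix test is enough for a rough check. The helper name `looks_like_pdf` is hypothetical and not part of the cookbook's source.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Rough check: PDF files start with the '%PDF-' header marker."""
    return data.startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.7\n..."))  # True
print(looks_like_pdf(b"PK\x03\x04..."))  # False (ZIP signature, not a PDF)
```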

Step 5: Extract data with Reducto

Upload the PDF to Reducto and extract structured financial data using a schema:
# `reducto` is assumed to be a Reducto client already created with REDUCTO_API_KEY.
upload = reducto.upload(file=("report.pdf", pdf_content))

# Define schema for iPhone net sales extraction
schema = {
    "type": "object",
    "properties": {
        "report_period": {
            "type": "string",
            "description": "The fiscal quarter and year of the report"
        },
        "iphone_net_sales": {
            "type": "object",
            "description": "iPhone net sales figures",
            "properties": {
                "current_quarter": {
                    "type": "number",
                    "description": "iPhone net sales for the current quarter in millions"
                },
                "prior_year_quarter": {
                    "type": "number",
                    "description": "iPhone net sales for the same quarter last year in millions"
                },
                "year_over_year_change": {
                    "type": "number",
                    "description": "Percentage change year over year"
                }
            }
        },
        "total_net_sales": {
            "type": "object",
            "description": "Total company net sales",
            "properties": {
                "current_quarter": {
                    "type": "number",
                    "description": "Total net sales for the current quarter in millions"
                },
                "prior_year_quarter": {
                    "type": "number",
                    "description": "Total net sales for the same quarter last year in millions"
                }
            }
        },
        "iphone_percentage_of_total": {
            "type": "number",
            "description": "iPhone sales as a percentage of total net sales"
        }
    }
}

result = reducto.extract.run(
    input=upload.file_id,
    instructions={"schema": schema}
)

Step 6: Output results

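The extracted data is returned as structured JSON matching your schema. The sample below illustrates the output shape with figures from a Q1 FY2024 report; the values will differ for the report you download: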
{
  "report_period": "Q1 FY2024",
  "iphone_net_sales": {
    "current_quarter": 69702,
    "prior_year_quarter": 65775,
    "year_over_year_change": 5.97
  },
  "total_net_sales": {
    "current_quarter": 119575,
    "prior_year_quarter": 117154
  },
  "iphone_percentage_of_total": 58.3
}
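
Because `year_over_year_change` and `iphone_percentage_of_total` are derived from the other fields, they can be recomputed as a cross-check on the extraction. The sketch below does this for the sample values above and prints the result as formatted JSON. The function name `check_derived_fields` and the tolerance are illustrative choices, not part of the cookbook's source.

```python
import json

def check_derived_fields(data: dict, tolerance: float = 0.05) -> dict:
    """Recompute the derived percentages and compare them with the extracted ones."""
    iphone = data["iphone_net_sales"]
    total = data["total_net_sales"]
    current, prior = iphone["current_quarter"], iphone["prior_year_quarter"]
    yoy = (current - prior) / prior * 100
    share = current / total["current_quarter"] * 100
    return {
        "year_over_year_change": round(yoy, 2),
        "iphone_percentage_of_total": round(share, 1),
        "consistent": (
            abs(yoy - iphone["year_over_year_change"]) <= tolerance
            and abs(share - data["iphone_percentage_of_total"]) <= tolerance
        ),
    }

# Sample values copied from the JSON output above.
extracted = {
    "report_period": "Q1 FY2024",
    "iphone_net_sales": {
        "current_quarter": 69702,
        "prior_year_quarter": 65775,
        "year_over_year_change": 5.97,
    },
    "total_net_sales": {"current_quarter": 119575, "prior_year_quarter": 117154},
    "iphone_percentage_of_total": 58.3,
}

print(json.dumps(check_derived_fields(extracted), indent=2))
```

For the sample values, the recomputed figures are 5.97 and 58.3, so `consistent` is true. A false value would suggest that the extraction misread one of the numbers.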

Full implementation

For the complete implementation with all helper functions, error handling, and Reducto extraction schema, see the full source code on GitHub.

Resources