# Web Browsing (Browserbase)

> Extract financial data from investor relations PDFs using Stagehand for AI-powered web browsing and Reducto for document processing

Automate the extraction of financial data from investor relations websites by combining Stagehand for AI-powered browser automation with Reducto for PDF processing.

***

## Overview

This cookbook demonstrates how to:

1. Navigate to Apple.com's investor relations section using Stagehand AI actions
2. Trigger the PDF download by clicking the report link (Browserbase captures the file automatically)
3. Poll the Browserbase Downloads API until the file is ready
4. Extract the PDF from the ZIP archive downloaded from Browserbase
5. Upload the PDF to Reducto and extract structured iPhone net sales data
6. Output the extracted financial data as formatted JSON

***

## Prerequisites

Before starting, you need:

* A [Browserbase account](https://browserbase.com) with API key and project ID
* A [Reducto account](https://studio.reducto.ai) with API key
* A Google API key for Gemini (used by Stagehand)
* Python 3.9+

Install the required packages:

```bash theme={null}
pip install browserbase stagehand reductoai python-dotenv
```

Set your environment variables:

```bash theme={null}
export BROWSERBASE_API_KEY="your-browserbase-api-key"
export BROWSERBASE_PROJECT_ID="your-browserbase-project-id"
export REDUCTO_API_KEY="your-reducto-api-key"
export GOOGLE_API_KEY="your-google-api-key"
```
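A missing variable otherwise surfaces only as a `KeyError` partway through the run, so it can help to check all four up front. The following is a minimal sketch; `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names introduced here for illustration and are not part of the Browserbase template.

```python
import os
from typing import List, Mapping

# The four variables exported above.
REQUIRED_ENV_VARS = (
    "BROWSERBASE_API_KEY",
    "BROWSERBASE_PROJECT_ID",
    "REDUCTO_API_KEY",
    "GOOGLE_API_KEY",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Example usage at the top of a script (hypothetical):
# missing = missing_env_vars()
# if missing:
#     raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Passing the environment in as a mapping keeps the check easy to test without touching the real process environment.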

***

## Step-by-step breakdown

### Step 1: Initialize Stagehand and create a session

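Stagehand provides AI-powered browser automation on top of Browserbase. Initialize the clients and start a session. The snippets in this guide use `await`, so run them inside an `async` function (for example with `asyncio.run`) or in an environment that supports top-level `await`: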

```python theme={null}
import os
from browserbase import Browserbase
from stagehand import AsyncStagehand

bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])

client = AsyncStagehand(
    browserbase_api_key=os.environ["BROWSERBASE_API_KEY"],
    browserbase_project_id=os.environ["BROWSERBASE_PROJECT_ID"],
    model_api_key=os.environ["GOOGLE_API_KEY"],
)

start_response = await client.sessions.start(model_name="google/gemini-2.5-pro")
session_id = start_response.data.session_id
```

### Step 2: Navigate and trigger download with AI actions

Use Stagehand's `act` calls to drive the page with plain-English instructions:

```python theme={null}
await client.sessions.navigate(id=session_id, url="https://www.apple.com/")

await client.sessions.act(
    id=session_id, input="Click the 'Investors' button at the bottom of the page"
)
await client.sessions.act(
    id=session_id, input="Scroll down to the Financial Data section of the page"
)
await client.sessions.act(
    id=session_id, input="Under Quarterly Earnings Reports, click on '2025'"
)
await client.sessions.act(
    id=session_id, input="Click the 'Financial Statements' link under Q4"
)
```

When a clicked link triggers a file download, Browserbase captures the file automatically.
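Because each instruction depends on the previous one succeeding, it can be useful to run them from a list and report which step failed. The sketch below is one way to do that; `run_actions` and `INSTRUCTIONS` are hypothetical helpers, and `act` is any async callable that takes the instruction text.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# The same four instructions used in the calls above.
INSTRUCTIONS = [
    "Click the 'Investors' button at the bottom of the page",
    "Scroll down to the Financial Data section of the page",
    "Under Quarterly Earnings Reports, click on '2025'",
    "Click the 'Financial Statements' link under Q4",
]


async def run_actions(
    act: Callable[[str], Awaitable[object]],
    instructions: Sequence[str],
) -> List[str]:
    """Run each instruction in order; name the failing step if one raises."""
    completed: List[str] = []
    for step, instruction in enumerate(instructions, start=1):
        try:
            await act(instruction)
        except Exception as exc:
            raise RuntimeError(f"Step {step} failed: {instruction!r}") from exc
        completed.append(instruction)
    return completed
```

With the session from Step 1, this could be called as `await run_actions(lambda text: client.sessions.act(id=session_id, input=text), INSTRUCTIONS)`.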

### Step 3: Poll the Downloads API

Browserbase stores downloads and makes them available through the Downloads API. Poll until the download completes:

```python theme={null}
import asyncio

async def save_downloads_with_retry(bb, session_id, retry_for_seconds=30):
    """Poll the Downloads API until the ZIP has content, then save it to disk."""
    loop = asyncio.get_running_loop()
    start_time = loop.time()

    while True:
        elapsed = loop.time() - start_time
        if elapsed >= retry_for_seconds:
            raise TimeoutError("Download timeout exceeded")

        # The Browserbase client is synchronous, so run its calls in a worker thread.
        response = await asyncio.to_thread(bb.sessions.downloads.list, session_id)
        download_buffer = await asyncio.to_thread(response.read)

        # Heuristic: an empty ZIP archive is only 22 bytes, so anything larger
        # than 100 bytes is treated as containing a downloaded file.
        if len(download_buffer) > 100:
            with open("downloaded_files.zip", "wb") as f:
                f.write(download_buffer)
            return len(download_buffer)

        await asyncio.sleep(2)
```
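The polling loop decides readiness from the buffer size alone. A stricter alternative, sketched here as an assumption rather than as part of the template, is to open the buffer as a ZIP and check that it lists at least one entry; `zip_has_files` is a hypothetical helper name.

```python
import io
import zipfile


def zip_has_files(buffer: bytes) -> bool:
    """Return True if buffer is a readable ZIP archive with at least one entry."""
    stream = io.BytesIO(buffer)
    if not zipfile.is_zipfile(stream):
        return False
    stream.seek(0)
    try:
        with zipfile.ZipFile(stream) as zf:
            return len(zf.namelist()) > 0
    except zipfile.BadZipFile:
        # Treat a truncated or corrupt archive as "not ready yet".
        return False
```

A check like this could stand in for the length comparison in the loop, at the cost of parsing the archive on every poll.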

### Step 4: Extract PDF from ZIP

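Browserbase returns downloads as a ZIP archive. Pass the archive bytes (for example, the contents of `downloaded_files.zip`) to a helper that returns the first PDF it finds: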

```python theme={null}
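import io
import zipfile

def extract_pdf_from_zip(zip_content):
    """Return the bytes of the first PDF found in the ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_content)) as zf:
        for filename in zf.namelist():
            if filename.lower().endswith(".pdf"):
                return zf.read(filename)
    raise ValueError("No PDF found in the downloaded archive")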
```

### Step 5: Extract data with Reducto

Upload the PDF to Reducto and extract structured financial data using a schema:

```python theme={null}
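import os
from reducto import Reducto

reducto = Reducto(api_key=os.environ["REDUCTO_API_KEY"])

# pdf_content holds the PDF bytes returned by extract_pdf_from_zip in Step 4
upload = reducto.upload(file=("report.pdf", pdf_content))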

# Define schema for iPhone net sales extraction
schema = {
    "type": "object",
    "properties": {
        "report_period": {
            "type": "string",
            "description": "The fiscal quarter and year of the report"
        },
        "iphone_net_sales": {
            "type": "object",
            "description": "iPhone net sales figures",
            "properties": {
                "current_quarter": {
                    "type": "number",
                    "description": "iPhone net sales for the current quarter in millions"
                },
                "prior_year_quarter": {
                    "type": "number",
                    "description": "iPhone net sales for the same quarter last year in millions"
                },
                "year_over_year_change": {
                    "type": "number",
                    "description": "Percentage change year over year"
                }
            }
        },
        "total_net_sales": {
            "type": "object",
            "description": "Total company net sales",
            "properties": {
                "current_quarter": {
                    "type": "number",
                    "description": "Total net sales for the current quarter in millions"
                },
                "prior_year_quarter": {
                    "type": "number",
                    "description": "Total net sales for the same quarter last year in millions"
                }
            }
        },
        "iphone_percentage_of_total": {
            "type": "number",
            "description": "iPhone sales as a percentage of total net sales"
        }
    }
}

result = reducto.extract.run(
    input=upload.file_id,
    instructions={"schema": schema}
)
```

### Step 6: Output results

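The extracted data is returned as structured JSON matching your schema. The sample below shows the output shape for a Q1 FY2024 report; the values you get depend on the report you download: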

```json theme={null}
{
  "report_period": "Q1 FY2024",
  "iphone_net_sales": {
    "current_quarter": 69702,
    "prior_year_quarter": 65775,
    "year_over_year_change": 5.97
  },
  "total_net_sales": {
    "current_quarter": 119575,
    "prior_year_quarter": 117154
  },
  "iphone_percentage_of_total": 58.3
}
```
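The two percentage fields can be recomputed from the raw sales figures, which gives a cheap sanity check on the extraction. The sketch below works on a plain dict shaped like the sample above; `check_derived_fields` is a hypothetical helper, and how you obtain that dict from the Reducto response is not shown here.

```python
import json

# Sample output copied from the JSON above.
sample = {
    "report_period": "Q1 FY2024",
    "iphone_net_sales": {
        "current_quarter": 69702,
        "prior_year_quarter": 65775,
        "year_over_year_change": 5.97,
    },
    "total_net_sales": {"current_quarter": 119575, "prior_year_quarter": 117154},
    "iphone_percentage_of_total": 58.3,
}


def check_derived_fields(data: dict, tolerance: float = 0.1) -> dict:
    """Recompute the percentage fields and compare them with the extracted ones."""
    iphone = data["iphone_net_sales"]
    total = data["total_net_sales"]
    prior = iphone["prior_year_quarter"]
    yoy = (iphone["current_quarter"] - prior) / prior * 100
    share = iphone["current_quarter"] / total["current_quarter"] * 100
    consistent = (
        abs(yoy - iphone["year_over_year_change"]) <= tolerance
        and abs(share - data["iphone_percentage_of_total"]) <= tolerance
    )
    return {
        "year_over_year_change": round(yoy, 2),
        "iphone_percentage_of_total": round(share, 1),
        "consistent": consistent,
    }


print(json.dumps(check_derived_fields(sample), indent=2))
```

For the sample, the recomputed values are 5.97 and 58.3, matching the extracted fields. The `tolerance` default of 0.1 percentage points is an arbitrary choice to absorb rounding.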

## Full implementation

For the complete implementation with all helper functions, error handling, and Reducto extraction schema, see the [full source code on GitHub](https://github.com/browserbase/templates/tree/dev/python/browserbase-reducto).

***

## Resources

<CardGroup cols={2}>
  <Card title="Browserbase Template" icon="browser" href="https://www.browserbase.com/templates/browserbase-reducto">
    View the original Browserbase template
  </Card>

  <Card title="Source Code" icon="github" href="https://github.com/browserbase/templates/tree/dev/python/browserbase-reducto">
    Full source code on GitHub
  </Card>

  <Card title="Extract API" icon="file-export" href="/extract/overview">
    Learn more about Reducto's Extract API
  </Card>

  <Card title="Array Extraction" icon="list" href="/configs/extract/array-extraction">
    Extract multiple records from documents
  </Card>
</CardGroup>
