This guide walks you through using the Reducto API to parse your first document in under five minutes and extract structured JSON that can be passed to LLMs or processed further.

What we’re going to parse

We’ll use a financial statement PDF that contains multiple tables, headers, account summaries, and formatted text. This is the kind of complex document that’s difficult to process manually but straightforward with Reducto. Download the sample Finance Statement PDF to follow along.
What we want to extract:
  • The portfolio value table with beginning and ending values
  • Account information including account numbers and types
  • Income summary broken down by tax category
  • Top holdings with values and percentages
By the end of this guide, you’ll have all of this data in structured JSON that you can use in your application.

Prerequisites

1. Create a Reducto account

Go to studio.reducto.ai and sign up for a free account.
2. Get your API key

In the Studio sidebar, click API Keys, then Create new API key. Give it a name and copy the key.

3. Set your API key as an environment variable

This allows the SDK to authenticate automatically without hardcoding the key in your code.
export REDUCTO_API_KEY="your_api_key_here"
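If you’d rather not rely on the environment lookup, you can also pass the key to the client explicitly once the SDK is installed. A minimal sketch, assuming the Reducto() constructor accepts an api_key argument:
import os
from reducto import Reducto

# Read the key yourself and pass it explicitly instead of relying on auto-detection
client = Reducto(api_key=os.environ["REDUCTO_API_KEY"])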

Install the SDK

Choose your language and install the Reducto SDK. For Python:
pip install reductoai

Parse the document

Now let’s write the code to parse our financial statement. We’ll go through each part step by step.
1. Import the SDK and initialize the client

First, we import the Reducto client and the Path class for handling file paths. When you create a Reducto() client without passing an API key, it automatically reads from the REDUCTO_API_KEY environment variable you set earlier.
from pathlib import Path
from reducto import Reducto

# The client reads REDUCTO_API_KEY from your environment
client = Reducto()
2. Upload your document

Before parsing, you need to upload the document to Reducto’s servers. The upload() method accepts a file path and returns a reference that you’ll use in the next step.
# Upload the PDF file to Reducto
upload = client.upload(file=Path("finance-statement.pdf"))
print(f"Uploaded: {upload}")
You can also pass a URL directly to the parse method if your document is already hosted somewhere accessible, like an S3 bucket.
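A minimal sketch for that case, assuming the input parameter also accepts a document URL string (the URL below is a placeholder):
# Parse a hosted document without uploading it first
result = client.parse.run(input="https://example.com/finance-statement.pdf")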
3. Parse the document

Now we call the parse.run() method with the uploaded file reference. This sends the document through Reducto’s processing pipeline, which runs OCR, detects layout, extracts tables, and structures everything into chunks.
# Parse the uploaded document
result = client.parse.run(input=upload)

# Check what we got back
print(f"Job ID: {result.job_id}")
print(f"Pages processed: {result.usage.num_pages}")
print(f"Credits used: {result.usage.credits}")
print(f"Number of chunks: {len(result.result.chunks)}")
4. Access the extracted content

The response contains chunks, which are logical sections of the document. Each chunk has a content field with the full text and a blocks field with individual elements like tables, headers, and paragraphs.
# Loop through each chunk
for i, chunk in enumerate(result.result.chunks):
    print(f"\n=== Chunk {i + 1} ===")
    print(chunk.content[:500])  # First 500 characters
    
    # Look at individual blocks within this chunk
    for block in chunk.blocks:
        print(f"  [{block.type}] on page {block.bbox.page}")
        
        # Tables are returned as HTML by default
        if block.type == "Table":
            print(f"  Table content: {block.content[:200]}...")
Each block has a type that tells you what kind of content it is: Title, Header, Text, Table, Figure, Key Value, and others. The bbox field contains the bounding box coordinates so you know exactly where on the page this content came from.
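For example, using only the fields shown above, you can filter blocks by type to pull out every table along with the page it appears on:
# Collect (page, content) pairs for every table block in the document
tables = [
    (block.bbox.page, block.content)
    for chunk in result.result.chunks
    for block in chunk.blocks
    if block.type == "Table"
]

for page, content in tables:
    print(f"Table on page {page}:\n{content}\n")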
Complete code:
from pathlib import Path
from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("finance-statement.pdf"))
result = client.parse.run(input=upload)

print(f"Processed {result.usage.num_pages} pages")

for chunk in result.result.chunks:
    print(chunk.content)
    for block in chunk.blocks:
        if block.type == "Table":
            print(f"Found table on page {block.bbox.page}")

Understanding the response

Here’s what we got back from parsing our financial statement:
{
  "job_id": "5df31070-8d98-4caa-9a5b-c5c511a03f71",
  "duration": 11.35,
  "usage": {
    "num_pages": 3,
    "credits": 4.0
  },
  "result": {
    "chunks": [
      {
        "content": "# *** SAMPLE STATEMENT ***\nFor informational purposes only\n\nFidelity\nINVESTMENTS\n\n## Your Portfolio Value:\n\n$274,222.20\n\n|                                   | This Period   | Year-to-Date   |\n|-|-|-|\n| Beginning Portfolio Value         | $253,221.83   | $232,643.16    |\n| Additions                         | 59,269.64     | 121,433.55     |...",
        "blocks": [
          {
            "type": "Title",
            "content": "*** SAMPLE STATEMENT ***\nFor informational purposes only",
            "bbox": {"page": 1, "left": 0.351, "top": 0.029, "width": 0.296, "height": 0.057},
            "confidence": "high"
          },
          {
            "type": "Section Header",
            "content": "Your Portfolio Value:",
            "bbox": {"page": 1, "left": 0.517, "top": 0.163, "width": 0.153, "height": 0.015},
            "confidence": "high"
          },
          {
            "type": "Table",
            "content": "|                                   | This Period   | Year-to-Date   |\n|-|-|-|\n| Beginning Portfolio Value         | $253,221.83   | $232,643.16    |\n| Additions                         | 59,269.64     | 121,433.55     |\n| Subtractions                      | -45,430.74    | -98,912.58     |\n| Transaction Costs, Fees & Charges | -139.77       | -625.87        |\n| Change in Investment Value*       | 7,161.47      | 19,058.07      |\n| Ending Portfolio Value**          | $274,222.20   | $274,222.20    |",
            "bbox": {"page": 1, "left": 0.516, "top": 0.261, "width": 0.444, "height": 0.158},
            "confidence": "high"
          }
        ]
      }
    ]
  },
  "studio_link": "https://studio.reducto.ai/job/5df31070-8d98-4caa-9a5b-c5c511a03f71"
}
Key fields:
  • job_id: Unique identifier for this job. Use it to retrieve results later or debug in Studio.
  • usage.num_pages: Number of pages that were processed.
  • usage.credits: Credits consumed by this request.
  • chunks: Logical sections of the document, optimized for feeding into LLMs.
  • chunks[].content: The full text content of this chunk.
  • chunks[].blocks: Individual elements (tables, headers, text) with their types and positions.
  • blocks[].type: What kind of element this is: Title, Table, Header, Text, Figure, etc.
  • blocks[].bbox: Bounding box with normalized coordinates (0-1) showing where this element appears on the page (see the helper sketch below).
  • studio_link: Direct link to view this job in Reducto Studio for visual debugging.
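Because bbox values are normalized to the page, a small helper can convert them to pixel coordinates when you want to highlight a region on a rendered page image. A sketch (the 1700x2200 page size is illustrative):
def bbox_to_pixels(bbox, page_width_px, page_height_px):
    """Convert a normalized bbox (0-1) to pixel coordinates for a rendered page."""
    return {
        "left": bbox.left * page_width_px,
        "top": bbox.top * page_height_px,
        "width": bbox.width * page_width_px,
        "height": bbox.height * page_height_px,
    }

# For the portfolio value table above, rendered at 1700x2200 px:
# bbox_to_pixels(table_block.bbox, 1700, 2200)
# -> {'left': 877.2, 'top': 574.2, 'width': 754.8, 'height': 347.6}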

Customizing the output

The default settings work well for most documents, but you can customize the parsing behavior for specific use cases.
result = client.parse.run(
    input=upload,
    enhance={
        # Use AI to clean up OCR errors in scanned documents
        "agentic": [{"scope": "text"}],
        # Generate descriptions for charts and images
        "summarize_figures": True
    },
    formatting={
        # Get tables as HTML, markdown, json, or csv
        "table_output_format": "markdown"
    },
    settings={
        # Only process pages 1-5
        "page_range": {"start": 1, "end": 5}
    }
)
What these options do:
  • enhance.agentic: Runs AI-powered cleanup on the specified scope. Use "text" for OCR correction on scanned documents, or "table" to improve table structure detection.
  • enhance.summarize_figures: Generates natural language descriptions of charts, graphs, and images. Useful for RAG pipelines where you need to search figure content.
  • formatting.table_output_format: Controls how tables are returned. Options are html, markdown, json, csv, or ai_json for complex tables that need AI reconstruction.
  • settings.page_range: Limits processing to specific pages. Useful for large documents where you only need certain sections.
For the full list of options, see the Parse configuration reference.
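For example, to see how table_output_format changes what comes back, you can re-parse the same upload with markdown tables and print only the Table blocks:
# Re-parse with markdown tables to compare against the default HTML output
result_md = client.parse.run(
    input=upload,
    formatting={"table_output_format": "markdown"},
)

for chunk in result_md.result.chunks:
    for block in chunk.blocks:
        if block.type == "Table":
            print(f"Markdown table on page {block.bbox.page}:\n{block.content}\n")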

What’s next

Now that you can parse documents, explore the other Reducto endpoints.

Troubleshooting

  • Authentication errors: your API key is missing or invalid. Check that the REDUCTO_API_KEY environment variable is set correctly and that the key hasn’t expired in Studio.
  • Tables come back incomplete or misaligned: some complex tables need extra help. Try setting formatting.table_output_format to "ai_json", or enable enhance.agentic with [{"scope": "table"}] for AI-powered table reconstruction.
  • Text is garbled or missing: for scanned documents or low-quality PDFs, enable the agentic text enhancement: enhance.agentic: [{"scope": "text"}]. If the document is password-protected, pass the password in settings.document_password (see the sketch below). This may also be due to bad metadata polluting the output; in that case, reach out to Reducto support.
  • Debugging: every response includes a studio_link that opens the job in Reducto Studio. Use it to visually inspect what was extracted and debug any issues.
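A hedged sketch that pulls these options together (whether they can all be combined in a single request is an assumption to verify against the Parse configuration reference; the password value is a placeholder):
result = client.parse.run(
    input=upload,
    enhance={"agentic": [{"scope": "text"}]},             # OCR cleanup for scanned or low-quality PDFs
    formatting={"table_output_format": "ai_json"},        # AI reconstruction for complex tables
    settings={"document_password": "your_pdf_password"},  # only needed for protected PDFs
)

# Every response links to Studio for visually inspecting what was extracted
print(result.studio_link)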