Extract text, tables, and figures from documents using the Reducto API.
This guide walks you through parsing your first document with the Reducto API in about five minutes, turning it into structured JSON that can be passed to LLMs or processed further.
We’ll use a financial statement PDF that contains multiple tables, headers, account summaries, and formatted text. This is the kind of complex document that’s difficult to process manually but straightforward with Reducto. Download the sample PDF to follow along.

What we want to extract:
The portfolio value table with beginning and ending values
Account information including account numbers and types
Income summary broken down by tax category
Top holdings with values and percentages
By the end of this guide, you’ll have all of this data in structured JSON that you can use in your application.
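To make that concrete, here is a heavily simplified, illustrative sketch of the shape of the JSON the parse endpoint returns; the exact fields are covered step by step below.

{
  "result": {
    "chunks": [
      {
        "content": "Portfolio Value ...",
        "blocks": [
          { "type": "Table", "content": "<table>...</table>", "bbox": { "page": 1 } }
        ]
      }
    ]
  }
}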
Now let’s write the code to parse our financial statement. We’ll go through each part step by step.
Python
1. Import the SDK and initialize the client
First, we import the Reducto client and the Path class for handling file paths. When you create a Reducto() client without passing an API key, it automatically reads from the REDUCTO_API_KEY environment variable you set earlier.
from pathlib import Path

from reducto import Reducto

# The client reads REDUCTO_API_KEY from your environment
client = Reducto()
2. Upload your document
Before parsing, you need to upload the document to Reducto’s servers. The upload() method accepts a file path and returns a reference that you’ll use in the next step.
# Upload the PDF file to Reducto
upload = client.upload(file=Path("finance-statement.pdf"))
print(f"Uploaded: {upload}")
You can also pass a URL directly to the parse method if your document is already hosted somewhere accessible, like an S3 bucket.
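For example, a minimal sketch of that variant; the URL below is only a placeholder for wherever your document lives:

# Parse a document that's already hosted at an accessible URL
# (placeholder URL for illustration)
result = client.parse.run(input="https://example.com/finance-statement.pdf")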
3. Parse the document
Now we call the parse.run() method with the uploaded file reference. This sends the document through Reducto’s processing pipeline, which runs OCR, detects layout, extracts tables, and structures everything into chunks.
# Parse the uploaded document
result = client.parse.run(input=upload)

# Check what we got back
print(f"Job ID: {result.job_id}")
print(f"Pages processed: {result.usage.num_pages}")
print(f"Credits used: {result.usage.credits}")
print(f"Number of chunks: {len(result.result.chunks)}")
4. Access the extracted content
The response contains chunks, which are logical sections of the document. Each chunk has a content field with the full text and a blocks field with individual elements like tables, headers, and paragraphs.
# Loop through each chunk
for i, chunk in enumerate(result.result.chunks):
    print(f"\n=== Chunk {i + 1} ===")
    print(chunk.content[:500])  # First 500 characters

    # Look at individual blocks within this chunk
    for block in chunk.blocks:
        print(f"  [{block.type}] on page {block.bbox.page}")

        # Tables are returned as HTML by default
        if block.type == "Table":
            print(f"  Table content: {block.content[:200]}...")
Each block has a type that tells you what kind of content it is: Title, Header, Text, Table, Figure, Key Value, and others. The bbox field contains the bounding box coordinates so you know exactly where on the page this content came from.
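As a quick illustration of working with those fields, here is one way you might collect just the table blocks, grouped by page, using only the fields described above:

# Group table blocks by page number so you can see where each table lives
tables_by_page = {}
for chunk in result.result.chunks:
    for block in chunk.blocks:
        if block.type == "Table":
            tables_by_page.setdefault(block.bbox.page, []).append(block.content)

for page, tables in sorted(tables_by_page.items()):
    print(f"Page {page}: {len(tables)} table(s)")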
Complete code:
from pathlib import Path

from reducto import Reducto

client = Reducto()

upload = client.upload(file=Path("finance-statement.pdf"))
result = client.parse.run(input=upload)

print(f"Processed {result.usage.num_pages} pages")

for chunk in result.result.chunks:
    print(chunk.content)
    for block in chunk.blocks:
        if block.type == "Table":
            print(f"Found table on page {block.bbox.page}")
Node.js

1. Import the SDK and initialize the client
Import the Reducto client and the fs module for reading files. The client automatically uses the REDUCTO_API_KEY environment variable for authentication.
import Reducto from 'reductoai';
import fs from 'fs';

// The client reads REDUCTO_API_KEY from your environment
const client = new Reducto();
2. Upload your document
Use createReadStream to upload the file to Reducto. This returns a reference you’ll use when calling the parse endpoint.
// Upload the PDF file to Reducto
const upload = await client.upload({ file: fs.createReadStream("finance-statement.pdf") });
console.log(`Uploaded: ${upload}`);
3. Parse the document
Call parse.run() with the uploaded file reference. Reducto processes the document and returns structured content.
// Parse the uploaded document
const result = await client.parse.run({ input: upload });

console.log(`Job ID: ${result.job_id}`);
console.log(`Pages processed: ${result.usage.num_pages}`);
console.log(`Credits used: ${result.usage.credits}`);
console.log(`Number of chunks: ${result.result.chunks.length}`);
4. Access the extracted content
Loop through the chunks and blocks to access the extracted text, tables, and other elements.
// Loop through each chunk
for (let i = 0; i < result.result.chunks.length; i++) {
  const chunk = result.result.chunks[i];
  console.log(`\n=== Chunk ${i + 1} ===`);
  console.log(chunk.content.substring(0, 500));

  // Look at individual blocks within this chunk
  for (const block of chunk.blocks) {
    console.log(`  [${block.type}] on page ${block.bbox.page}`);

    if (block.type === "Table") {
      console.log(`  Table content: ${block.content.substring(0, 200)}...`);
    }
  }
}
Complete code:
import Reducto from 'reductoai';
import fs from 'fs';

const client = new Reducto();

async function main() {
  const upload = await client.upload({ file: fs.createReadStream("finance-statement.pdf") });
  const result = await client.parse.run({ input: upload });

  console.log(`Processed ${result.usage.num_pages} pages`);

  for (const chunk of result.result.chunks) {
    console.log(chunk.content);
    for (const block of chunk.blocks) {
      if (block.type === "Table") {
        console.log(`Found table on page ${block.bbox.page}`);
      }
    }
  }
}

main();
Go

1. Import the SDK and initialize the client
Import the Reducto client and the option package for configuration. The Go SDK requires you to pass the API key explicitly using option.WithAPIKey().
package main

import (
    "context"
    "fmt"
    "io"
    "os"

    reducto "github.com/reductoai/reducto-go-sdk"
    "github.com/reductoai/reducto-go-sdk/option"
    "github.com/reductoai/reducto-go-sdk/shared"
)

func main() {
    // Initialize client with API key from environment
    client := reducto.NewClient(option.WithAPIKey(os.Getenv("REDUCTO_API_KEY")))
}
2. Upload your document
Open the file and upload it to Reducto. The upload returns a file ID that you’ll use for parsing.

3. Parse the document
Call Parse.Run() with the file ID. The Go SDK requires you to wrap the file ID with shared.UnionString() and then with reducto.F[...]() because the SDK uses strongly-typed union parameters.
result, err := client.Parse.Run(context.Background(), reducto.ParseRunParams{
    ParseConfig: reducto.ParseConfigParam{
        // The file ID must be wrapped in shared.UnionString() and reducto.F[...]()
        DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
            shared.UnionString(upload.FileID),
        ),
    },
})
if err != nil {
    fmt.Printf("Parse error: %v\n", err)
    return
}

fmt.Printf("Job ID: %s\n", result.JobID)
fmt.Printf("Pages: %d\n", result.Usage.NumPages)

// Note: To view in Studio, construct the URL: https://studio.reducto.ai/job/{job_id}
4. Access the extracted content
The result contains chunks with extracted content. The Chunks field is typed as interface{}, so you need to type assert it to []shared.ParseResponseResultFullResultChunk before you can iterate over it. When checking block types, use the SDK constants instead of string comparisons.
if result.Result.Type == shared.ParseResponseResultTypeFull {
    // Type assert Chunks from interface{} to the actual type
    chunks, ok := result.Result.Chunks.([]shared.ParseResponseResultFullResultChunk)
    if ok {
        for _, chunk := range chunks {
            fmt.Println(chunk.Content)
            for _, block := range chunk.Blocks {
                // Use SDK constants for block type comparisons
                if block.Type == shared.ParseResponseResultFullResultChunksBlocksTypeTable {
                    fmt.Printf("Found table on page %d\n", block.Bbox.Page)
                }
            }
        }
    }
}
The default settings work well for most documents, but you can customize the parsing behavior for specific use cases.
result = client.parse.run(
    input=upload,
    enhance={
        # Use AI to clean up OCR errors in scanned documents
        "agentic": [{"scope": "text"}],
        # Generate descriptions for charts and images
        "summarize_figures": True,
    },
    formatting={
        # Get tables as HTML, markdown, json, or csv
        "table_output_format": "markdown"
    },
    settings={
        # Only process pages 1-5
        "page_range": {"start": 1, "end": 5}
    },
)
enhance.agentic: Runs AI-powered cleanup on the specified scope. Use "text" for OCR correction on scanned documents, or "table" to improve table structure detection.
enhance.summarize_figures: Generates natural language descriptions of charts, graphs, and images. Useful for RAG pipelines where you need to search figure content.
formatting.table_output_format: Controls how tables are returned. Options are html, markdown, json, csv, or ai_json for complex tables that need AI reconstruction.
settings.page_range: Limits processing to specific pages. Useful for large documents where you only need certain sections.
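As one example of combining these options: with table_output_format set to "markdown", you could gather the extracted tables into a single prompt string for an LLM. This is only a sketch, and the prompt wording is an illustration rather than part of the API:

# Collect markdown-formatted tables and assemble them into an LLM prompt
markdown_tables = []
for chunk in result.result.chunks:
    for block in chunk.blocks:
        if block.type == "Table":
            markdown_tables.append(block.content)

# The prompt text below is just an example
prompt = "Summarize the portfolio value and income tables:\n\n" + "\n\n".join(markdown_tables)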
Troubleshooting

You get an authentication error
This means your API key is missing or invalid. Check that the REDUCTO_API_KEY environment variable is set correctly and that the key hasn’t expired in Studio.
Tables aren't structured correctly
Some complex tables need extra help. Try setting formatting.table_output_format to "ai_json", or enable enhance.agentic with [{"scope": "table"}] for AI-powered table reconstruction.
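For example, here is a sketch that turns both of those on, using the same parameter structure as the configuration example above:

# Ask for AI-reconstructed table JSON and AI-assisted table structure detection
result = client.parse.run(
    input=upload,
    formatting={"table_output_format": "ai_json"},
    enhance={"agentic": [{"scope": "table"}]},
)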
Content is missing or garbled
For scanned documents or low-quality PDFs, enable the agentic text enhancement: enhance.agentic: [{"scope": "text"}]. If the document is password-protected, pass the password in settings.document_password. This may also be due to bad metadata polluting the output, in which case, reach out to Reducto support.
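A minimal sketch of those two settings together (the password value is a placeholder):

# Clean up OCR output and unlock a password-protected document
result = client.parse.run(
    input=upload,
    enhance={"agentic": [{"scope": "text"}]},
    settings={"document_password": "your-password-here"},  # placeholder value
)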
Every response includes a studio_link that opens the job in Reducto Studio. Use it to visually inspect what was extracted and debug any issues.
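For instance, in Python you could print it after parsing; the attribute name here is assumed from the description above rather than verified against the SDK:

# studio_link is assumed to be exposed as a field on the parse response
print(result.studio_link)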