Processing Settings

The settings config group controls how documents are processed: which OCR system to use, how long to wait, what to include in the response, and how to handle special cases like password-protected files.

result = client.parse.run(
    input=upload.file_id,
    settings={
        "ocr_system": "standard",
        "timeout": 300,
        "page_range": {"start": 1, "end": 50}
    }
)

OCR System

Reducto offers two OCR systems that determine how text is extracted from images and scanned documents.

settings={"ocr_system": "standard"}

standard (default): Our primary OCR engine supporting 60+ languages. Handles mixed-language documents automatically.

Supported languages (standard OCR)

Afrikaans, Albanian, Arabic, Armenian, Belarusian, Bengali, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Khmer, Korean, Lao, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Nepali, Norwegian, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Vietnamese, Yiddish

legacy: An older engine optimized for Germanic languages only. Available for backwards compatibility with existing integrations. Use standard for new projects.

Supported languages (legacy OCR)

English, German, Dutch, Norwegian, Swedish, Danish, Icelandic, Afrikaans

The Go SDK uses different OCR system values (highres, multilingual, combined). Use Python, Node.js, or cURL for the standard and legacy options.

For maximum accuracy on difficult documents (handwriting, faded text, poor scans), combine with agentic text mode:

result = client.parse.run(
    input=upload.file_id,
    settings={"ocr_system": "standard"},
    enhance={"agentic": [{"scope": "text"}]}
)

See Agentic Modes for details on when to enable this.

Extraction Mode

Controls how text is extracted from PDFs that have embedded text layers.

settings={"extraction_mode": "hybrid"}

hybrid (default): Uses good quality metadata first, then OCR. Best when processing mixed document sets where some have reliable embedded text and others don’t.

ocr: Uses optical character recognition only, ignoring any embedded text in the PDF. Best for scanned documents, images, or when embedded text is unreliable or corrupted.

metadata: Uses embedded text from PDF metadata only, without OCR. Best for native DOCX/PDFs with reliable text layers where you want faster processing.

Page Range

Process only specific pages to save time and credits. See Page Ranges for complete documentation.

# Pages 1-10 only
settings={"page_range": {"start": 1, "end": 10}}

# Multiple ranges
settings={"page_range": [{"start": 1, "end": 5}, {"start": 20, "end": 25}]}
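When page selections come from user input, it can be convenient to build the page_range value in code. The sketch below assumes only the two shapes shown above (a single dict, or a list of dicts) and, going by the "Pages 1-10" comment, that pages are 1-indexed and inclusive. build_page_range is a hypothetical helper, not part of the SDK.

```python
def build_page_range(ranges):
    """Turn [(start, end), ...] into a page_range settings value.

    Hypothetical helper: returns a single dict for one range and a list
    of dicts for several, mirroring the two shapes shown above.
    """
    if not ranges:
        raise ValueError("at least one (start, end) pair is required")
    items = []
    for start, end in ranges:
        # Assumption: pages are 1-indexed and the end page is inclusive.
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {start}-{end}")
        items.append({"start": start, "end": end})
    return items[0] if len(items) == 1 else items


settings = {"page_range": build_page_range([(1, 5), (20, 25)])}
```

Validating the pairs locally surfaces obviously bad input before a request is sent.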

Timeout

Set a maximum processing time in seconds. If processing exceeds this limit, the request fails rather than hanging indefinitely.

settings={"timeout": 300}  # 5 minutes

If not specified, Reducto uses internal defaults appropriate for the document size.

Password-Protected Documents

For encrypted PDFs that require a password to open:

settings={"document_password": "secret123"}

The password is used to decrypt the document before processing. It’s transmitted securely but not stored.
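To keep passwords out of source code, one option is to read them from the environment when building the settings. This is a minimal sketch: the REDUCTO_DOC_PASSWORD variable name and the password_settings helper are assumptions made for illustration, not part of the SDK.

```python
import os


def password_settings(env_var="REDUCTO_DOC_PASSWORD"):
    """Build the document_password setting from an environment variable.

    Hypothetical helper: returns an empty dict when the variable is unset,
    so the result can be merged into other settings unconditionally.
    """
    password = os.environ.get(env_var)
    return {"document_password": password} if password else {}


settings = {"timeout": 300, **password_settings()}
```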

Return Images

By default, blocks contain only extracted text. Enable return_images to get pre-signed URLs pointing to cropped images of specific block types:

settings={"return_images": ["figure", "table"]}

When enabled, applicable blocks include an image_url field:

{
  "type": "Figure",
  "bbox": {"left": 0.1, "top": 0.2, "width": 0.8, "height": 0.4, "page": 1},
  "content": "Bar chart showing quarterly revenue growth from Q1 to Q4...",
  "image_url": "https://storage.reducto.ai/figures/abc123.png?X-Amz-Expires=3600..."
}

The URL is a pre-signed S3 link valid for a limited time. Download or process the image before expiration.

Options:

figure: Cropped images for figure blocks (charts, diagrams, photos, illustrations)
table: Cropped images for table blocks

Return OCR Data

Returns the raw OCR output with word-level and line-level bounding boxes. This gives you access to the underlying text extraction before Reducto’s layout analysis.

settings={"return_ocr_data": True}

The response result object includes an ocr field containing words and lines arrays:

{
  "job_id": "parse_abc123xyz",
  "result": {
    "type": "full",
    "chunks": [...],
    "ocr": {
      "words": [
        {
          "text": "Revenue",
          "bbox": {"left": 0.12, "top": 0.08, "width": 0.15, "height": 0.02, "page": 1},
          "confidence": 0.98,
          "rotation": 0
        }
      ],
      "lines": [
        {
          "text": "Revenue Report Q4 2024",
          "bbox": {"left": 0.12, "top": 0.08, "width": 0.45, "height": 0.02, "page": 1},
          "confidence": 0.97,
          "rotation": 0
        }
      ]
    }
  }
}

Each word and line includes:

text: The recognized text
bbox: Normalized bounding box (coordinates as fractions of page dimensions)
confidence: OCR confidence score between 0 and 1
rotation: Detected rotation angle in degrees (0-360, counterclockwise)
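Two common uses of this data are mapping the normalized boxes onto a rendered page image and flagging words the OCR engine was unsure about. The sketch below assumes the ocr object has been converted to plain dicts shaped like the example above, and that left/top are measured from the top-left corner of the page. Both helpers are hypothetical, and the 0.9 threshold is an arbitrary illustration.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a normalized bbox to (left, top, right, bottom) in pixels."""
    left = bbox["left"] * page_width
    top = bbox["top"] * page_height
    right = (bbox["left"] + bbox["width"]) * page_width
    bottom = (bbox["top"] + bbox["height"]) * page_height
    return tuple(round(value) for value in (left, top, right, bottom))


def low_confidence_words(ocr, threshold=0.9):
    """Return the words whose confidence falls below the threshold."""
    return [word for word in ocr["words"] if word["confidence"] < threshold]


ocr = {
    "words": [
        {"text": "Revenue", "confidence": 0.98,
         "bbox": {"left": 0.12, "top": 0.08, "width": 0.15, "height": 0.02, "page": 1}},
        {"text": "Rep0rt", "confidence": 0.61,
         "bbox": {"left": 0.28, "top": 0.08, "width": 0.12, "height": 0.02, "page": 1}},
    ]
}
pixel_box = bbox_to_pixels(ocr["words"][0]["bbox"], page_width=1000, page_height=2000)
suspect = [word["text"] for word in low_confidence_words(ocr)]
```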

Persist Results

By default, processed results are stored temporarily and eventually deleted. Enable persistence to keep results indefinitely in long-term storage:

settings={"persist_results": True}

When enabled, you can retrieve results later using the job ID without reprocessing the document. The response includes a job_id that serves as the retrieval key:

{
  "job_id": "parse_abc123xyz",
  "duration": 2.34,
  "result": {...}
}

Retrieve stored results later:

job = client.job.get("parse_abc123xyz")
result = job.result

Persisting results requires opting in to Reducto Studio. Contact support to enable this feature for your organization.

Embed PDF Metadata

Embeds the OCR-extracted text back into the PDF as a hidden text layer. The response includes a pdf_url pointing to the enhanced PDF:

settings={"embed_pdf_metadata": True}

{
  "job_id": "parse_abc123xyz",
  "pdf_url": "https://storage.reducto.ai/pdfs/abc123.pdf?...",
  "result": {...}
}

The returned PDF looks identical to the original but now supports:

Text selection and copy/paste in PDF viewers
Full-text search within the document
Accessibility features (screen readers can read the text)

Force URL Result

By default, Reducto returns the full result inline in the response. For very large documents, this is automatically switched to a URL. You can force URL mode regardless of size:

settings={"force_url_result": True}

When enabled, the response contains a URL instead of inline content:

{
  "job_id": "parse_abc123xyz",
  "result": {
    "type": "url",
    "url": "https://storage.reducto.ai/results/abc123.json?...",
    "result_id": "abc123"
  }
}

Fetch the full result by downloading from the URL. The URL is pre-signed and valid for a limited time.
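Since large documents can switch to URL mode automatically, client code generally needs to handle both result types. The sketch below assumes the result has been converted to a plain dict and that the URL serves JSON (the example URL ends in .json). resolve_result and the injectable fetch parameter are hypothetical names introduced here.

```python
import json
import urllib.request


def _http_get_json(url):
    """Download a URL and decode the body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def resolve_result(result, fetch=_http_get_json):
    """Return inline content as-is, or download it when type is "url".

    Assumption: the document behind the URL is the same JSON that would
    otherwise have been returned inline.
    """
    if result.get("type") == "url":
        return fetch(result["url"])
    return result
```

Passing a custom fetch function makes it easy to add retries or to test the logic without network access.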

Force File Extension

Reducto automatically detects file types from URLs and content. Override this detection when automatic detection fails or returns incorrect results:

settings={"force_file_extension": ".pdf"}

Common scenarios:

URLs without file extensions (e.g., https://api.example.com/document/12345)
URLs with misleading extensions
Pre-signed URLs with complex query parameters that confuse detection

Valid extensions include .pdf, .png, .jpg, .docx, .xlsx, .pptx, and all other supported file types.
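One way to apply the override only when needed is to inspect the URL path and ignore the query string. In this sketch KNOWN_EXTENSIONS covers only the extensions named above, not the full supported list, and extension_override is a hypothetical helper. It cannot detect misleading extensions, which still need a manual override.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Partial list taken from the text above; extend it as needed.
KNOWN_EXTENSIONS = {".pdf", ".png", ".jpg", ".docx", ".xlsx", ".pptx"}


def extension_override(url, fallback):
    """Return settings forcing `fallback` when the URL path lacks a known extension."""
    # urlparse separates the path from the query string, so signed-URL
    # parameters do not interfere with the suffix check.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in KNOWN_EXTENSIONS:
        return {}
    return {"force_file_extension": fallback}


settings = extension_override("https://api.example.com/document/12345", ".pdf")
```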
