The settings config group controls how documents are processed: which OCR system to use, how long to wait, what to include in the response, and how to handle special cases like password-protected files.
OCR System
Reducto offers two OCR systems that determine how text is extracted from images and scanned documents.
standard (default): Our primary OCR engine supporting 100+ languages including English, Spanish, French, German, Chinese, Japanese, Korean, Arabic, Hindi, and many more. Handles mixed-language documents automatically.
legacy: An older engine that only supports Germanic languages (English, German, Dutch, Norwegian, Swedish, Danish). Available for backwards compatibility with existing integrations. Use standard for new projects.
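As a rough illustration of the guidance above, the sketch below returns legacy only for an existing integration whose document languages all fall within the legacy engine's supported set, and standard otherwise. The function name, its parameters, and the use of plain English language names are assumptions made for this example; they are not part of the Reducto API.

```python
# Languages the legacy engine supports, as listed above.
LEGACY_LANGUAGES = {"english", "german", "dutch", "norwegian", "swedish", "danish"}


def pick_ocr_system(languages, existing_legacy_integration=False):
    """Return "legacy" only for an existing integration whose languages it covers."""
    wanted = {lang.strip().lower() for lang in languages}
    if existing_legacy_integration and wanted <= LEGACY_LANGUAGES:
        return "legacy"
    # New projects, and anything outside the Germanic set, use the default engine.
    return "standard"
```

For example, `pick_ocr_system(["English", "Japanese"], existing_legacy_integration=True)` falls back to standard because Japanese is outside the legacy set.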
The Go SDK uses different OCR system values (highres, multilingual, combined). Use Python, Node.js, or cURL for the standard and legacy options.
Extraction Mode
Controls how text is extracted from PDFs that have embedded text layers.
hybrid (default): Uses good-quality embedded text (metadata) first, then OCR. Best when processing mixed document sets where some have reliable embedded text and others don’t.
ocr: Uses optical character recognition only, ignoring any embedded text in the PDF. Best for scanned documents, images, or when embedded text is unreliable or corrupted.
metadata: Uses embedded text from PDF metadata only, without OCR. Best for native DOCX/PDFs with reliable text layers where you want faster processing.
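The three modes above amount to a small decision rule. The sketch below encodes it, assuming the caller knows (or does not know) whether a document's text layer is reliable. The helper name and its parameter are hypothetical; only the returned mode values come from the list above.

```python
from typing import Optional


def pick_extraction_mode(text_layer_reliable: Optional[bool]) -> str:
    """Map what is known about a PDF's embedded text to an extraction mode."""
    if text_layer_reliable is True:
        return "metadata"  # trusted text layer: skip OCR for faster processing
    if text_layer_reliable is False:
        return "ocr"       # scanned or corrupted text: ignore the text layer
    return "hybrid"        # unknown or mixed inputs: keep the default
```

Passing `None` models a mixed batch where the quality of the text layer is not known in advance.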
Page Range
Process only specific pages to save time and credits. See Page Ranges for complete documentation.
Timeout
Set a maximum processing time in seconds. If processing exceeds this limit, the request fails rather than hanging indefinitely.
Password-Protected Documents
For encrypted PDFs that require a password to open, provide the document password in the settings.
Return Images
By default, blocks contain only extracted text. Enable return_images to get pre-signed URLs pointing to cropped images of specific block types:
figure: Cropped images for figure blocks (charts, diagrams, photos, illustrations)
table: Cropped images for table blocks
Blocks of the enabled types then carry the URL in an image_url field.
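To show how the image_url field might be consumed, here is a minimal sketch that gathers the URLs from a list of blocks. It assumes each block is a dict with a "type" key next to the optional "image_url"; that shape and the sample data are illustrative guesses, not a confirmed response schema.

```python
def collect_image_urls(blocks, wanted=("figure", "table")):
    """Return (block_type, image_url) pairs for blocks that carry a cropped image."""
    wanted_lower = {w.lower() for w in wanted}
    pairs = []
    for block in blocks:
        # Compare case-insensitively, since the exact casing of type names is assumed.
        block_type = str(block.get("type", "")).lower()
        url = block.get("image_url")
        if block_type in wanted_lower and url:
            pairs.append((block_type, url))
    return pairs


# Hypothetical blocks, used only to exercise the helper.
sample_blocks = [
    {"type": "Text", "content": "Quarterly results"},
    {"type": "Figure", "image_url": "https://example.com/fig1.png"},
    {"type": "Table", "image_url": "https://example.com/tab1.png"},
    {"type": "Figure"},  # no image_url present
]
print(collect_image_urls(sample_blocks))
```

Blocks without an image_url are skipped rather than treated as errors, so the same helper works whether or not a given block type was enabled.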
Return OCR Data
Returns the raw OCR output with word-level and line-level bounding boxes. This gives you access to the underlying text extraction before Reducto’s layout analysis.
The result object includes an ocr field containing words and lines arrays, whose entries have these fields:
text: The recognized text
bbox: Normalized bounding box (coordinates as fractions of page dimensions)
confidence: OCR confidence score between 0 and 1
rotation: Detected rotation angle in degrees (0-360, counterclockwise)
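Two common uses of this data are scaling the normalized boxes to pixel coordinates and flagging low-confidence words for review. The sketch below does both. It assumes bbox is a dict with left, top, width, and height keys; that layout, the helper names, and the sample values are assumptions for illustration only.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a normalized bbox to pixel units as (left, top, width, height)."""
    return (
        bbox["left"] * page_width,
        bbox["top"] * page_height,
        bbox["width"] * page_width,
        bbox["height"] * page_height,
    )


def low_confidence_words(ocr, threshold=0.8):
    """Return the text of words whose OCR confidence is below the threshold."""
    return [w["text"] for w in ocr.get("words", []) if w["confidence"] < threshold]


# Hypothetical OCR payload shaped like the description above.
sample_ocr = {
    "words": [
        {"text": "Invoice", "confidence": 0.99, "rotation": 0,
         "bbox": {"left": 0.25, "top": 0.5, "width": 0.125, "height": 0.0625}},
        {"text": "T0tal", "confidence": 0.5, "rotation": 0,
         "bbox": {"left": 0.25, "top": 0.75, "width": 0.125, "height": 0.0625}},
    ],
    "lines": [],
}
```

Because the coordinates are fractions of the page dimensions, the same box can be mapped onto a render of any resolution by passing that render's width and height.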
Persist Results
By default, processed results are stored temporarily and eventually deleted. Enable persistence to keep results indefinitely in long-term storage. The response includes a job_id that serves as the retrieval key.
Requires opting in to Reducto Studio. Contact support to enable this feature for your organization.
Embed PDF Metadata
Embeds the OCR-extracted text back into the PDF as a hidden text layer. The response includes a pdf_url pointing to the enhanced PDF. The hidden text layer enables:
- Text selection and copy/paste in PDF viewers
- Full-text search within the document
- Accessibility features (screen readers can read the text)
Force URL Result
By default, Reducto returns the full result inline in the response. For very large documents, this is automatically switched to a URL. You can force URL mode regardless of size.
Force File Extension
Reducto automatically detects file types from URLs and content. Override this detection when automatic detection fails or returns incorrect results, for example:
- URLs without file extensions (e.g., https://api.example.com/document/12345)
- URLs with misleading extensions
- Pre-signed URLs with complex query parameters that confuse detection
Supported extensions include .pdf, .png, .jpg, .docx, .xlsx, .pptx, and all other supported file types.
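A client can spot the first of these cases before sending a request. The sketch below checks whether a URL's path carries any extension, ignoring the query string so that pre-signed parameters do not interfere. The helper names are hypothetical, and the check covers only missing extensions; a misleading extension cannot be detected from the URL alone.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_extension(url):
    """Return the lowercase extension in the URL path, or "" if there is none."""
    path = urlparse(url).path  # drops the query string and fragment
    return PurePosixPath(path).suffix.lower()


def needs_forced_extension(url):
    """True when the URL path gives no file-extension hint."""
    return url_extension(url) == ""
```

For `https://api.example.com/document/12345` the helper reports that an override is worth setting, while a pre-signed URL whose path ends in `report.PDF` yields `.pdf` regardless of its query parameters.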