Async Parse
import os

from reducto import Reducto

client = Reducto(
    api_key=os.environ.get("REDUCTO_API_KEY"),  # This is the default and can be omitted
)
response = client.parse.run_job(
    document_url="string",
)
print(response.job_id)
{
  "job_id": "<string>"
}
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
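For callers who build HTTP requests by hand rather than through the client shown above, the header format can be sketched as follows. This is a minimal illustration of the Bearer form only; the helper name build_auth_headers and the fallback token are hypothetical.

```python
import os


def build_auth_headers(token: str) -> dict[str, str]:
    """Return HTTP headers carrying the token in 'Bearer <token>' form."""
    if not token:
        raise ValueError("auth token is empty")
    return {"Authorization": f"Bearer {token}"}


# REDUCTO_API_KEY is the environment variable used in the example above;
# "example-token" is a placeholder so the sketch runs without credentials.
headers = build_auth_headers(os.environ.get("REDUCTO_API_KEY", "example-token"))
```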
Body
The URL of the document to be processed. You can provide one of the following:
- A publicly available URL
- A presigned S3 URL
- A reducto:// prefixed URL obtained from the /upload endpoint after directly uploading a document
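The three accepted URL forms can be told apart on the client side before submitting a job. The sketch below is an assumption-laden illustration, not part of the API: the function name classify_document_url and its return labels are hypothetical, and presigned S3 URLs are recognized only by the signature query parameters that AWS presigning normally adds.

```python
from urllib.parse import parse_qs, urlparse


def classify_document_url(url: str) -> str:
    """Label a document URL as 'reducto-upload', 'presigned-s3', or 'public'."""
    parsed = urlparse(url)
    if parsed.scheme == "reducto":
        # Returned by the /upload endpoint after a direct upload.
        return "reducto-upload"
    if parsed.scheme in ("http", "https"):
        params = {name.lower() for name in parse_qs(parsed.query)}
        # Heuristic: SigV4 presigned URLs carry X-Amz-Signature,
        # older SigV2 ones carry Signature.
        if "x-amz-signature" in params or "signature" in params:
            return "presigned-s3"
        return "public"
    raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")
```

A check like this only catches obviously unsupported inputs early; whether the URL is actually reachable is still decided when the job runs.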
The mode to use for OCR. If agentic is enabled, table OCR is automatically edited at a small additional cost.
Available options: standard, agentic
The mode to use for extraction.
Available options: ocr, metadata, hybrid
The configuration options for chunking.
The mode to use for chunking. Section chunks according to sections in the document. Page chunks according to pages. Disabled returns a single chunk.
Available options: variable, section, page, block, disabled
The approximate size of chunks (in characters) that the document will be split into. Defaults to None, in which case the chunk size varies between 250 and 1500 characters.
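To make "approximate size in characters" concrete, the sketch below greedily packs paragraphs into chunks that stay near a target size. It is an illustrative assumption about how size-targeted chunking can behave, not Reducto's actual algorithm; pack_paragraphs and its arguments are hypothetical names.

```python
def pack_paragraphs(paragraphs: list[str], target_size: int) -> list[str]:
    """Greedily join paragraphs so each chunk stays at or under target_size.

    A single paragraph longer than target_size becomes its own chunk,
    which is why the size is only approximate.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > target_size:
            # Adding this paragraph would overshoot: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```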
The configuration options for figure summarization.
If figure summarization should be performed.
Add information to the prompt for figure summarization.
If the figure summary prompt should override our default prompt.
A list of block types to filter from chunk content.
Available options: Header, Footer, Title, Section Header, Page Number, List Item, Figure, Table, Key Value, Text, Comment, Discard
Force the result to be returned in URL form (by default only used for very large responses).
The OCR system to use. Highres is recommended for documents with English characters.
Available options: highres, multilingual, combined
The mode to use for table output. Dynamic returns md for simpler tables and html for more complex tables.
Available options: html, json, md, jsonbbox, dynamic, ai_json
A flag to indicate if consecutive tables with the same number of columns should be merged.
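The merge rule described by this flag can be pictured with a small sketch. It assumes tables are represented as lists of rows and compares only the column count of the first row; the service's real merging logic may consider more than that, and merge_consecutive_tables is a hypothetical name.

```python
Table = list[list[str]]


def merge_consecutive_tables(tables: list[Table]) -> list[Table]:
    """Merge each table into the previous one when their column counts match."""
    merged: list[Table] = []
    for table in tables:
        previous = merged[-1] if merged else None
        if previous and table and len(previous[0]) == len(table[0]):
            # Same number of columns as the preceding table: append its rows.
            merged[-1] = previous + list(table)
        else:
            merged.append(list(table))
    return merged
```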
A flag to indicate if the hierarchy of the document should be continued from chunk to chunk.
If line breaks should be preserved in the text.
Force the URL to be downloaded as a specific file extension (e.g. .png).
The configuration options for large table chunking (currently only supported on spreadsheet and CSV files).
If large tables should be chunked into smaller tables, currently only supported on spreadsheet and CSV files.
The max row/column size for a table to be chunked. Defaults to 50. Header rows/columns are persisted based on heuristics.
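As a rough picture of large table chunking with persisted headers, the sketch below splits a table into pieces of at most 50 body rows and repeats the header on each piece. It assumes the first row is the header and splits by rows only; the service detects headers heuristically and also handles columns, so treat chunk_table as a hypothetical simplification.

```python
Row = list[str]


def chunk_table(rows: list[Row], max_rows: int = 50, header_rows: int = 1) -> list[list[Row]]:
    """Split rows into chunks of at most max_rows body rows, repeating the header."""
    header, body = rows[:header_rows], rows[header_rows:]
    if not body:
        # Nothing to split: return the header alone, or nothing for an empty table.
        return [header] if header else []
    return [header + body[start:start + max_rows] for start in range(0, len(body), max_rows)]
```

With 120 body rows and the default limit, this yields three chunks holding 50, 50, and 20 body rows, each starting with the same header row.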
On a spreadsheet, the algorithm that is used to split up sheets into multiple tables.
Available options: default, disabled
If True, add page markers to the output (e.g. [[PAGE 1 BEGINS HERE]] and [[PAGE 1 ENDS HERE]] added as blocks to the content). Defaults to False.
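When page markers are enabled, downstream code can use them to recover per-page text. The sketch below assumes the parsed content has already been joined into one string and that markers appear exactly in the form quoted above; split_pages is a hypothetical helper, not part of the SDK.

```python
import re

# Matches "[[PAGE n BEGINS HERE]] ... [[PAGE n ENDS HERE]]" for the same n.
PAGE_PATTERN = re.compile(
    r"\[\[PAGE (\d+) BEGINS HERE\]\](.*?)\[\[PAGE \1 ENDS HERE\]\]",
    re.DOTALL,
)


def split_pages(content: str) -> dict[int, str]:
    """Map each page number to the text found between its begin and end markers."""
    return {
        int(match.group(1)): match.group(2).strip()
        for match in PAGE_PATTERN.finditer(content)
    }
```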
If True, remove text formatting from the output (e.g. hyphens for list items). Defaults to False.
If True, return OCR data in the result. Defaults to False.
Password to decrypt password-protected documents.
The configuration options for enrichment.
If enabled, a large language/vision model will be used to postprocess the extracted content. Note: enabling enrich requires tables to be output in Markdown format. Defaults to False.
Add information to the prompt for enrichment.
Instead of using LibreOffice, when enabled, this flag uses a Windows VM to convert files. This is slower but more accurate.
Use an experimental checkbox detection model to add checkboxes to the output. Defaults to False.
Use an experimental equation detection model to add equations to the output. Defaults to False.
Use an orientation model to detect and rotate pages as needed. Defaults to True.
Add <u> tags around underlined text, and surround strikethroughs and underlines with <change> tags. Defaults to False.
Add <sub> tags around subscripts and <sup> tags around superscripts. Defaults to False.
If figure images should be returned in the result. Defaults to False.
You probably shouldn't use this. If True, filter out boxes with width greater than 50% of the document width. Defaults to False.
The mode to use for webhook delivery. Defaults to 'disabled'. We recommend using 'svix' for production environments.
Available options: disabled, svix, direct
The URL to send the webhook to (if using direct webhook).
JSON metadata included in the webhook request body.
A list of Svix channels the message will be delivered to. Omit to send to all channels.
If True, attempts to process the job with priority if the user has priority processing budget available; by default, sync jobs are prioritized above async jobs.
Response
{
  "job_id": "<string>"
}