Response Format
Learn about the response format for our parsing pipeline.
The parsing response format is optimized for flexibility in retrieval-augmented generation (RAG) pipelines. If you only need a Markdown representation of a document, you can disable chunking altogether and read response['result']['chunks'][0]['content'].
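As a minimal sketch of the shortcut above, the helper below pulls the single chunk's content out of an already-decoded response. The function name document_markdown is hypothetical, and the sketch assumes the response has the "full" result shape shown in the example that follows.

```python
def document_markdown(response: dict) -> str:
    """Return the content of the first chunk of a parsed response.

    Assumes chunking was disabled, so the whole document is expected
    to arrive as a single chunk (hypothetical helper, not part of any SDK).
    """
    chunks = response["result"]["chunks"]
    if not chunks:
        raise ValueError("response contains no chunks")
    return chunks[0]["content"]
```

If chunking is left enabled, the first chunk holds only part of the document, so this shortcut is only appropriate for the unchunked case.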
{
  "result": {
    "type": "full",
    "chunks": [
      {
        "content": "Chunk content optimized for passing to an LLM.",
        "embed": "Chunk content optimized for passing to an embedding model.",
        "blocks": [
          {
            "type": "Text", // Block type (Text, Table, Figure, etc.)
            "bbox": {
              // All bbox values normalized to [0,1] range
              "left": 0.1, // Distance from left edge (10%)
              "top": 0.2, // Distance from top edge (20%)
              "width": 0.3, // Width as % of page width (30%)
              "height": 0.4, // Height as % of page height (40%)
              "page": 1, // Current page number (1-indexed)
              "original_page": 10 // Original doc page number
            },
            "content": "Text content",
            "image_url": null // presigned url to download figure/table image
          }
          // ...
        ]
      }
      // ...
    ]
  }
}
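Because the bbox values are normalized to the [0,1] range, they have to be scaled by the rendered page size before they can be drawn on an image. The sketch below shows one way to do that; the function name bbox_to_pixels and the page dimensions are assumptions for illustration, and the origin is taken to be the top-left corner, as the "distance from left/top edge" comments suggest.

```python
def bbox_to_pixels(bbox: dict, page_width: int, page_height: int) -> tuple:
    """Convert a normalized bbox into (left, top, right, bottom) pixels.

    page_width and page_height are the pixel size of whatever image the
    caller rendered for the page (an assumption; the response itself
    carries only normalized values).
    """
    left = round(bbox["left"] * page_width)
    top = round(bbox["top"] * page_height)
    right = round((bbox["left"] + bbox["width"]) * page_width)
    bottom = round((bbox["top"] + bbox["height"]) * page_height)
    return left, top, right, bottom
```

For the example block above on a page rendered at 1000 x 2000 pixels, this yields a box from (100, 400) to (400, 1200).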
FAQ
We allow you to specify the page range within the original document that you want to parse. This is controlled by advanced_options -> page_range.
In these cases, it is useful to know both the page number of the block within the parsed pages (page; e.g. of the pages parsed, this block was on the 1st page) and its page number in the source document (original_page; e.g. of the original document, this block was on the 10th page).
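One place the distinction matters is when citing results back to the source file. The following sketch groups blocks by their original_page value so that references point at the page numbers a reader would see in the original document; the helper name blocks_by_original_page is hypothetical.

```python
from collections import defaultdict


def blocks_by_original_page(chunks: list) -> dict:
    """Group every block in the given chunks by bbox['original_page']."""
    pages = defaultdict(list)
    for chunk in chunks:
        # A chunk without a "blocks" key is treated as having no blocks.
        for block in chunk.get("blocks", []):
            pages[block["bbox"]["original_page"]].append(block)
    return dict(pages)
```

Keying on page instead would give positions relative to the parsed page range rather than to the source document.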
For the most part, the content and embed fields are the same. However, Reducto's API can apply improvements that optimize for embedding performance. One example is the table summarization feature. In this case, the embed field contains the summarized table content, while the content field contains the original table content. We have found that this improves downstream performance for embedding models, which are not as capable of reasoning over complex tabular HTML.
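In an indexing pipeline, this split usually means embedding one field and storing the other. The sketch below builds one record per chunk along those lines; the record keys (id, embed_text, llm_text) and the fallback to content when embed is missing are assumptions made for illustration, not part of the response format.

```python
def build_index_records(chunks: list) -> list:
    """Pair the text to embed with the text to hand to the LLM.

    embed_text is what would be sent to the embedding model;
    llm_text is what would be placed in the prompt after retrieval.
    """
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": i,
            # Fall back to content if embed is absent (an assumption).
            "embed_text": chunk.get("embed") or chunk["content"],
            "llm_text": chunk["content"],
        })
    return records
```

With table summarization enabled, embed_text would carry the summary while llm_text keeps the original table content.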
For longer documents, the API may return a URL instead of the full result. In this case, the result type will be url and the result will contain a URL pointing to a JSON array of chunks. For smaller documents, the result type will be full and the result will contain the chunks directly in the response. The chunks have the same structure in both cases; the only difference is whether they are returned directly or need to be fetched from a URL. You can read more about the URL response format here.
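Client code can hide this difference behind a single helper. The sketch below is one possible approach: it assumes the URL is stored under a "url" key on the result object (the key name is an assumption, so check it against the URL response format documentation), and it takes an optional fetch callable so the download step can be swapped out or stubbed.

```python
import json
from urllib.request import urlopen


def _http_get(url: str) -> bytes:
    """Download the body at url using the standard library."""
    with urlopen(url) as resp:
        return resp.read()


def resolve_chunks(result: dict, fetch=_http_get) -> list:
    """Return the list of chunks for either result type."""
    if result["type"] == "full":
        return result["chunks"]
    if result["type"] == "url":
        # Assumed field name: the URL points to a JSON array of chunks.
        return json.loads(fetch(result["url"]))
    raise ValueError(f"unexpected result type: {result['type']!r}")
```

Calling resolve_chunks(response["result"]) then yields the same chunk structure regardless of document size.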
You can enable the return_figure_images parameter in the experimental options to get image URLs for figures in the document. Similarly, you can enable the return_table_images parameter to get image URLs for tables. When enabled, the corresponding blocks will include an image_url field that points to an image of the figure or table.
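Once these options are enabled, the image URLs can be gathered by walking the blocks of each chunk. The sketch below collects them along with the block type and page; the helper name collect_image_urls is hypothetical, and blocks whose image_url is null or missing are skipped.

```python
def collect_image_urls(chunks: list) -> list:
    """Return (block type, page, image_url) for blocks that have an image."""
    found = []
    for chunk in chunks:
        for block in chunk.get("blocks", []):
            url = block.get("image_url")
            if url:  # null/absent image_url means no image was returned
                found.append((block["type"], block["bbox"]["page"], url))
    return found
```

Since the example response describes these as presigned URLs, it is reasonable to assume they should be downloaded soon after the response is received rather than stored as long-lived links.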