# Parse

> Convert documents into structured JSON with text, tables, and figures

Parse converts your documents into structured JSON. It runs OCR, detects document layout (headers, paragraphs, tables, figures), and returns content organized into chunks for LLM and RAG workflows.

Each element includes its type, page position, and confidence score. Parse handles multi-column text, nested tables, forms with handwriting, rotated pages, and documents that mix text with charts and images.

<Tip>
  **Try it live:** See Parse in action with a [sample bank statement in Reducto Studio](https://studio.reducto.ai).
</Tip>

**File size limits:** Upload files up to 100MB directly via the [Upload endpoint](/upload), or up to 5GB via [presigned URL](/upload/large-files). You can also pass public URLs or presigned S3/GCS/Azure URLs directly.
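
The size limits above can be turned into a pre-upload check. This is a minimal sketch: `choose_upload_route` is a hypothetical helper (not part of the SDK), and treating "MB" and "GB" as binary units is an assumption.

```python
# Limits quoted in the text above; binary units (1024-based) are an assumption.
DIRECT_UPLOAD_LIMIT = 100 * 1024 * 1024          # 100MB via the Upload endpoint
PRESIGNED_UPLOAD_LIMIT = 5 * 1024 * 1024 * 1024  # 5GB via presigned URL


def choose_upload_route(size_bytes: int) -> str:
    """Return "direct" or "presigned" for a file of the given size (hypothetical helper)."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= DIRECT_UPLOAD_LIMIT:
        return "direct"
    if size_bytes <= PRESIGNED_UPLOAD_LIMIT:
        return "presigned"
    raise ValueError("file exceeds the 5GB presigned upload limit")
```

For a local file, the size can come from `Path("invoice.pdf").stat().st_size`.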

***

## Quick Start

<CodeGroup>
  ```python Python theme={null}
  from pathlib import Path
  from reducto import Reducto

  client = Reducto()

  upload = client.upload(file=Path("invoice.pdf"))
  result = client.parse.run(input=upload.file_id)

  for chunk in result.result.chunks:
      print(chunk.content)
  ```

  ```javascript Node.js theme={null}
  import Reducto from 'reductoai';
  import fs from 'fs';

  const client = new Reducto();

  const upload = await client.upload({
    file: fs.createReadStream('invoice.pdf'),
  });
  const result = await client.parse.run({ input: upload.file_id });

  for (const chunk of result.result.chunks) {
    console.log(chunk.content);
  }
  ```

  ```go Go theme={null}
  package main

  import (
      "context"
      "fmt"
      "io"
      "os"

      reducto "github.com/reductoai/reducto-go-sdk"
      "github.com/reductoai/reducto-go-sdk/option"
      "github.com/reductoai/reducto-go-sdk/shared"
  )

  func main() {
      client := reducto.NewClient(option.WithAPIKey(os.Getenv("REDUCTO_API_KEY")))

      file, _ := os.Open("invoice.pdf")
      defer file.Close()
      upload, _ := client.Upload(context.Background(), reducto.UploadParams{
          File: reducto.F[io.Reader](file),
      })

      result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
          ParseConfig: reducto.ParseConfigParam{
              DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
                  shared.UnionString(upload.FileID),
              ),
          },
      })

      if result.Result.Type == shared.ParseResponseResultTypeFull {
          chunks := result.Result.Chunks.([]shared.ParseResponseResultFullResultChunk)
          for _, chunk := range chunks {
              fmt.Println(chunk.Content)
          }
      }
  }
  ```

  ```bash cURL theme={null}
  # Upload
  FILE_ID=$(curl -s -X POST https://platform.reducto.ai/upload \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -F "file=@invoice.pdf" | jq -r '.file_id')

  # Parse
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"input\": \"$FILE_ID\"}"
  ```
</CodeGroup>

***

## What You Get Back

```json theme={null}
{
  "job_id": "7600c8c5-a52f-49d2-8a7d-d75d1b51e141",
  "duration": 3.89,
  "result": {
    "type": "full",
    "chunks": [
      {
        "content": "# Invoice\n\nBill To: Acme Corp\n123 Main St...",
        "embed": "# Invoice\n\nBill To: Acme Corp...",
        "blocks": [
          {
            "type": "Title",
            "content": "Invoice",
            "bbox": { "left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1 },
            "confidence": "high"
          }
        ]
      }
    ]
  },
  "usage": { "num_pages": 1, "credits": 2.0 },
  "studio_link": "https://studio.reducto.ai/job/7600c8c5-..."
}
```

**Key fields:**

| Field              | What it is                                                                                                                                                         |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `chunks[].content` | The extracted content, formatted as Markdown (headers become `#`, tables become Markdown/HTML tables). Ready to pass to an LLM.                                    |
| `chunks[].embed`   | Same content but optimized for embeddings. When figure/table summaries are enabled, this field contains natural language descriptions instead of raw table markup. |
| `chunks[].blocks`  | The individual elements (paragraphs, tables, figures) with their positions and types. Useful for highlighting or linking back to source.                           |
| `result.type`      | Either `"full"` (content inline) or `"url"` (content at a URL). Large documents return `"url"` to avoid HTTP size limits.                                          |

<Card title="Response Format Details" icon="brackets-curly" href="/parse/response-format">
  Full breakdown of chunks, blocks, bounding boxes, and confidence scores.
</Card>
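
To show how `blocks` can link content back to the source page, here is a sketch that walks the raw JSON response and scales each bounding box to pixel coordinates. It works on plain dictionaries shaped like the sample above. `iter_blocks` and `bbox_to_pixels` are hypothetical helper names, and reading the `bbox` values as fractions of the page size is an assumption based on the sample values.

```python
from typing import Iterator


def iter_blocks(response: dict) -> Iterator[dict]:
    """Yield every block from an inline ("full") parse response."""
    result = response["result"]
    if result["type"] != "full":
        raise ValueError("content is not inline; fetch it from the result URL first")
    for chunk in result["chunks"]:
        yield from chunk.get("blocks", [])


def bbox_to_pixels(bbox: dict, page_width: int, page_height: int) -> tuple:
    """Scale a bbox to pixel (left, top, right, bottom).

    Assumes bbox values are fractions of the page size (0.0 to 1.0).
    """
    left = bbox["left"] * page_width
    top = bbox["top"] * page_height
    right = (bbox["left"] + bbox["width"]) * page_width
    bottom = (bbox["top"] + bbox["height"]) * page_height
    return tuple(round(value) for value in (left, top, right, bottom))


# Trimmed copy of the sample response shown above.
sample = {
    "result": {
        "type": "full",
        "chunks": [
            {
                "content": "# Invoice",
                "blocks": [
                    {
                        "type": "Title",
                        "content": "Invoice",
                        "bbox": {"left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1},
                        "confidence": "high",
                    }
                ],
            }
        ],
    }
}

for block in iter_blocks(sample):
    box = bbox_to_pixels(block["bbox"], page_width=1000, page_height=1000)
    print(block["type"], block["bbox"]["page"], box)
```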

***

## Input Options

The `input` field accepts four formats:

1. **Upload response** (`reducto://...`): After uploading via `/upload`, use the returned `file_id`. This is the most common method for local files.
2. **Public URL**: Any publicly accessible URL. Reducto fetches the file directly.
3. **Presigned URL**: S3, GCS, or Azure Blob presigned URLs work. Useful when files are in your cloud storage.
4. **Previous job ID** (`jobid://...`): Reprocess a document from a previous parse job without re-uploading. Useful for testing different configurations.

<CodeGroup>
  ```python Python theme={null}
  # From upload
  result = client.parse.run(input=upload.file_id)

  # Public URL
  result = client.parse.run(input="https://example.com/doc.pdf")

  # Presigned S3 URL  
  result = client.parse.run(input="https://bucket.s3.amazonaws.com/doc.pdf?X-Amz-...")

  # Reprocess previous job
  result = client.parse.run(input="jobid://7600c8c5-a52f-49d2-8a7d-d75d1b51e141")
  ```

  ```javascript Node.js theme={null}
  // From upload
  const fromUpload = await client.parse.run({ input: upload.file_id });

  // Public URL
  const fromUrl = await client.parse.run({ input: 'https://example.com/doc.pdf' });

  // Presigned S3 URL
  const fromPresigned = await client.parse.run({ input: 'https://bucket.s3.amazonaws.com/doc.pdf?X-Amz-...' });

  // Reprocess previous job
  const fromJob = await client.parse.run({ input: 'jobid://7600c8c5-a52f-49d2-8a7d-d75d1b51e141' });
  ```

  ```go Go theme={null}
  // From upload
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
      },
  })

  // Public URL (reuses the variables declared above, so plain assignment)
  result, _ = client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString("https://example.com/doc.pdf"),
          ),
      },
  })
  ```

  ```bash cURL theme={null}
  # From upload
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"input": "reducto://your-file-id"}'

  # Public URL
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"input": "https://example.com/doc.pdf"}'

  # Reprocess previous job
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"input": "jobid://7600c8c5-a52f-49d2-8a7d-d75d1b51e141"}'
  ```
</CodeGroup>
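
When input strings come from user data, it can help to check which of the four formats a value matches before sending it. The sketch below is one way to do that locally. `classify_input` is a hypothetical helper, and detecting presigned URLs by their signature query parameter (`X-Amz-Signature` for S3, `X-Goog-Signature` for GCS, `sig` for Azure SAS) is a heuristic assumption, not something the API requires.

```python
from urllib.parse import parse_qs, urlparse

# Query parameters that typically carry the signature in presigned URLs.
SIGNATURE_KEYS = ("x-amz-signature", "x-goog-signature", "sig")


def classify_input(value: str) -> str:
    """Label an `input` string as upload, job, presigned_url, or public_url."""
    parsed = urlparse(value)
    if parsed.scheme == "reducto":
        return "upload"
    if parsed.scheme == "jobid":
        return "job"
    if parsed.scheme in ("http", "https"):
        query_keys = {key.lower() for key in parse_qs(parsed.query, keep_blank_values=True)}
        signed = any(key in query_keys for key in SIGNATURE_KEYS)
        return "presigned_url" if signed else "public_url"
    raise ValueError(f"unsupported input format: {value!r}")
```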

***

## Sync vs Async

Parse has both synchronous (`/parse`) and asynchronous (`/parse_async`) endpoints. Use async for large documents or when you need webhook delivery.

<Card title="Sync vs Async Guide" icon="clock" href="/workflows/async-overview">
  When to use each, how priority works, webhook setup.
</Card>
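
The rule of thumb above can be written as a small routing function. This is only a sketch: `choose_endpoint` is a hypothetical helper, and the `large_doc_pages` default of 50 is an arbitrary placeholder, not a documented limit. Tune it to your own latency budget.

```python
def choose_endpoint(num_pages: int, needs_webhook: bool, large_doc_pages: int = 50) -> str:
    """Pick the sync or async Parse endpoint.

    `large_doc_pages` is a placeholder threshold (assumption), not an API limit.
    """
    if needs_webhook or num_pages >= large_doc_pages:
        return "/parse_async"
    return "/parse"
```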

***

## Configuration

Parse has several configuration groups. Here are the most commonly changed options:

### Chunking

By default, Parse returns the entire document as one chunk. For RAG applications, you want smaller chunks that can be embedded and retrieved independently.

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      retrieval={
          "chunking": {"chunk_mode": "variable"}
      }
  )
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    retrieval: {
      chunking: { chunk_mode: 'variable' }
    }
  });
  ```

  ```go Go theme={null}
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
      },
      Retrieval: reducto.F(reducto.RetrievalParam{
          Chunking: reducto.F(reducto.RetrievalChunkingUnionParam{
              OfVariableChunking: &reducto.VariableChunkingConfigParam{
                  ChunkMode: reducto.F(reducto.VariableChunkingConfigChunkModeVariable),
              },
          }),
      }),
  })
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "retrieval": {
        "chunking": {"chunk_mode": "variable"}
      }
    }'
  ```
</CodeGroup>

| Mode       | Behavior                                                                             |
| ---------- | ------------------------------------------------------------------------------------ |
| `disabled` | One chunk for the whole document (default)                                           |
| `variable` | Splits at semantic boundaries (sections, tables, figures stay intact). Best for RAG. |
| `page`     | One chunk per page                                                                   |
| `section`  | Splits at section headers                                                            |

[Full chunking options →](/configs/parse/chunking-methods)
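
Once a document is chunked, each chunk usually becomes one record in a vector store. The sketch below shows one possible record shape, built from plain dictionaries like those in the JSON response. `to_embedding_records` and the record keys (`id`, `text`, `content`, `pages`) are hypothetical; falling back to `content` when `embed` is empty is an assumption.

```python
def to_embedding_records(job_id: str, chunks: list) -> list:
    """Build one record per chunk: `text` is embedded, `content` is shown to the LLM."""
    records = []
    for index, chunk in enumerate(chunks):
        # Prefer the embedding-optimized text; fall back to the Markdown content.
        text = chunk.get("embed") or chunk["content"]
        # Collect the pages this chunk touches, for citations back to the source.
        pages = sorted(
            {block["bbox"]["page"] for block in chunk.get("blocks", []) if "bbox" in block}
        )
        records.append(
            {
                "id": f"{job_id}:{index}",
                "text": text,
                "content": chunk["content"],
                "pages": pages,
            }
        )
    return records
```

Keeping both fields lets you embed the `embed` text while still passing the original `content` to the LLM at answer time.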

### Table Output Format

Controls how tables appear in the output.

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      formatting={
          "table_output_format": "html"
      }
  )
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    formatting: {
      table_output_format: 'html'
    }
  });
  ```

  ```go Go theme={null}
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
      },
      Formatting: reducto.F(reducto.FormattingParam{
          TableOutputFormat: reducto.F(reducto.FormattingTableOutputFormatHTML),
      }),
  })
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "formatting": {
        "table_output_format": "html"
      }
    }'
  ```
</CodeGroup>

| Format    | When to use                                                       |
| --------- | ----------------------------------------------------------------- |
| `dynamic` | Auto-selects HTML or Markdown based on table complexity (default) |
| `html`    | Complex tables with merged cells, nested headers                  |
| `md`      | Simple tables, Markdown-based workflows                           |
| `json`    | Programmatic processing, need cell-level access                   |
| `csv`     | Export to spreadsheets                                            |

[Full table format options →](/configs/parse/table-output-formats)
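
The "When to use" column can be encoded as a small decision helper. In this sketch, `pick_table_format` and its keyword flags are hypothetical names, and the precedence between flags is a design choice rather than API behavior. Only the format strings come from the table above.

```python
def pick_table_format(
    *,
    spreadsheet_export: bool = False,
    cell_access: bool = False,
    merged_cells: bool = False,
    markdown_only: bool = False,
) -> dict:
    """Return a `formatting` dict for the first matching need, else the default."""
    if spreadsheet_export:
        table_format = "csv"
    elif cell_access:
        table_format = "json"
    elif merged_cells:
        table_format = "html"
    elif markdown_only:
        table_format = "md"
    else:
        table_format = "dynamic"
    return {"table_output_format": table_format}
```

The returned dictionary has the same shape as the `formatting` argument in the examples above.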

### Figure Summaries

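By default, Parse uses a vision model to generate descriptions for figures and images. This helps with RAG (the `embed` field contains the description) but adds latency. The examples below pass the setting explicitly; set `summarize_figures` to `false` to skip the descriptions when latency matters more.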

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      enhance={
          "summarize_figures": True
      }
  )
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    enhance: {
      summarize_figures: true
    }
  });
  ```

  ```go Go theme={null}
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
      },
      Enhance: reducto.F(reducto.EnhanceParam{
          SummarizeFigures: reducto.F(true),
      }),
  })
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "enhance": {
        "summarize_figures": true
      }
    }'
  ```
</CodeGroup>

### Agentic Mode

Agentic mode uses an LLM to review and correct parsing output, at the cost of extra latency and additional credit usage. Enable a scope when you see the matching symptoms:

* **`scope: "text"`**: Handwritten text, faded scans, documents with unusual fonts, or when you see garbled characters in the output.
* **`scope: "table"`**: Tables with misaligned columns, merged cells that didn't parse correctly, or numbers that appear in wrong columns.
* **`scope: "figure"`**: Charts and graphs that need data extraction, including [advanced chart extraction](https://reducto.ai/blog/reducto-chart-extraction) with structured data output.

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      enhance={
          "agentic": [
              {"scope": "text"},
              {"scope": "table"},
              {"scope": "figure", "advanced_chart_agent": True}
          ]
      }
  )
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    enhance: {
      agentic: [
        { scope: 'text' },
        { scope: 'table' },
        { scope: 'figure', advanced_chart_agent: true }
      ]
    }
  });
  ```

  ```go Go theme={null}
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
      },
      Enhance: reducto.F(reducto.EnhanceParam{
          Agentic: reducto.F([]reducto.AgenticScopeParam{
              {Scope: reducto.F(reducto.AgenticScopeScopeText)},
              {Scope: reducto.F(reducto.AgenticScopeScopeTable)},
              {Scope: reducto.F(reducto.AgenticScopeScopeFigure), AdvancedChartAgent: reducto.F(true)},
          }),
      }),
  })
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "enhance": {
        "agentic": [
          {"scope": "text"},
          {"scope": "table"},
          {"scope": "figure", "advanced_chart_agent": true}
        ]
      }
    }'
  ```
</CodeGroup>

Don't enable agentic mode for clean digital PDFs (native text, not scanned). They parse correctly without it, so it only adds latency.
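
Because each scope adds latency, it can be useful to build the `enhance` value from only the scopes a document needs. A minimal sketch follows. `agentic_config` is a hypothetical helper, and attaching `advanced_chart_agent` only to the `figure` scope mirrors the examples above rather than a documented rule.

```python
VALID_SCOPES = ("text", "table", "figure")


def agentic_config(scopes, advanced_chart_agent: bool = False) -> dict:
    """Build an `enhance` dict with one agentic entry per requested scope."""
    entries = []
    for scope in scopes:
        if scope not in VALID_SCOPES:
            raise ValueError(f"unknown agentic scope: {scope!r}")
        entry = {"scope": scope}
        # Assumption: the chart agent flag only applies to the figure scope.
        if scope == "figure" and advanced_chart_agent:
            entry["advanced_chart_agent"] = True
        entries.append(entry)
    return {"agentic": entries}
```

For a scanned form with messy tables, `agentic_config(["text", "table"])` produces a value that can be passed as the `enhance` argument.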

### Filter Blocks

Remove specific content types from the output. The blocks still appear in `blocks` metadata but are excluded from `content` and `embed`.

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      retrieval={
          "filter_blocks": ["Header", "Footer", "Page Number"]
      }
  )
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    retrieval: {
      filter_blocks: ['Header', 'Footer', 'Page Number']
    }
  });
  ```

  ```go Go theme={null}
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
      },
      Retrieval: reducto.F(reducto.RetrievalParam{
          FilterBlocks: reducto.F([]reducto.RetrievalFilterBlock{
              reducto.RetrievalFilterBlockHeader,
              reducto.RetrievalFilterBlockFooter,
              reducto.RetrievalFilterBlockPageNumber,
          }),
      }),
  })
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "retrieval": {
        "filter_blocks": ["Header", "Footer", "Page Number"]
      }
    }'
  ```
</CodeGroup>

This is useful for RAG, where repeated headers and footers would otherwise pollute search results.
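
Since filtered blocks remain in the `blocks` metadata, you may want to drop them client-side as well, for example before drawing highlights. The sketch below does that on plain dictionaries from the JSON response. `drop_block_types` is a hypothetical helper, not an SDK function.

```python
def drop_block_types(chunks: list, excluded) -> list:
    """Return copies of the chunks without blocks whose type is in `excluded`."""
    excluded = set(excluded)
    return [
        {
            **chunk,
            "blocks": [
                block for block in chunk.get("blocks", []) if block["type"] not in excluded
            ],
        }
        for chunk in chunks
    ]
```

The input list is left unchanged, so the full metadata is still available if needed later.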

### Page Range

Process only specific pages.

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      settings={
          "page_range": {"start": 1, "end": 10}
      }
  )
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    settings: {
      page_range: { start: 1, end: 10 }
    }
  });
  ```

  ```go Go theme={null}
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
      },
      Settings: reducto.F(reducto.SettingsParam{
          PageRange: reducto.F(reducto.PageRangeParam{
              Start: reducto.F(int64(1)),
              End:   reducto.F(int64(10)),
          }),
      }),
  })
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "settings": {
        "page_range": {"start": 1, "end": 10}
      }
    }'
  ```
</CodeGroup>
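
Page ranges also make it possible to process a long document in batches. The sketch below generates the `page_range` values for each batch. `page_batches` is a hypothetical helper, and it assumes page numbers are 1-indexed with an inclusive `end`, which matches the `{"start": 1, "end": 10}` example but is not confirmed here.

```python
def page_batches(total_pages: int, batch_size: int) -> list:
    """Split 1..total_pages into consecutive {"start", "end"} ranges (inclusive)."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    return [
        {"start": start, "end": min(start + batch_size - 1, total_pages)}
        for start in range(1, total_pages + 1, batch_size)
    ]
```

Each dictionary can be passed as `settings={"page_range": batch}` in a separate parse call.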

### Return Images

Get image URLs for figures and tables in the document.

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      settings={
          "return_images": ["figure", "table"]
      }
  )

  # Access images from blocks
  for chunk in result.result.chunks:
      for block in chunk.blocks:
          if block.image_url:
              print(f"{block.type}: {block.image_url}")
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    settings: {
      return_images: ['figure', 'table']
    }
  });

  // Access images from blocks
  for (const chunk of result.result.chunks) {
    for (const block of chunk.blocks) {
      if (block.image_url) {
        console.log(`${block.type}: ${block.image_url}`);
      }
    }
  }
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "settings": {
        "return_images": ["figure", "table"]
      }
    }'
  ```
</CodeGroup>

Options: `["figure"]`, `["table"]`, or `["figure", "table"]`. By default, no images are returned.

### Additional Settings

| Setting              | Default | Description                                                     |
| -------------------- | ------- | --------------------------------------------------------------- |
| `persist_results`    | `false` | Keep results indefinitely instead of expiring after 24 hours    |
| `timeout`            | `null`  | Custom timeout in seconds for processing                        |
| `force_url_result`   | `false` | Always return results as a URL (useful for consistent handling) |
| `embed_pdf_metadata` | `false` | Embed OCR metadata into returned PDF                            |

```python theme={null}
result = client.parse.run(
    input=upload.file_id,
    settings={
        "persist_results": True,
        "timeout": 120,
        "force_url_result": True
    }
)
```

<Note>
  For complete configuration reference including OCR settings, spreadsheet options, and more, see the [Configuration section](/configs/parse/ocr-settings).
</Note>

***

## Troubleshooting

<AccordionGroup>
  <Accordion title="Tables look wrong">
    Try `formatting.table_output_format: "html"`. HTML handles merged cells and complex headers better than Markdown.

    Still broken? Enable `enhance.agentic: [{"scope": "table"}]` to use an LLM for alignment fixes.
  </Accordion>

  <Accordion title="Response is slow">
    Main causes:

    * `enhance.agentic` improves accuracy but adds latency
    * `enhance.summarize_figures` adds latency on documents with figures
    * Processing time scales linearly with document size
    * For async jobs, set `async_priority` to `true` for faster, prioritized processing

    For fastest processing, disable what you don't need. See [Best Practices](/parse/best-practices).
  </Accordion>

  <Accordion title="Password-protected PDF">
    <CodeGroup>
      ```python Python theme={null}
      result = client.parse.run(
          input=upload.file_id,
          settings={"document_password": "your-password"}
      )
      ```

      ```javascript Node.js theme={null}
      const result = await client.parse.run({
        input: upload.file_id,
        settings: { document_password: 'your-password' }
      });
      ```

      ```bash cURL theme={null}
      curl -X POST https://platform.reducto.ai/parse \
        -H "Authorization: Bearer $REDUCTO_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{
          "input": "reducto://your-file-id",
          "settings": {"document_password": "your-password"}
        }'
      ```
    </CodeGroup>
  </Accordion>

  <Accordion title="Response is a URL instead of content">
    Large documents return `result.type: "url"` instead of inline content to avoid HTTP size limits. Fetch the content:

    <CodeGroup>
      ```python Python theme={null}
      import requests

      if result.result.type == "url":
          chunks = requests.get(result.result.url).json()
      else:
          chunks = result.result.chunks
      ```

      ```javascript Node.js theme={null}
      let chunks;
      if (result.result.type === 'url') {
        const response = await fetch(result.result.url);
        chunks = await response.json();
      } else {
        chunks = result.result.chunks;
      }
      ```

      ```bash cURL theme={null}
      # If result.type is "url", fetch from the URL
      curl -s "$RESULT_URL" | jq '.chunks'
      ```
    </CodeGroup>

    To always get a URL (consistent handling): `settings.force_url_result: true`
  </Accordion>
</AccordionGroup>
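
For the URL-result case, a single loader keeps the rest of your code independent of `result.type`. This sketch works on the `result` object as a plain dictionary and uses only the standard library. `load_chunks` and `fetch_json` are hypothetical helpers, and the shape of the JSON stored at the URL is an assumption, so the loader accepts either a bare list of chunks or an object with a `chunks` key.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str):
    """Download and decode a JSON document."""
    with urlopen(url) as response:
        return json.load(response)


def load_chunks(result: dict, fetch=fetch_json) -> list:
    """Return the chunk list for both inline ("full") and "url" results."""
    if result["type"] == "url":
        payload = fetch(result["url"])
        # Assumption: the payload is either {"chunks": [...]} or a bare list.
        return payload["chunks"] if isinstance(payload, dict) else payload
    return result["chunks"]
```

The `fetch` parameter lets you swap in a client with retries or authentication, and makes the function easy to test without network access.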

***

## Next Steps

<CardGroup cols={2}>
  <Card title="Response Format" icon="brackets-curly" href="/parse/response-format">
    Full breakdown of chunks, blocks, and bounding boxes.
  </Card>

  <Card title="Best Practices" icon="gauge-high" href="/parse/best-practices">
    Optimization by document type, latency tips.
  </Card>
</CardGroup>
