# Parse Pipeline

> Turn documents into structured data in Studio

Parse converts documents into structured data. It identifies text blocks, tables, figures, headers, and key-value pairs, and returns them as markdown with coordinates for each element. The output is designed for downstream processing, such as feeding an LLM, building a search index, or passing results to Extract to pull out specific fields.

<Frame caption="Parse pipeline in Studio showing a financial statement with detected regions">
  <img src="https://mintcdn.com/reducto/HIMlsEXojApe7G_x/images/example_parse_pipeline.png?fit=max&auto=format&n=HIMlsEXojApe7G_x&q=85&s=e3ded297a40fd48b5e483f5e6c46a0c3" alt="Example Parse Pipeline" width="3264" height="1576" data-path="images/example_parse_pipeline.png" />
</Frame>

## What Parse extracts

Parse breaks your document into **chunks**, each representing a semantic unit of content:

* **Text blocks**: Paragraphs and body text, preserving reading order across columns
* **Tables**: Structured data with rows and columns, output as markdown, HTML, or JSON
* **Figures**: Images, charts, and diagrams with optional AI-generated descriptions
* **Headers**: Section titles with hierarchy levels for document structure
* **Key-value pairs**: Form-like content where a label maps to a value
* **Footers**: Page numbers, disclaimers, and repeated bottom-of-page content

Each chunk includes bounding box coordinates linking it back to its source location. In Studio, these appear as colored overlays on your document:

<Frame caption="Bounding boxes showing detected content types">
  <img src="https://mintcdn.com/reducto/Ya0_HQxb9L1990IV/parsed-doc-colors.png?fit=max&auto=format&n=Ya0_HQxb9L1990IV&q=85&s=734f6f1eb8eb55912df3b80d5e3bbb0c" alt="Bounding box colors" width="915" height="750" data-path="parsed-doc-colors.png" />
</Frame>

Click any box to jump to its output, or click output text to highlight its source.

## When to adjust configurations

The default configuration handles most documents well. The **Configurations** tab lists every available option, grouped into five categories:

**Enhance** — Enable agentic processing when documents aren't parsing correctly or the text contains OCR errors. Agentic processing adds a vision model pass that verifies and corrects the parsed output. This improves accuracy but also increases cost and latency.

**Retrieval** — Configure chunking for RAG pipelines. The default may produce segments too large or small for your embedding model. Set chunking mode to `variable` with a target size around 500-1000 characters.

**Formatting** — Control output structure. Switch table format to `html` or `json` for programmatic use. Formatting also includes configurations for detecting attributes like change tracking, hyperlinks, highlights, and more.

**Spreadsheet** — Handle Excel and CSV files. Control whether cell colors, formulas, dropdowns, and hidden items are included in the output, and configure table clustering detection.

**Settings** — Core processing controls such as specifying page ranges, returning images, and forcing a URL result.

See [Parse Configurations](/configs/overview) for the complete reference.
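To judge whether the Retrieval settings need adjusting, it can help to measure the chunks you actually get back. The sketch below assumes you have copied the chunk list out of the JSON view and that each chunk carries a `content` string; the helper name `chunks_outside_target` and the sample data are hypothetical, and the 500-1000 character range is the target mentioned above.

```python
def chunks_outside_target(chunks, min_chars=500, max_chars=1000):
    """Return (index, length) for each chunk whose content length
    falls outside the target character range."""
    return [
        (i, len(chunk["content"]))
        for i, chunk in enumerate(chunks)
        if not min_chars <= len(chunk["content"]) <= max_chars
    ]


# Illustrative data only: three chunks of 120, 750, and 1800 characters.
sample_chunks = [
    {"content": "a" * 120},
    {"content": "b" * 750},
    {"content": "c" * 1800},
]

outliers = chunks_outside_target(sample_chunks)
for index, length in outliers:
    print(f"chunk {index}: {length} characters")
```

If many chunks land outside the range, that is a hint to revisit the chunking mode or target size before wiring the output into an embedding step.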

## Working with results

The Results tab shows parsed output as formatted markdown by default. The toolbar offers several options:

* **Copy** — Copy the output to your clipboard
* **Download** — Save results as a file
* **JSON** — Toggle to see the raw API response structure

The JSON view is useful for understanding exactly what your code will receive. Each chunk includes:

```json
{
  "type": "text",
  "content": "Your extracted text here...",
  "bbox": {
    "left": 0.05,
    "top": 0.12,
    "width": 0.4,
    "height": 0.03,
    "page": 0
  }
}
```

Bounding boxes use normalized coordinates (0-1 range relative to page dimensions), making them consistent across different document sizes.
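To draw or crop a region, scale the normalized values by the rendered page size. The following is a minimal sketch, assuming the origin is the top-left corner of the page (suggested by the `left` and `top` field names, but not confirmed here); the function name `bbox_to_pixels` and the 1000 x 2000 pixel page size are hypothetical.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a normalized bbox (0-1 range) into pixel coordinates.

    Returns (left, top, right, bottom), assuming a top-left origin.
    """
    left = round(bbox["left"] * page_width)
    top = round(bbox["top"] * page_height)
    right = round((bbox["left"] + bbox["width"]) * page_width)
    bottom = round((bbox["top"] + bbox["height"]) * page_height)
    return left, top, right, bottom


# The chunk shown in the JSON example above.
chunk = {
    "type": "text",
    "content": "Your extracted text here...",
    "bbox": {"left": 0.05, "top": 0.12, "width": 0.4, "height": 0.03, "page": 0},
}

# Hypothetical page rendered at 1000 x 2000 pixels.
pixel_box = bbox_to_pixels(chunk["bbox"], page_width=1000, page_height=2000)
print(pixel_box)
```

Because the stored values are relative, the same chunk can be mapped onto a thumbnail or a high-resolution render by changing only the page dimensions.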

## Processing multiple files

Studio supports batch processing. Add multiple files using the **Add file** button in the file carousel, then check **All Files** before clicking Run to process the entire batch with your current configuration.

This is helpful for testing configurations across a representative sample before deploying. If results vary significantly across documents, you may need to adjust settings or consider whether a single pipeline can handle your document variety.
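One rough way to spot variation across a batch is to compare how many chunks of each type every document produced. This is a sketch under stated assumptions: the results for each file have been collected into a list of chunks with a `type` field, and the file names and type values other than `text` are illustrative rather than taken from real output.

```python
from collections import Counter


def type_counts(chunks):
    """Count how many chunks of each type a document produced."""
    return Counter(chunk["type"] for chunk in chunks)


# Illustrative batch: file names and chunk lists are made up.
batch = {
    "invoice_a.pdf": [{"type": "text"}, {"type": "table"}, {"type": "text"}],
    "invoice_b.pdf": [{"type": "text"}],
}

summary = {name: type_counts(chunks) for name, chunks in batch.items()}
for name, counts in summary.items():
    print(name, dict(counts))
```

A document that yields no table chunks when its siblings do is a good candidate for a closer look in the Results tab.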

## From Parse to Extract

Parse alone gives you the document's content and structure. If you need specific fields—invoice totals, contract dates, patient names—add an **Extract** step. Click **Add** in the pipeline header to chain Parse → Extract, creating a multi-step pipeline you can deploy with a single Pipeline ID.

See [Extract Pipeline](/studio-extract) for schema configuration.

***

## Related

<CardGroup cols={2}>
  <Card title="Parse API" href="/parse/overview">
    API reference and response schema.
  </Card>

  <Card title="Parse Configurations" href="/configs/overview">
    All configuration options with examples.
  </Card>
</CardGroup>
