Parse converts documents into structured data. It identifies text blocks, tables, figures, headers, and key-value pairs, returning everything as markdown with coordinates for each element. The output is designed for downstream processing—whether that’s feeding an LLM, building a search index, or piping into Extract for structured data extraction.
Parse pipeline in Studio showing a financial statement with detected regions

What Parse extracts

Parse breaks your document into chunks, each representing a semantic unit of content:
  • Text blocks: Paragraphs and body text, preserving reading order across columns
  • Tables: Structured data with rows and columns, output as markdown, HTML, or JSON
  • Figures: Images, charts, and diagrams with optional AI-generated descriptions
  • Headers: Section titles with hierarchy levels for document structure
  • Key-value pairs: Form-like content where a label maps to a value
  • Footers: Page numbers, disclaimers, and repeated bottom-of-page content
Each chunk includes bounding box coordinates linking it back to its source location. In Studio, these appear as colored overlays on your document:
Bounding boxes showing detected content types

Click any box to jump to its output, or click output text to highlight its source.

When to adjust settings

The default configuration handles most documents well. The Configurations tab offers two modes:
Simple mode exposes the most common settings

Simple mode provides quick toggles:
  • Contains Handwritten Text: Routes through OCR with AI enhancement
  • Enable AI Summarization: Generates descriptions of figures and charts
  • Return Figure/Table Images: Includes extracted images as URLs
Advanced mode exposes the full API configuration surface

Advanced mode organizes all options into collapsible sections. Here’s when to use each:
  • Enhance: Enable agentic processing when tables aren’t extracting correctly or text has OCR errors. The AI enhancement uses vision models to verify and correct extractions. This increases accuracy but also cost and latency.
  • Retrieval: Configure chunking for RAG pipelines. The default may produce segments too large or small for your embedding model. Set chunking mode to variable with a target size around 500-1000 characters.
  • Formatting: Control output structure. Switch table format to html or json for programmatic use. Enable additional metadata like page numbers or confidence scores.
  • Spreadsheet: Handle Excel and CSV files. Control multi-sheet behavior and whether to include sheet names in output.
  • Settings: Core processing controls. Set extraction mode to ocr for scanned documents, and specify page ranges to process only relevant sections.
See Parse Configurations for the complete reference.
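To judge whether the Retrieval chunking suits your embedding model, one option is to measure the chunk lengths in the parsed output. The sketch below assumes each chunk is a dict with a content string, as shown in the JSON view of the Results tab. chunk_size_report is a hypothetical helper for illustration, not part of the Parse API, and the default bounds simply mirror the 500-1000 character target mentioned above.

```python
from typing import Dict, Iterable


def chunk_size_report(
    chunks: Iterable[dict], min_chars: int = 500, max_chars: int = 1000
) -> Dict[str, int]:
    """Count chunks whose content length is below, inside, or above a target range.

    Hypothetical helper: the "content" key is assumed from the JSON view.
    """
    report = {"too_small": 0, "in_range": 0, "too_large": 0}
    for chunk in chunks:
        length = len(chunk.get("content", ""))
        if length < min_chars:
            report["too_small"] += 1
        elif length > max_chars:
            report["too_large"] += 1
        else:
            report["in_range"] += 1
    return report


# Made-up sample data standing in for a parsed document.
sample = [{"content": "a" * 120}, {"content": "b" * 750}, {"content": "c" * 2400}]
print(chunk_size_report(sample))
```

If most chunks land outside the range, that may be a sign the chunking mode or target size needs adjusting before you build the index.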

Working with results

The Results tab shows parsed output as formatted markdown by default. The toolbar offers several options:
  • Copy — Copy the output to your clipboard
  • Download — Save results as a file
  • JSON — Toggle to see the raw API response structure
The JSON view is useful for understanding exactly what your code will receive. Each chunk includes:
{
  "type": "text",
  "content": "Your extracted text here...",
  "bbox": {
    "left": 0.05,
    "top": 0.12,
    "width": 0.4,
    "height": 0.03,
    "page": 0
  }
}
Bounding boxes use normalized coordinates (0-1 range relative to page dimensions), making them consistent across different document sizes.
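To draw or crop a region, the normalized values need to be scaled by the rendered page size. The following is a minimal sketch, assuming the origin is the top-left corner of the page (suggested by the left and top field names, but not confirmed here) and that you know the pixel dimensions of the page image. bbox_to_pixels is a hypothetical helper, not part of the Parse API.

```python
from typing import Tuple


def bbox_to_pixels(
    bbox: dict, page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized bbox to (left, top, right, bottom) in pixels.

    Assumes a top-left origin; this is an assumption, not documented behavior.
    """
    left = round(bbox["left"] * page_width)
    top = round(bbox["top"] * page_height)
    right = round((bbox["left"] + bbox["width"]) * page_width)
    bottom = round((bbox["top"] + bbox["height"]) * page_height)
    return left, top, right, bottom


# The bbox from the JSON sample above, on a page rendered at 1000 x 2000 pixels.
example_bbox = {"left": 0.05, "top": 0.12, "width": 0.4, "height": 0.03, "page": 0}
print(bbox_to_pixels(example_bbox, 1000, 2000))  # (50, 240, 450, 300)
```

Because the coordinates are relative, the same bbox can be reused at any render resolution by passing different page dimensions.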

Processing multiple files

Studio supports batch processing. Add multiple files using the Add file button in the file carousel, then check All Files before clicking Run to process the entire batch with your current configuration. This is helpful for testing configurations across a representative sample before deploying. If results vary significantly across documents, you may need to adjust settings or consider whether a single pipeline can handle your document variety.

From Parse to Extract

Parse alone gives you the document’s content and structure. If you need specific fields—invoice totals, contract dates, patient names—add an Extract step. Click Add in the pipeline header to chain Parse → Extract, creating a multi-step pipeline you can deploy with a single Pipeline ID. See Extract Pipeline for schema configuration.