Our chart extraction feature uses a multi-stage pipeline that combines OCR with vision-language models (VLMs) to extract structured data from chart images. The pipeline adapts its processing to the detected chart type and returns the extracted data in tabular form for enterprise analytics.

Architecture overview

The extraction pipeline consists of three primary stages:
  1. Structural Analysis - OCR-based text detection and layout understanding
  2. Coordinate Extraction - Key point detection and spatial mapping
  3. Semantic Correspondence - Vision-language model validation and data mapping
[Figure: Chart extraction pipeline architecture]

Stage 1: Structural analysis

The process starts with OCR to detect and segment text elements: axis labels, titles, legends, and data annotations. The system analyzes the spatial positioning of these elements to understand the chart’s structure and establish coordinate system boundaries. Based on this layout analysis, we segment the chart into regions and identify the primary visualization area.

[Figure: OCR text detection and segmentation]

Key outputs from this stage include:
  • Bounding boxes for all text elements
  • Axis orientation and scale detection
  • Chart type classification
  • Coordinate system boundaries
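
To make the scale detection step concrete, here is a minimal sketch of how an axis scale might be estimated from OCR output. It assumes the y-axis tick labels have already been isolated as bounding boxes; the names `TextBox`, `parse_number`, and `fit_y_axis_scale` are hypothetical and not part of the product's API. The sketch fits a least-squares line from pixel position to numeric value and covers linear axes only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TextBox:
    """Hypothetical OCR result: recognized text plus its bounding box in pixels."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center_y(self) -> float:
        return (self.y0 + self.y1) / 2


def parse_number(text: str) -> Optional[float]:
    """Return the numeric value of a tick label, or None if it is not a number."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def fit_y_axis_scale(boxes: List[TextBox]) -> Tuple[float, float]:
    """Fit value = slope * pixel_y + intercept from numeric tick labels.

    Assumption: the axis is linear. Non-numeric boxes are skipped.
    """
    ticks = [(b.center_y, v) for b in boxes
             if (v := parse_number(b.text)) is not None]
    if len(ticks) < 2:
        raise ValueError("need at least two numeric tick labels")

    n = len(ticks)
    mean_p = sum(p for p, _ in ticks) / n
    mean_v = sum(v for _, v in ticks) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in ticks)
    if var_p == 0:
        raise ValueError("tick labels share the same pixel position")

    slope = sum((p - mean_p) * (v - mean_v) for p, v in ticks) / var_p
    intercept = mean_v - slope * mean_p
    return slope, intercept
```

Because image y-coordinates grow downward, the fitted slope is negative for a conventional chart whose values increase toward the top.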

Stage 2: Coordinate extraction

Following structural analysis, the system performs key point detection to extract precise coordinates for data elements. This includes:
  • Bar heights and positions in bar charts
  • Line segment endpoints and inflection points
  • Pie slice boundaries and centroids
  • Scatter plot point locations
[Figure: Key point detection and coordinate extraction]

The coordinate extraction module uses computer vision techniques to identify these elements within the segmented regions established in Stage 1. Each detected point is stored with its pixel coordinates and confidence score.
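
As an illustration of what a stored key point could look like, the sketch below models a detection as a record with pixel coordinates and a confidence score, and keeps only the points that fall inside the plot area from Stage 1. The `KeyPoint` and `Region` types, the `kind` field, and the 0.5 confidence threshold are assumptions made for this example, not the pipeline's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KeyPoint:
    """Hypothetical record for one detected data element."""
    x: float           # pixel column
    y: float           # pixel row
    confidence: float  # detector score in [0, 1]
    kind: str          # e.g. "bar_top", "line_vertex", "scatter_point"


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangle in pixel space, e.g. the primary plot area."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, point: KeyPoint) -> bool:
        return (self.left <= point.x <= self.right
                and self.top <= point.y <= self.bottom)


def filter_key_points(points: List[KeyPoint], plot_area: Region,
                      min_confidence: float = 0.5) -> List[KeyPoint]:
    """Keep points inside the plot area whose score meets the threshold.

    The default threshold is an illustrative value, not a documented setting.
    """
    return [p for p in points
            if plot_area.contains(p) and p.confidence >= min_confidence]
```

Restricting detections to the plot area is one simple way to avoid mistaking legend swatches or title decorations for data points.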

Stage 3: Semantic correspondence

A fine-tuned vision-language model processes the extracted coordinates using mark prompting techniques. This step establishes the correspondence between detected key points and their associated data labels, ensuring accurate mapping between visual elements and their semantic meaning. The model handles common challenges including:
  • Ambiguous label-to-data associations
  • Overlapping or clustered data points
  • Irregular label positioning
  • Multi-series data disambiguation
After validation, the system transforms the coordinates from pixel space to actual data values using the established axis scales from Stage 1.
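
The pixel-to-value transformation can be sketched as linear interpolation between two reference ticks on each axis. This is a minimal example under the assumption that both axes are linear; `AxisScale` and `point_to_data` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AxisScale:
    """Two reference ticks on one axis: pixel position and data value for each."""
    pixel_a: float
    value_a: float
    pixel_b: float
    value_b: float

    def to_value(self, pixel: float) -> float:
        """Linearly interpolate (or extrapolate) a pixel position to a data value."""
        if self.pixel_a == self.pixel_b:
            raise ValueError("reference ticks must have distinct pixel positions")
        t = (pixel - self.pixel_a) / (self.pixel_b - self.pixel_a)
        return self.value_a + t * (self.value_b - self.value_a)


def point_to_data(x_px: float, y_px: float,
                  x_scale: AxisScale, y_scale: AxisScale) -> Tuple[float, float]:
    """Convert one key point from pixel space to data space."""
    return x_scale.to_value(x_px), y_scale.to_value(y_px)


# Example: the x-axis runs from 0 (pixel 100) to 10 (pixel 500);
# the y-axis runs from 0 (pixel 400, bottom) to 100 (pixel 100, top).
x_scale = AxisScale(pixel_a=100, value_a=0, pixel_b=500, value_b=10)
y_scale = AxisScale(pixel_a=400, value_a=0, pixel_b=100, value_b=100)
print(point_to_data(300, 250, x_scale, y_scale))  # (5.0, 50.0)
```

A logarithmic axis would need the same interpolation applied to the logarithm of the values; that case is not covered here.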

Adaptive processing by chart type

The pipeline adapts its processing strategy based on the detected chart type:
  • Bar and Line Charts: The coordinate extraction and mapping stages handle most processing, with the VLM primarily validating correspondences. The system leverages the regular structure of these charts for efficient processing.
  • Pie Charts: The vision-language model takes on additional responsibilities, directly interpreting angular relationships and percentage allocations that would be difficult to capture through coordinate analysis alone.
  • Complex Visualizations: For stacked charts, heatmaps, and other complex formats, the pipeline dynamically adjusts the balance between rule-based extraction and model-based interpretation.
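
One way to picture this adaptive behavior is a small dispatch function that maps the detected chart type to the role the VLM plays. The sketch below is illustrative only: the `ChartType` and `VlmRole` enums and the `select_vlm_role` function are hypothetical, and the real pipeline may balance the two approaches more gradually than a three-way switch suggests.

```python
from enum import Enum


class ChartType(Enum):
    BAR = "bar"
    LINE = "line"
    PIE = "pie"
    STACKED = "stacked"
    HEATMAP = "heatmap"


class VlmRole(Enum):
    VALIDATE = "validate"    # coordinates do most of the work; VLM checks mappings
    INTERPRET = "interpret"  # VLM reads values (angles, percentages) directly
    MIXED = "mixed"          # balance between rule-based and model-based extraction


def select_vlm_role(chart_type: ChartType) -> VlmRole:
    """Pick a processing strategy from the chart type detected in Stage 1."""
    if chart_type in (ChartType.BAR, ChartType.LINE):
        return VlmRole.VALIDATE
    if chart_type is ChartType.PIE:
        return VlmRole.INTERPRET
    # Stacked charts, heatmaps, and other complex formats.
    return VlmRole.MIXED
```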

Output format

The pipeline produces structured data as a Markdown table. Each extraction includes:
  • Column headers mapped from axis labels and legends
  • Row data containing the extracted values
  • Metadata including chart title and data source when available
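
For a sense of what this output could look like, the following sketch renders headers and rows as a Markdown table. The function name `to_markdown_table` and the choice to place the chart title on a `Title:` line above the table are assumptions for illustration; the actual layout of metadata in the pipeline's output is not specified here.

```python
from typing import List, Optional, Sequence


def _cell(value: object) -> str:
    """Render one cell, escaping pipes so they do not break the table."""
    return str(value).replace("|", "\\|")


def to_markdown_table(headers: Sequence[str],
                      rows: Sequence[Sequence[object]],
                      title: Optional[str] = None) -> str:
    """Build a Markdown table from column headers and row data."""
    lines: List[str] = []
    if title:
        # Assumed metadata placement: a plain line above the table.
        lines.append(f"Title: {title}")
        lines.append("")
    lines.append("| " + " | ".join(_cell(h) for h in headers) + " |")
    lines.append("| " + " | ".join("---" for _ in headers) + " |")
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match header count")
        lines.append("| " + " | ".join(_cell(c) for c in row) + " |")
    return "\n".join(lines)


print(to_markdown_table(["Quarter", "Revenue"],
                        [["Q1", 120], ["Q2", 135]],
                        title="Sales"))
```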

Performance characteristics

The system maintains high accuracy across diverse chart formats encountered in enterprise environments. Processing time varies by chart complexity. The pipeline includes built-in validation steps to ensure data quality and consistency across the extraction process.