Architecture overview
The extraction pipeline consists of three primary stages:
- Structural Analysis - OCR-based text detection and layout understanding
- Coordinate Extraction - Key point detection and spatial mapping
- Semantic Correspondence - Vision-language model validation and data mapping
Stage 1: Structural analysis
The process starts with OCR to detect and segment text elements: axis labels, titles, legends, and data annotations. The system analyzes the spatial positioning of these elements to understand the chart’s structure and establish coordinate system boundaries. Based on this layout analysis, it segments the chart into regions and identifies the primary visualization area. This stage produces:
- Bounding boxes for all text elements
- Axis orientation and scale detection
- Chart type classification
- Coordinate system boundaries
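One way to picture axis scale detection is as a line fit over the OCR'd tick labels. The sketch below is illustrative only: the `TextBox` type and `fit_axis_scale` function are hypothetical names, and it assumes a linear axis whose tick labels parse as plain numbers. It fits `value = slope * pixel + intercept` by least squares over the tick label centers.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A hypothetical OCR result: recognized text plus its pixel bounding box."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def fit_axis_scale(ticks, axis="y"):
    """Fit value = slope * pixel + intercept from numeric tick labels.

    Assumes a linear axis. Labels that do not parse as numbers are skipped.
    """
    idx = 1 if axis == "y" else 0
    points = []
    for box in ticks:
        try:
            value = float(box.text.replace(",", ""))
        except ValueError:
            continue  # not a numeric tick label
        points.append((box.center[idx], value))
    if len(points) < 2:
        raise ValueError("need at least two numeric tick labels")

    n = len(points)
    mean_p = sum(p for p, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in points)
    if var_p == 0:
        raise ValueError("tick labels share the same pixel position")
    slope = sum((p - mean_p) * (v - mean_v) for p, v in points) / var_p
    intercept = mean_v - slope * mean_p
    return slope, intercept


ticks = [
    TextBox("0", 10, 395, 20, 10),
    TextBox("50", 10, 295, 20, 10),
    TextBox("100", 10, 195, 20, 10),
]
slope, intercept = fit_axis_scale(ticks, axis="y")
```

In image coordinates y grows downward, so a conventional value axis yields a negative slope. Logarithmic or categorical axes would need a different model than this linear fit.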
Stage 2: Coordinate extraction
Following structural analysis, the system performs key point detection to extract precise coordinates for data elements. This includes:
- Bar heights and positions in bar charts
- Line segment endpoints and inflection points
- Pie slice boundaries and centroids
- Scatter plot point locations
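For bar charts, turning a detected key point into a data value can be as simple as applying the axis scale to the top edge of the bar. The following is a minimal sketch, not the pipeline's actual code: `Bar` and `bar_to_data` are hypothetical names, and it assumes a linear value axis described by `value = slope * pixel + intercept`, with positive bars that rise from the baseline.

```python
from typing import NamedTuple


class Bar(NamedTuple):
    """A hypothetical detected bar, as a pixel bounding box."""
    left: float
    top: float
    right: float
    bottom: float


def bar_to_data(bar, slope, intercept):
    """Return (horizontal center in pixels, data value at the bar's top edge).

    Assumes a linear value axis and a bar that extends upward from the
    baseline, so the top edge carries the value.
    """
    center_x = (bar.left + bar.right) / 2
    value = slope * bar.top + intercept
    return center_x, value


# Example scale: pixel row 400 maps to 0 and pixel row 200 maps to 100.
position, value = bar_to_data(Bar(100, 250, 140, 400), slope=-0.5, intercept=200.0)
print(position, value)  # 120.0 75.0
```

The horizontal center is kept in pixels because category axes are matched to labels later rather than converted numerically.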
Stage 3: Semantic correspondence
A fine-tuned vision-language model processes the extracted coordinates using mark prompting techniques. This step establishes the correspondence between detected key points and their associated data labels, ensuring accurate mapping between visual elements and their semantic meaning. The model handles common challenges including:
- Ambiguous label-to-data associations
- Overlapping or clustered data points
- Overlapping or clustered data points
- Irregular label positioning
- Multi-series data disambiguation
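To make the mark prompting idea more concrete, here is a rough sketch of the bookkeeping around the model call. Everything in it is an assumption for illustration: the function names, the prompt wording, and the JSON reply format are hypothetical, and the steps of drawing the marks onto the image and calling the model are omitted. It numbers the key points, builds a text prompt that refers to those numbers, and filters the model's reply down to pairs that reference known marks and labels.

```python
import json


def assign_marks(points):
    """Number key points left to right (ties broken top to bottom), from 1."""
    ordered = sorted(points, key=lambda p: (p[0], p[1]))
    return {i + 1: p for i, p in enumerate(ordered)}


def build_prompt(marks, labels):
    """Compose a prompt that refers to the numbered marks (hypothetical wording)."""
    lines = [
        "The chart image has numbered marks drawn on its data elements.",
        "Match each mark to exactly one of the candidate labels.",
        "Candidate labels: " + ", ".join(labels),
        "Marks (id: x, y in pixels):",
    ]
    lines += [f"  {mark_id}: {x:.0f}, {y:.0f}" for mark_id, (x, y) in marks.items()]
    lines.append('Reply with JSON only, e.g. {"1": "label"}.')
    return "\n".join(lines)


def parse_reply(reply, marks, labels):
    """Keep only pairs that reference a known mark id and a known label."""
    raw = json.loads(reply)
    result = {}
    for key, label in raw.items():
        if key.isdigit() and int(key) in marks and label in labels:
            result[int(key)] = label
    return result


marks = assign_marks([(200, 150), (100, 250)])
labels = ["Q1", "Q2"]
prompt = build_prompt(marks, labels)
# A made-up reply standing in for the model's output; mark 3 does not exist.
mapping = parse_reply('{"1": "Q1", "2": "Q2", "3": "Q9"}', marks, labels)
```

Validating the reply against the known marks and labels is a cheap guard against the model inventing an id or a label that was never detected.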
Adaptive processing by chart type
The pipeline adapts its processing strategy based on the detected chart type:
Bar and Line Charts: The coordinate extraction and mapping stages handle most processing, with the VLM primarily validating correspondences. The system leverages the regular structure of these charts for efficient processing.
Pie Charts: The vision-language model takes on additional responsibilities, directly interpreting angular relationships and percentage allocations that would be difficult to capture through coordinate analysis alone.
Complex Visualizations: For stacked charts, heatmaps, and other complex formats, the pipeline dynamically adjusts the balance between rule-based extraction and model-based interpretation.
Output format
The pipeline produces structured data as a Markdown table. Each extraction includes:
- Column headers mapped from axis labels and legends
- Row data containing the extracted values
- Metadata including chart title and data source when available
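As an illustration of what that output could look like, the sketch below renders headers, rows, and an optional title as a Markdown table. The function name `to_markdown_table` and the choice to place the title on a bold line above the table are assumptions, not the pipeline's documented format.

```python
def to_markdown_table(headers, rows, title=None):
    """Render extracted data as a Markdown table, with an optional title line."""

    def cell(value):
        # Escape pipes so cell text cannot break the table structure.
        return str(value).replace("|", "\\|")

    lines = []
    if title:
        lines.append(f"**{title}**")
        lines.append("")
    lines.append("| " + " | ".join(cell(h) for h in headers) + " |")
    lines.append("| " + " | ".join("---" for _ in headers) + " |")
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match header count")
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


table = to_markdown_table(
    ["Quarter", "Revenue"],
    [["Q1", 75.0], ["Q2", 90.5]],
    title="Revenue by quarter",
)
print(table)
```

Rejecting rows whose length differs from the header count surfaces extraction gaps early instead of emitting a malformed table.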