Databricks Integration

Reducto makes it easy to parse documents and send structured outputs directly within your Databricks workflows. Use our API to extract data from PDFs, forms, and more directly from your object stores—then load the results into your tables or applications for analytics, ML, RAG, or other application. This guide utilizes our Python SDK. The easiest way for you to try this is through a Databricks notebook with the files you already have to quickly get results, but you can call our API anywhere (triggered jobs, partner tools, or through other services) before pushing your results to the DBFS or utilizing them downstream. Install Reducto

!pip install reductoai

Simple example: Parse Parse will get the entirety of the contents of your document. Here’s a simple example where we take an existing file in our Databricks datastore, upload it, and parse it with Reducto. You can also use a pre-signed URL for your data, for example if you have image links.

Python

from pathlib import Path
from reducto import Reducto
import pandas as pd

api_key = dbutils.secrets.get(scope="reducto", key="REDUCTO_API_KEY") 
client = Reducto(api_key=api_key)

folder = Path("/Workspace/Users/Reducto/blood_test_results")
records = []

for blood_test in folder.iterdir():
    upload = client.upload(file=blood_test)

    parse_response = client.parse.run(input=upload)

    # We print the JSON response to see the results.
    # You can directly upload this response to your table or use it further downstream.
    print(parse_response)

Simple example: Extract Extract takes your documents and pulls specific data based on a schema. In this example, we take blood result documents and extract the patient name, their date of birth, and other data from their tests. Similar to the previous example, we use the files from our Databricks datastore.

For larger, more complex documents, a system prompt and more prompt tuning may be necessary to get the best extraction results.

from pathlib import Path
from reducto import Reducto
import pandas as pd

api_key = dbutils.secrets.get(scope="reducto", key="REDUCTO_API_KEY") 
client = Reducto(api_key=api_key)

folder = Path("/Workspace/Users/Reducto/blood_test_results")
records = []

for blood_test in folder.iterdir():
    upload = client.upload(file=blood_test)

    response = client.extract.run(
        input=upload,
        instructions={
            "system_prompt": "Be precise and thorough. These are blood test results of varying page lengths and structures. Use visual layout cues such as bold labels, column alignment, and section dividers to interpret structure.",
            "schema": {
                "type": "object",
                "properties": {
                    "patientName": {
                        "type": "string",
                        "description": "The full name of the patient."
                    },
                    "dateOfBirth": {
                        "type": "string",
                        "description": "The date of birth of the patient, formatted as YYYY-MM-DD."
                    },
                    "hemoglobinCount": {
                        "type": "number",
                        "description": "The hemoglobin count in the patient's blood, measured in grams per deciliter."
                    },
                    "redBloodCellCount": {
                        "type": "number",
                        "description": "The count of red blood cells in the patient's blood."
                    },
                    "whiteBloodCellCount": {
                        "type": "number",
                        "description": "The count of white blood cells in the patient's blood."
                    }
                },
                "required": [
                    "patientName",
                    "dateOfBirth",
                    "hemoglobinCount",
                    "redBloodCellCount",
                    "whiteBloodCellCount"
                ]
            }
        }
    )
    if response.result:
        records.extend(response.result)

Extract with binary data If your documents are in the form of binary data within a table, you can use this as well. This is the same example as before, but we read bytes directly from a table column.

import tempfile, os
from pathlib import Path
from reducto import Reducto
import pandas as pd

api_key = dbutils.secrets.get(scope="reducto", key="REDUCTO_API_KEY")
client = Reducto(api_key=api_key)

records = []

spark_df = spark.read.table("lab_results.raw_files") 
rows = spark_df.collect()  

for row in rows:
    file_name = row["file_name"]  
    bytes_ = row["contents"] # binary column

    # Write bytes to a disposable temp file so Reducto can read it
    suffix = Path(file_name).suffix or ".jpg"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(bytes_)
        tmp_path = Path(tmp.name)

    # Everything here on is identical; upload and extract.
    upload = client.upload(file=tmp_path)

    response = client.extract.run(
        input=upload,
        instructions={
            "system_prompt": "Be precise and thorough. These are blood test results of varying page lengths and structures. Use visual layout cues such as bold labels, column alignment, and section dividers to interpret structure.",
            "schema": {
                "type": "object",
                "properties": {
                    "patientName": {
                        "type": "string",
                        "description": "The full name of the patient."
                    },
                    "dateOfBirth": {
                        "type": "string",
                        "description": "The date of birth of the patient, formatted as YYYY-MM-DD."
                    },
                    "hemoglobinCount": {
                        "type": "number",
                        "description": "The hemoglobin count in the patient's blood, measured in grams per deciliter."
                    },
                    "redBloodCellCount": {
                        "type": "number",
                        "description": "The count of red blood cells in the patient's blood."
                    },
                    "whiteBloodCellCount": {
                        "type": "number",
                        "description": "The count of white blood cells in the patient's blood."
                    }
                },
                "required": [
                    "patientName",
                    "dateOfBirth",
                    "hemoglobinCount",
                    "redBloodCellCount",
                    "whiteBloodCellCount"
                ]
            }
        }
    )

    # Remove the temp file
    os.remove(tmp_path)

Example: Pipeline with parse, then extract Chaining extract calls with parse calls by passing in the job_id saves a duplicate parse call.

Not re-running the parse job again on the same document will save 50% on the extract call, too.

from pathlib import Path
from reducto import Reducto
import pandas as pd

api_key = dbutils.secrets.get(scope="reducto", key="REDUCTO_API_KEY") 
client = Reducto(api_key=api_key)

folder = Path("/Workspace/Users/Reducto/blood_test_results")
records = []

for blood_test in folder.iterdir():
    upload = client.upload(file=blood_test)

    parse_response = client.parse.run(
        input=upload,
        retrieval={
            "chunking": {"chunk_mode": "variable"},
            "embedding_optimized": False
        },
        enhance={
            "summarize_figures": False
        }
    )
    # You can use this fully parsed response for something if needed!

    job_id = parse_response.job_id

    response = client.extract.run(
        input=f"jobid://{job_id}",
        instructions={
            "system_prompt": "Be precise and thorough. These are blood test results of varying page lengths and structures. Use visual layout cues such as bold labels, column alignment, and section dividers to interpret structure.",
            "schema": {
                "type": "object",
                "properties": {
                    "patientName": {
                        "type": "string",
                        "description": "The full name of the patient."
                    },
                    "dateOfBirth": {
                        "type": "string",
                        "description": "The date of birth of the patient, formatted as YYYY-MM-DD."
                    },
                    "hemoglobinCount": {
                        "type": "number",
                        "description": "The hemoglobin count in the patient's blood, measured in grams per deciliter."
                    },
                    "redBloodCellCount": {
                        "type": "number",
                        "description": "The count of red blood cells in the patient's blood."
                    },
                    "whiteBloodCellCount": {
                        "type": "number",
                        "description": "The count of white blood cells in the patient's blood."
                    }
                },
                "required": [
                    "patientName",
                    "dateOfBirth",
                    "hemoglobinCount",
                    "redBloodCellCount",
                    "whiteBloodCellCount"
                ]
            }
        }
    )

Write results to your table You can write the results from your extraction directly into a structured table in Databricks, so you can slice, filter, and visualize it in notebooks.

dataframe = pd.DataFrame(records)
spark_df = spark.createDataFrame(dataframe)

spark.sql("CREATE DATABASE IF NOT EXISTS lab_results")
(
    spark_df
    .write
    .mode("append")
    .saveAsTable("lab_results.blood_test_results")
)

print("Loaded", len(records), "records into lab_results.blood_test_results")
display(spark_df)

Conclusion

Integrating Reducto with Databricks gives teams a powerful way to unlock the value hidden in unstructured documents. Whether you’re dealing with medical records, contracts, invoices, or scanned forms, Reducto transforms them into structured, machine-readable data—ready for analytics, AI, or workflow automation. With seamless compatibility with Databricks notebooks, workflows, and tables, you can move from raw document to clean dataset in just a few lines of code.

Get Started

Migration

Examples

Core Functions

Configurations

FAQ

Security and privacy

On-premise deployment

Databricks Integration

Conclusion

Get Started

Migration

Examples

Core Functions

Configurations

FAQ

Security and privacy

On-premise deployment

​Conclusion

Conclusion