Databricks Integration
Learn how to easily incorporate Reducto into your existing Databricks workflow.
Reducto parses documents and returns structured outputs directly within your Databricks workflows. Use our API to extract data from PDFs, forms, and other documents in your object stores, then load the results into your tables or applications for analytics, ML, RAG, or other uses.
This guide uses our Python SDK. The easiest way to try it is in a Databricks notebook with files you already have, but you can call our API from anywhere (triggered jobs, partner tools, or other services) before writing your results to DBFS or using them downstream.
Install Reducto
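As a minimal sketch of this step: in a Databricks notebook the SDK would typically be installed with a `%pip` cell. The package name `reductoai` and import name `reducto` are assumptions here, so confirm them against Reducto's SDK documentation. The small helper below only checks whether the module is importable after installation.

```python
import importlib.util

# In a Databricks notebook cell, the install would typically be:
#   %pip install reductoai
# (package name assumed; confirm it against Reducto's SDK docs)


def sdk_available(module_name: str = "reducto") -> bool:
    """Return True if the given top-level module can be imported here."""
    return importlib.util.find_spec(module_name) is not None


print("Reducto SDK installed:", sdk_available())
```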
Simple example: Parse
Parse returns the full contents of your document. Here’s a simple example where we take an existing file in our Databricks datastore, upload it, and parse it with Reducto.
You can also pass a pre-signed URL instead of uploading a file, for example if your images are already available as links.
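The upload-then-parse flow described above might look like the following sketch. It assumes the SDK exposes `client.upload(file=...)` and `client.parse.run(document_url=...)`. Those names, the environment variable, and the file path are assumptions rather than confirmed API, so check them against the SDK version you install.

```python
from pathlib import Path


def parse_file(client, path):
    """Upload a file from a local or mounted path and parse it."""
    # Assumed SDK calls: client.upload(file=...) returns a reference that
    # client.parse.run(document_url=...) accepts.
    upload = client.upload(file=Path(path))
    return client.parse.run(document_url=upload)


# Usage in a notebook (requires the SDK and an API key; path is hypothetical):
#   import os
#   from reducto import Reducto
#   client = Reducto(api_key=os.environ["REDUCTO_API_KEY"])
#   result = parse_file(client, "/Volumes/main/default/docs/sample.pdf")
```

Taking the client as a parameter keeps the helper easy to reuse across notebooks and jobs.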
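For the URL case, a sketch under the same assumption (that `client.parse.run` accepts a `document_url` argument) simply skips the upload step. The scheme check is an illustrative guard added here, not part of the SDK.

```python
from urllib.parse import urlparse


def parse_from_url(client, url):
    """Parse a document that Reducto can fetch directly from an http(s) URL."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"expected an http(s) URL, got: {url!r}")
    # No upload step: the URL is passed straight to parse (assumed parameter name).
    return client.parse.run(document_url=url)
```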
Simple example: Extract
Extract takes your documents and pulls specific data based on a schema. In this example, we take blood result documents and extract the patient name, their date of birth, and other data from their tests.
As in the previous example, we use files from our Databricks datastore.
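A schema-driven extract call for this blood-results scenario could be sketched as follows. The schema fields are illustrative, and `client.extract.run(document_url=..., schema=...)` is an assumed signature; some SDK versions may name these parameters differently.

```python
from pathlib import Path

# Hypothetical JSON Schema for a blood test report; field names are illustrative.
BLOOD_RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "patient_name": {"type": "string", "description": "Full name of the patient"},
        "date_of_birth": {"type": "string", "description": "Patient date of birth"},
        "tests": {
            "type": "array",
            "description": "One entry per test listed in the report",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "value": {"type": "number"},
                    "unit": {"type": "string"},
                },
                "required": ["name", "value"],
            },
        },
    },
    "required": ["patient_name", "date_of_birth"],
}


def extract_blood_results(client, path, schema=BLOOD_RESULT_SCHEMA):
    """Upload a file and extract the fields described by the schema."""
    upload = client.upload(file=Path(path))
    # Assumed signature: extract takes the uploaded reference plus a JSON Schema.
    return client.extract.run(document_url=upload, schema=schema)
```

Field descriptions in the schema give the extractor extra context about what each value means.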
Extract with binary data
If your documents are stored as binary data in a table column, you can pass those bytes to Reducto as well. This is the same example as before, but we read the bytes directly from a table column.
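One way to sketch the binary-column case: iterate over rows that carry a path and the raw bytes, and upload each as a `(filename, bytes)` pair. The column names `path` and `content` match what Spark's `binaryFile` reader produces. The table name in the usage comment is hypothetical, and the assumption that `client.upload` accepts a `(filename, bytes)` tuple should be verified against the SDK.

```python
def extract_from_rows(client, rows, schema, name_col="path", bytes_col="content"):
    """Run extract once per row; returns a list of (name, response) pairs."""
    results = []
    for row in rows:
        filename = str(row[name_col]).rsplit("/", 1)[-1]
        # bytes(...) also converts the bytearray Spark returns for BinaryType columns.
        upload = client.upload(file=(filename, bytes(row[bytes_col])))
        response = client.extract.run(document_url=upload, schema=schema)
        results.append((row[name_col], response))
    return results


# Usage in a notebook (table name is hypothetical):
#   rows = spark.table("main.default.lab_documents").select("path", "content").collect()
#   results = extract_from_rows(client, rows, schema)
```

Collecting rows to the driver is fine for small batches; larger tables would need batching or a distributed approach.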
Example: Pipeline with parse, then extract
Chain an extract call onto an earlier parse call by passing in the parse job_id. Extract then reuses the existing parse instead of parsing the document a second time.
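The chaining described above can be sketched as below. It assumes the parse response exposes a `job_id` attribute and that extract accepts a `jobid://<job_id>` reference in place of a document URL. Both are assumptions to confirm against the SDK documentation.

```python
from pathlib import Path


def parse_then_extract(client, path, schema):
    """Parse a document once, then run extract against that parse job."""
    upload = client.upload(file=Path(path))
    parsed = client.parse.run(document_url=upload)
    # Assumption: a previous parse is referenced as "jobid://<job_id>",
    # so extract does not parse the document again.
    extracted = client.extract.run(
        document_url=f"jobid://{parsed.job_id}",
        schema=schema,
    )
    return parsed, extracted
```

Returning both responses lets you keep the full parsed text (for example, for RAG) alongside the structured fields.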
Write results to your table
You can write the results from your extraction directly into a structured table in Databricks, so you can slice, filter, and visualize them in notebooks.
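As a sketch of this step, the helpers below flatten one extracted record into one row per test and append the rows to a table. The record shape (keys `patient_name`, `date_of_birth`, `tests`) follows the illustrative blood-results schema rather than a confirmed response format, and the table name in the usage comment is hypothetical.

```python
COLUMNS = ["source_path", "patient_name", "date_of_birth", "test_name", "value", "unit"]
TABLE_SCHEMA = (
    "source_path string, patient_name string, date_of_birth string, "
    "test_name string, value double, unit string"
)


def to_rows(source_path, extracted):
    """Flatten one extracted record (a plain dict) into one row per test."""
    rows = []
    # A record without tests still yields one row, with empty test fields.
    for test in extracted.get("tests") or [{}]:
        value = test.get("value")
        rows.append({
            "source_path": source_path,
            "patient_name": extracted.get("patient_name"),
            "date_of_birth": extracted.get("date_of_birth"),
            "test_name": test.get("name"),
            # Cast to float so ints pass Spark's check for double columns.
            "value": float(value) if value is not None else None,
            "unit": test.get("unit"),
        })
    return rows


def write_rows(spark, rows, table_name):
    """Append rows to a table using an explicit schema."""
    data = [tuple(row[col] for col in COLUMNS) for row in rows]
    spark.createDataFrame(data, schema=TABLE_SCHEMA).write.mode("append").saveAsTable(table_name)


# Usage in a notebook (table name is hypothetical):
#   write_rows(spark, to_rows(path, record), "main.default.blood_results")
```

An explicit schema avoids type-inference failures when a column is entirely null in a batch.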