Multi-lingual Document Processing

Process documents in 60+ languages with automatic language detection. No configuration required.

Sample Document

Download the sample: un-document-spanish.pdf

Supported Languages

Reducto automatically detects and processes these languages:

Region	Languages
European	English, German, French, Spanish, Portuguese, Italian, Dutch, Polish, Romanian, Czech, Greek, Hungarian, Swedish, Danish, Finnish, Norwegian, Bulgarian, Croatian, Slovak, Slovenian, Lithuanian, Latvian, Estonian, Albanian, Icelandic, Catalan, Serbian, Macedonian, Belarusian, Ukrainian
Asian	Chinese, Japanese, Korean, Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Thai, Vietnamese, Indonesian, Malay, Filipino/Tagalog, Khmer, Lao, Nepali
Middle Eastern	Arabic, Hebrew, Persian, Turkish
Other	Russian, Armenian, Yiddish, Afrikaans

The standard OCR system handles mixed-language documents automatically. A single document can contain text in multiple languages without any special configuration.

Create API Key

Open Studio

Go to studio.reducto.ai and sign in. From the home page, click API Keys in the left sidebar.

Screenshot: Studio home page with API Keys in the sidebar

View API Keys

The API Keys page shows your existing keys. Click + Create new API key in the top right corner.

Configure Key

In the modal, enter a name for your key and set an expiration policy (or select “Never” for no expiration). Click Create.

Screenshot: New API Key modal with name and expiration fields

Copy Your Key

Copy your new API key and store it securely. You won’t be able to see it again after closing this dialog.

Set the key as an environment variable:

export REDUCTO_API_KEY="your-api-key-here"
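
Before making any calls, it can help to confirm that the variable is visible to your Python process. The following is a minimal sketch of such a check; `require_api_key` is a hypothetical helper written for this guide, not part of the Reducto SDK.

```python
import os


def require_api_key(env=None, name="REDUCTO_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not an SDK function.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running your script")
    return key
```

Calling `require_api_key()` at the top of a script fails early with a readable message instead of surfacing an authentication error on the first request.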

Studio Walkthrough

Upload and Configure OCR

Upload your multilingual document to studio.reducto.ai. In the Parse view, open the Configurations tab to see OCR settings.

Screenshot: Parse view with the Spanish UN document showing OCR settings

Key settings:

Extraction Mode: Use ocr for scanned documents where text is embedded as images. Use hybrid (default) for mixed documents where some pages are native text and others are scans.
OCR System: Keep standard (default) for 60+ language support. The legacy system only supports Germanic languages.

View Extracted Text

Click Run and switch to the Results tab. Reducto extracts text in the original language with proper character encoding.

Screenshot: Parse results showing extracted Spanish text from the UN document

Notice how the Spanish text is extracted accurately, including accented characters (á, é, í, ó, ú, ñ) and proper formatting.

Processing Non-English Documents

Basic Usage

No special configuration is needed; parse as usual:

from reducto import Reducto

client = Reducto()  # picks up REDUCTO_API_KEY from the environment

# Upload Spanish document
with open("documento_español.pdf", "rb") as f:
    upload = client.upload(file=f)

# Parse - language is detected automatically
result = client.parse.run(input=upload.file_id)

# Access extracted text
for chunk in result.result.chunks:
    print(chunk.content)
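
If you want to keep the extracted text, one option is to join the chunks and write them out as UTF-8 so that accented and non-Latin characters survive the round trip. This is a sketch under the assumption that each chunk exposes a `content` string, as in the loop above; `chunks_to_text` and `save_text` are hypothetical helpers, and the sample chunks are stand-ins rather than real API output.

```python
from pathlib import Path
from types import SimpleNamespace


def chunks_to_text(chunks, separator="\n\n"):
    """Join the `content` of each chunk into one string, skipping empty chunks."""
    parts = [c.content.strip() for c in chunks if c.content and c.content.strip()]
    return separator.join(parts)


def save_text(text, path):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    Path(path).write_text(text, encoding="utf-8")


# Stand-in objects that mimic result.result.chunks, for illustration only.
sample_chunks = [
    SimpleNamespace(content="Naciones Unidas"),
    SimpleNamespace(content=""),
    SimpleNamespace(content="Consejo de Seguridad"),
]
combined = chunks_to_text(sample_chunks)
```

With a real response you would pass `result.result.chunks` instead of `sample_chunks`. Setting `encoding="utf-8"` explicitly matters mostly on systems whose default encoding is not UTF-8.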

Output Example

From a Spanish UN Security Council document:

Naciones Unidas
S/2025/856

Consejo de Seguridad

Distr. general
30 de diciembre de 2025
Español
Original: inglés

Carta de fecha 29 de diciembre de 2025 dirigida a la
Presidencia del Consejo de Seguridad por el Secretario General

Tengo el honor de referirme a la resolución 2719 (2023) del Consejo de
Seguridad, por la que el Consejo estableció el marco para financiar las
operaciones de paz...
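
When you compare extracted text against expected strings (for example in a regression test), accented characters can be stored either as a single code point or as a base letter plus a combining mark. The sketch below normalizes both sides to NFC before comparing; `nfc` and `contains_phrase` are hypothetical helper names, and whether any given output uses composed or decomposed forms is not something this guide specifies.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so composed and decomposed accents compare equal."""
    return unicodedata.normalize("NFC", text)


def contains_phrase(extracted, phrase):
    """Case-insensitive containment check after NFC normalization."""
    return nfc(phrase).casefold() in nfc(extracted).casefold()


# The same word written two ways: n + combining tilde vs. the single code point.
decomposed = "Espan\u0303ol"
composed = "Espa\u00f1ol"
```

The two strings render identically but are not equal until normalized, which is why a plain `in` check can miss a match that looks correct on screen.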

OCR Configuration Options

Extraction Modes

Choose the right mode for your document type:

# For scanned documents (images, old PDFs)
result = client.parse.run(
    input=upload.file_id,
    settings={
        "extraction_mode": "ocr"  # Force OCR, ignore embedded text
    }
)

# For native PDFs with embedded text
result = client.parse.run(
    input=upload.file_id,
    settings={
        "extraction_mode": "metadata"  # Use embedded text only
    }
)

# For mixed documents (default)
result = client.parse.run(
    input=upload.file_id,
    settings={
        "extraction_mode": "hybrid"  # Use metadata first, OCR as fallback
    }
)

Mode	Best For	Speed	Accuracy
`hybrid`	Mixed document sets	Fast	High
`ocr`	Scanned documents	Slower	High
`metadata`	Native PDFs	Fastest	Depends on PDF quality
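
The table above can be turned into a small decision helper. This is a sketch, not SDK functionality: `extraction_settings` is a hypothetical function, and the two boolean inputs are assumptions about what you know of a document before parsing it.

```python
def extraction_settings(has_embedded_text, has_scanned_pages):
    """Map what you know about a document to an extraction_mode setting.

    Hypothetical helper that mirrors the table above.
    """
    if has_scanned_pages and not has_embedded_text:
        mode = "ocr"       # fully scanned: force OCR
    elif has_embedded_text and not has_scanned_pages:
        mode = "metadata"  # native PDF: embedded text only
    else:
        mode = "hybrid"    # mixed or unknown: the default
    return {"extraction_mode": mode}
```

The returned dictionary has the same shape as the `settings` argument used in the examples above, so it could be passed as `settings=extraction_settings(False, True)`. Falling back to `hybrid` when the inputs are ambiguous keeps the default behavior.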

OCR System Selection

Always use standard for multilingual support:

result = client.parse.run(
    input=upload.file_id,
    settings={
        "ocr_system": "standard"  # 60+ languages (default)
    }
)

The legacy OCR system only supports Germanic languages (English, German, Dutch, etc.). Always use standard for non-Germanic languages.
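
If some of your pipelines still request the legacy system, a guard like the one below can prevent it from being applied to languages it may not cover. This sketch rests on two assumptions: that the setting value for the legacy system is the string `"legacy"`, and that only the three languages named above are safe (the "etc." in the text means the real list is longer). `ocr_settings` and `LEGACY_LANGUAGES` are hypothetical names.

```python
# Languages the text above names as covered by the legacy system.
# The source says "etc.", so this set is deliberately incomplete (assumption).
LEGACY_LANGUAGES = {"english", "german", "dutch"}


def ocr_settings(languages, prefer_legacy=False):
    """Return an ocr_system setting, allowing legacy only for known-safe languages."""
    wanted = {lang.strip().lower() for lang in languages}
    # Assumption: "legacy" is the value the API expects for the legacy system.
    if prefer_legacy and wanted and wanted <= LEGACY_LANGUAGES:
        return {"ocr_system": "legacy"}
    return {"ocr_system": "standard"}
```

Anything outside the known set, or an empty language list, resolves to `standard`, which matches the recommendation above.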

Mixed-Language Documents

Documents containing multiple languages are handled automatically:

# A document with English headers and Spanish content
result = client.parse.run(input=upload.file_id)

# Both languages are extracted correctly
# No configuration needed

Example: Bilingual Contract

AGREEMENT / ACUERDO

This agreement ("Agreement") is entered into between...
Este acuerdo ("Acuerdo") se celebra entre...

TERMS AND CONDITIONS / TÉRMINOS Y CONDICIONES
1. Definitions / Definiciones
   The following terms shall have the meanings set forth below...
   Los siguientes términos tendrán los significados establecidos a continuación...

Reducto extracts both English and Spanish text accurately.
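
To sanity-check mixed-language output yourself, you can tally which writing systems appear in the extracted text. The sketch below uses only the Python standard library and takes the first word of each character's Unicode name as a rough script label; `script_counts` is a hypothetical helper. It distinguishes scripts, not languages, so English and Spanish both count as `LATIN`.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by script, using the first word of each Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts
```

This is most useful for documents that mix scripts, such as Latin with Cyrillic or Arabic, where a missing script in the tally suggests that part of the page was not extracted.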

Agentic Mode for Difficult Text

Standard OCR works well for clean, printed documents. For challenging documents, such as those with handwriting, faded text, or unusual fonts, agentic mode uses a vision language model to verify and correct OCR output.

result = client.parse.run(
    input=upload.file_id,
    enhance={
        "agentic": [{"scope": "text"}]
    }
)

Use agentic mode when:

Text is handwritten or uses decorative fonts
Document is faded, stained, or low quality
OCR produces garbled output on first pass

Agentic mode costs approximately 2x credits. Use it selectively for documents where standard OCR struggles.
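
One way to use agentic mode selectively is to parse with defaults first and retry only when the output looks garbled. The following is a sketch of that pattern: `garbled_ratio` and `parse_with_fallback` are hypothetical helpers, the 5% threshold is an arbitrary assumption, and the heuristic only catches replacement characters and non-printable characters, not text that is wrong but printable. The `parse.run` arguments are the ones shown in this guide.

```python
def garbled_ratio(text):
    """Fraction of non-space characters that are U+FFFD or non-printable."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 1.0  # treat empty output as fully garbled
    bad = sum(1 for ch in chars if ch == "\ufffd" or not ch.isprintable())
    return bad / len(chars)


def parse_with_fallback(client, file_id, threshold=0.05):
    """Parse with defaults; re-run with agentic text mode if output looks garbled.

    Returns (result, used_agentic).
    """
    result = client.parse.run(input=file_id)
    text = "\n".join(chunk.content for chunk in result.result.chunks)
    if garbled_ratio(text) <= threshold:
        return result, False
    retry = client.parse.run(
        input=file_id,
        enhance={"agentic": [{"scope": "text"}]},
    )
    return retry, True
```

A retried document is parsed twice, so it costs the standard pass plus the agentic pass. If you already know a batch is handwritten or degraded, enabling agentic mode directly avoids the extra first pass.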

Extracting Structured Data

Extract structured data from non-English documents using schemas with descriptive field hints:

# Schema for Spanish invoice
spanish_invoice_schema = {
    "type": "object",
    "properties": {
        "numero_factura": {
            "type": "string",
            "description": "Número de factura / Invoice number"
        },
        "fecha": {
            "type": "string",
            "description": "Fecha de la factura / Invoice date"
        },
        "proveedor": {
            "type": "object",
            "description": "Información del proveedor / Vendor information",
            "properties": {
                "nombre": {"type": "string"},
                "direccion": {"type": "string"},
                "nif": {"type": "string", "description": "NIF/CIF fiscal ID"}
            }
        },
        "cliente": {
            "type": "object",
            "description": "Información del cliente / Customer information",
            "properties": {
                "nombre": {"type": "string"},
                "direccion": {"type": "string"}
            }
        },
        "lineas": {
            "type": "array",
            "description": "Líneas de factura / Line items",
            "items": {
                "type": "object",
                "properties": {
                    "descripcion": {"type": "string"},
                    "cantidad": {"type": "number"},
                    "precio_unitario": {"type": "number"},
                    "importe": {"type": "number"}
                }
            }
        },
        "subtotal": {"type": "number"},
        "iva": {"type": "number", "description": "IVA / VAT amount"},
        "total": {"type": "number"}
    }
}

result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": spanish_invoice_schema}
)

print(f"Factura: {result.result['numero_factura']}")
print(f"Total: €{result.result['total']}")

Include field descriptions in both the source language and English to improve extraction accuracy.
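
After extraction, a quick arithmetic check can flag results that deserve a second look. The sketch below assumes the result is a dictionary shaped like the schema above with every field present, that each `importe` equals `cantidad` times `precio_unitario` (no discounts), that `subtotal` is the sum of the line amounts, and that `total` equals `subtotal` plus `iva`. `check_invoice_totals` is a hypothetical helper and the sample values are made up.

```python
import math


def check_invoice_totals(invoice, tolerance=0.01):
    """Return a list of arithmetic inconsistencies found in an extracted invoice dict."""
    problems = []
    lines = invoice.get("lineas") or []
    for i, line in enumerate(lines, start=1):
        expected = line["cantidad"] * line["precio_unitario"]
        if not math.isclose(expected, line["importe"], abs_tol=tolerance):
            problems.append(f"line {i}: cantidad * precio_unitario != importe")
    line_sum = sum(line["importe"] for line in lines)
    if not math.isclose(line_sum, invoice["subtotal"], abs_tol=tolerance):
        problems.append("sum of line importes != subtotal")
    if not math.isclose(invoice["subtotal"] + invoice["iva"], invoice["total"], abs_tol=tolerance):
        problems.append("subtotal + iva != total")
    return problems


# Made-up sample data shaped like the schema above (not real API output).
sample = {
    "lineas": [
        {"descripcion": "Servicio A", "cantidad": 2, "precio_unitario": 50.0, "importe": 100.0},
        {"descripcion": "Servicio B", "cantidad": 1, "precio_unitario": 25.5, "importe": 25.5},
    ],
    "subtotal": 125.5,
    "iva": 26.36,
    "total": 151.86,
}
```

An empty list means the numbers are internally consistent. A non-empty list does not say which field is wrong, only that the document is worth reviewing or re-running with agentic mode.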

Tips

For best results with non-English documents:

Use high-quality scans (300 DPI minimum) for better OCR accuracy
Enable agentic mode for handwritten or degraded text
Provide bilingual field descriptions in extraction schemas to improve accuracy
Use extraction_mode: "ocr" for scanned documents instead of relying on embedded text
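
To check a scan against the 300 DPI guideline, divide its pixel dimensions by the physical page size in inches. The sketch below does that arithmetic; `effective_dpi` and `meets_minimum_dpi` are hypothetical helpers, and the default page size is US Letter (8.5 x 11 in), which is an assumption. For A4, pass roughly 8.27 x 11.69 in.

```python
def effective_dpi(width_px, height_px, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical dots per inch."""
    return min(width_px / page_width_in, height_px / page_height_in)


def meets_minimum_dpi(width_px, height_px, page_width_in=8.5, page_height_in=11.0, minimum=300):
    """True if a scan of the given pixel size reaches the minimum DPI for the page size."""
    return effective_dpi(width_px, height_px, page_width_in, page_height_in) >= minimum
```

For example, a Letter page scanned at 2550 x 3300 pixels works out to 300 DPI, while 1700 x 2200 pixels is 200 DPI and falls short of the guideline.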

Next Steps

OCR Settings: Full OCR configuration reference
Agentic Modes: AI-enhanced text correction
Batch Processing: Process many documents at scale
Extract API: Structured data extraction
