## When to use batch processing
| Approach | Best for |
|---|---|
| Batch processing (this page) | Processing many documents, need results immediately |
| Webhooks | Fire-and-forget, long documents, notification when done |
| Sequential | Simple scripts, debugging, rate-limited scenarios |
## Async Python (recommended)
Use `AsyncReducto` with `asyncio` for the best performance. A semaphore controls concurrency to avoid overwhelming the API.
### Processing URLs
If your documents are already hosted (S3, a web server, etc.), process the URLs directly.
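A minimal sketch of the pattern, assuming the SDK exposes an awaitable `parse.run(document_url=...)` method (check your installed SDK version for the exact call):

```python
import asyncio

from reducto import AsyncReducto  # import path assumed; adjust to your SDK version


async def parse_urls(urls: list[str], max_concurrent: int = 50) -> list:
    client = AsyncReducto()  # reads the API key from the environment
    semaphore = asyncio.Semaphore(max_concurrent)  # caps in-flight requests

    async def parse_one(url: str):
        async with semaphore:  # wait for a free slot before hitting the API
            return await client.parse.run(document_url=url)  # assumed parse call

    # Create all tasks up front; the semaphore throttles actual concurrency.
    return await asyncio.gather(*(parse_one(u) for u in urls))


results = asyncio.run(parse_urls([
    "https://example.com/report-1.pdf",
    "https://example.com/report-2.pdf",
]))
```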
### Processing local files
For local files, upload first, then parse.
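A sketch under the same assumptions, plus an assumed `client.upload(file=...)` helper whose return value is passed as the document reference:

```python
import asyncio
from pathlib import Path

from reducto import AsyncReducto  # import path assumed


async def parse_files(paths: list[Path], max_concurrent: int = 50) -> list:
    client = AsyncReducto()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def parse_one(path: Path):
        async with semaphore:
            upload = await client.upload(file=path)  # assumed upload helper
            return await client.parse.run(document_url=upload)  # assumed parse call

    return await asyncio.gather(*(parse_one(p) for p in paths))


results = asyncio.run(parse_files(list(Path("docs/").glob("*.pdf"))))
```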
### With progress bar
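One option is `tqdm.asyncio.tqdm.gather`, a drop-in replacement for `asyncio.gather` that renders a bar as tasks finish (same assumed `parse.run` call as above):

```python
import asyncio

from reducto import AsyncReducto  # import path assumed
from tqdm.asyncio import tqdm


async def parse_urls_with_progress(urls: list[str], max_concurrent: int = 50) -> list:
    client = AsyncReducto()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def parse_one(url: str):
        async with semaphore:
            return await client.parse.run(document_url=url)  # assumed parse call

    # tqdm.gather behaves like asyncio.gather but shows completion progress.
    return await tqdm.gather(*(parse_one(u) for u in urls))
```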
### With error handling
Some documents may fail (corrupt files, unsupported formats). Handle errors gracefully so a single failure doesn't cost you the whole batch's results.
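A sketch of per-document error capture, under the same assumed `parse.run` call; each task returns a success or failure record instead of raising:

```python
import asyncio

from reducto import AsyncReducto  # import path assumed


async def parse_urls_safely(urls: list[str], max_concurrent: int = 50) -> list[dict]:
    client = AsyncReducto()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def parse_one(url: str) -> dict:
        async with semaphore:
            try:
                result = await client.parse.run(document_url=url)  # assumed call
                return {"url": url, "ok": True, "result": result}
            except Exception as exc:  # keep the batch alive on per-document failures
                return {"url": url, "ok": False, "error": str(exc)}

    results = await asyncio.gather(*(parse_one(u) for u in urls))
    failures = [r for r in results if not r["ok"]]
    print(f"{len(results) - len(failures)} succeeded, {len(failures)} failed")
    return results
```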
## Sync Python with threading
If you can't use async, use `ThreadPoolExecutor` with the synchronous client.
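A sketch assuming the synchronous client's `parse.run` call and that one client instance can be shared across threads; if your SDK version isn't thread-safe, create a client per worker:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from reducto import Reducto  # synchronous client; import path assumed

client = Reducto()
urls = [
    "https://example.com/report-1.pdf",
    "https://example.com/report-2.pdf",
]


def parse_one(url: str):
    return client.parse.run(document_url=url)  # assumed parse call


results = {}
with ThreadPoolExecutor(max_workers=20) as pool:  # 10-50 workers is a sane range
    futures = {pool.submit(parse_one, url): url for url in urls}
    for future in as_completed(futures):  # collect in completion order
        results[futures[future]] = future.result()
```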
## Batch extraction
The same patterns work for extraction. Define your schema once and apply it to all documents.
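A sketch assuming an `extract.run(document_url=..., schema=...)` method and a JSON Schema-style schema; the field names here are illustrative only:

```python
import asyncio

from reducto import AsyncReducto  # import path assumed

# Define the schema once; shape assumed, see the extraction docs for specifics.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}


async def extract_all(urls: list[str], max_concurrent: int = 50) -> list:
    client = AsyncReducto()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def extract_one(url: str):
        async with semaphore:
            # Assumed extract call: same schema applied to every document.
            return await client.extract.run(document_url=url, schema=invoice_schema)

    return await asyncio.gather(*(extract_one(u) for u in urls))
```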
## JavaScript / TypeScript
The same pattern carries over to JavaScript: bound concurrency with a limiter such as `p-limit`, map each document to a promise, and await the batch with `Promise.all`.

## Saving results
Save results as you process so a crash partway through doesn't lose completed work.
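A sketch that writes one JSON file per document as it completes, so a rerun can skip finished work; the serialization step assumes the response may be a pydantic model:

```python
import asyncio
import json
from pathlib import Path

from reducto import AsyncReducto  # import path assumed

OUT_DIR = Path("results")
OUT_DIR.mkdir(exist_ok=True)


async def parse_and_save(urls: list[str], max_concurrent: int = 50) -> None:
    client = AsyncReducto()
    semaphore = asyncio.Semaphore(max_concurrent)

    async def parse_one(index: int, url: str) -> None:
        out_path = OUT_DIR / f"{index:05d}.json"
        if out_path.exists():  # skip documents already finished on a prior run
            return
        async with semaphore:
            result = await client.parse.run(document_url=url)  # assumed call
        # Serialize pydantic models if that's what the SDK returns.
        payload = result.model_dump() if hasattr(result, "model_dump") else result
        # Write immediately, not at the end of the batch.
        out_path.write_text(json.dumps(payload, default=str))

    await asyncio.gather(*(parse_one(i, u) for i, u in enumerate(urls)))
```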
## Concurrency limits
| Method | Recommended concurrency |
|---|---|
| `AsyncReducto` | 50-200 concurrent requests |
| `ThreadPoolExecutor` | 10-50 workers |
| `run_job()` (webhooks) | Unlimited |
## What about cURL?
Batch processing requires programming constructs (loops, concurrency control, error handling) that aren't practical in cURL. For single-document processing via cURL, see the API reference. For batch workflows without writing code, consider:
- Reducto CLI for scripting
- Studio pipelines for visual configuration
## Best practices
- Use async when possible: `AsyncReducto` is more efficient than threading
- Handle errors gracefully: don't let one failure stop the entire batch
- Save incrementally: write results to disk as they complete
- Monitor progress: use `tqdm` or logging to track progress
- Set reasonable concurrency: start low (20-50) and increase if stable