
# Concurrency Throttle

> How Reducto queues parallel parse work per account, how baselines vary by tier, and when to expect elevated P90 latency

Reducto enforces two independent limit mechanisms. This page covers the **concurrency throttle**. For the per-second request rate caps that return `429` at the edge, see [Rate Limits](/reference/rate-limits).

| Mechanism                             | What it limits                                     | Behavior on exceeded                      | Returns            |
| ------------------------------------- | -------------------------------------------------- | ----------------------------------------- | ------------------ |
| [Rate limits](/reference/rate-limits) | Requests per second to the API                     | Request is rejected at the ingress        | `429`              |
| **Concurrency throttle**              | Parse batches running in parallel for your account | Work queues until a slot frees, then runs | `200` (after wait) |

If you submit more parse work than your account's concurrency ceiling allows, Reducto **queues** the excess rather than rejecting it. You see added latency, not a 4xx error.

## Your Ceiling

```
ceiling = earned_base + burst_headroom
```

* `earned_base`: capacity sized for your sustained recent traffic, starting from your tier baseline.
* `burst_headroom`: short-term slack on top of `earned_base` so a sudden spike does not immediately queue.

The unit is **concurrent batches**. Reducto splits a parse job into one or more batches, typically around 10 pages each. A 5-page document runs as a single batch. A 200-page document runs as roughly 20 concurrent batches.
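To size a workload in the same units, you can estimate batch counts from page counts. The sketch below assumes exactly 10 pages per batch, which is only the typical figure quoted above; Reducto's actual split may differ, so treat the result as an estimate. `estimate_batches` and `total_batches` are hypothetical helpers, not part of the Reducto SDK.

```python
import math

# Assumption: ~10 pages per batch, the typical figure quoted above.
# Reducto's real batching may differ from document to document.
PAGES_PER_BATCH = 10


def estimate_batches(page_count: int, pages_per_batch: int = PAGES_PER_BATCH) -> int:
    """Estimate how many concurrent batches one parse job occupies."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Round up: a partial batch still occupies a whole slot.
    return math.ceil(page_count / pages_per_batch)


def total_batches(page_counts: list[int]) -> int:
    """Estimate the batches needed to run every document at once."""
    return sum(estimate_batches(p) for p in page_counts)


print(estimate_batches(5))          # 1
print(estimate_batches(200))        # 20
print(total_batches([5, 200, 37]))  # 1 + 20 + 4 = 25
```

Rounding up reflects that a short document still takes a full batch slot, so many small files can consume more concurrency than their page total suggests.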

## Tier Baselines

The baseline is the starting allocation for `earned_base` in the region you're hitting. With little or no recent traffic, your ceiling sits around this baseline; sustained traffic raises it above the baseline. Baselines vary by region because shared compute capacity is sized for the typical regional load.

| Tier       | US            | EU            | AU            |
| ---------- | ------------- | ------------- | ------------- |
| Standard   | 200           | 60            | 10            |
| Growth     | 350           | 120           | 20            |
| Enterprise | 500+ (custom) | 275+ (custom) | 115+ (custom) |

*All values are in concurrent batches. Your actual ceiling at any moment is higher than the baseline (burst headroom sits on top) and grows further with sustained traffic. Enterprise baselines are negotiable upward; [contact sales](mailto:support@reducto.ai) to discuss.*

Multi-region customers get the tier's baseline in *each* region they hit.
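For capacity planning, the table can be encoded as a lookup. This sketch treats the Enterprise figures as minimums (real Enterprise values are custom) and compares planned in-flight batches against the baseline only. Because your live ceiling adds burst headroom and earned capacity, exceeding the baseline means queuing is possible, not certain. `TIER_BASELINES` and `batches_over_baseline` are hypothetical names.

```python
# Tier baselines in concurrent batches, copied from the table above.
# Enterprise values are minimums; actual Enterprise baselines are custom.
TIER_BASELINES: dict[str, dict[str, int]] = {
    "standard": {"US": 200, "EU": 60, "AU": 10},
    "growth": {"US": 350, "EU": 120, "AU": 20},
    "enterprise": {"US": 500, "EU": 275, "AU": 115},
}


def batches_over_baseline(planned_batches: int, tier: str, region: str) -> int:
    """Return how many planned in-flight batches exceed the tier baseline.

    A positive result means some work *may* queue. The live ceiling adds
    burst headroom and earned capacity, so it is usually higher than this.
    """
    baseline = TIER_BASELINES[tier.lower()][region.upper()]
    return max(0, planned_batches - baseline)


# Baselines apply per region, so check a multi-region plan region by region.
plan = {"US": 180, "EU": 90}
for region, batches in plan.items():
    print(region, batches_over_baseline(batches, "standard", region))
# US 0
# EU 30
```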

## How Earned Capacity Grows

Submit consistent traffic and your ceiling grows above the baseline. Reducto measures your submission rate over a short trailing window and sizes your ceiling for that rate plus burst headroom. When you stop submitting, the ceiling decays back toward the baseline over the same window.

Bursty traffic gets less headroom than steady traffic at the same average rate.
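The trailing-window sizing is internal to Reducto, and its window length is not published. On the client side you can still track your own submission rate over a trailing window and apply Little's law (average in-flight work = arrival rate × average time in system) to estimate how many batches you keep running at once. This is a planning sketch, not Reducto's algorithm: the 60-second window and the 20-second average batch time are assumptions to replace with your own measurements, and `TrailingRate` and `estimated_in_flight` are hypothetical names.

```python
import time
from collections import deque
from typing import Optional


class TrailingRate:
    """Track batches you submitted over a trailing window (client-side only).

    Assumption: window_s is illustrative; Reducto describes its own window
    only as "short" and does not publish its length.
    """

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._events = deque()  # (timestamp, batches) pairs, oldest first

    def record(self, batches: int, now: Optional[float] = None) -> None:
        """Record a submission of `batches` batches at time `now`."""
        now = time.monotonic() if now is None else now
        self._events.append((now, batches))
        self._evict(now)

    def rate(self, now: Optional[float] = None) -> float:
        """Batches submitted per second, averaged over the window."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        return sum(b for _, b in self._events) / self.window_s

    def _evict(self, now: float) -> None:
        # Drop submissions that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()


def estimated_in_flight(rate_per_s: float, avg_batch_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate x average time in system."""
    return rate_per_s * avg_batch_seconds


tracker = TrailingRate(window_s=60.0)
for t in range(60):  # one ~30-page document (3 batches) per second for a minute
    tracker.record(3, now=float(t))
r = tracker.rate(now=59.0)  # 180 batches / 60 s = 3.0 batches/s
print(estimated_in_flight(r, avg_batch_seconds=20.0))  # 60.0
```

Little's law describes averages only. Since bursty traffic earns less headroom, a steady 60 in-flight batches is safer than the same average arriving in spikes.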

## Tenant Throttling

For multi-tenant applications, you can pass `settings.tenant_throttling` on parse requests to cap how much of your account's concurrency any single tenant (one of your own customers, workspaces, or organizations) can consume. Tag each request with the tenant it belongs to:

```json theme={null}
{
  "input": "https://example.com/document.pdf",
  "settings": {
    "tenant_throttling": {
      "tenant_id": "workspace_123",
      "max_share": 0.5
    }
  }
}
```

* `tenant_id`: your identifier for the tenant. Requests with the same id share one tenant-level throttle inside your account.
* `max_share`: the maximum fraction of your account's concurrency ceiling this tenant may use, between 0 (exclusive) and 1. Optional; defaults to `0.5`. You can pass different values for different tenants, for example `0.2` on a backfill tenant's requests to keep more headroom for interactive traffic.

Your account-level concurrency throttle still applies first; the tenant throttle only divides capacity inside it. If `tenant_throttling` is omitted, Reducto uses the existing account-level behavior only.
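A small helper can keep tenant tagging consistent across your codebase. The sketch below builds the documented `settings.tenant_throttling` object and rejects `max_share` values outside the documented range before a request is sent; it treats 1 as an allowed value, which is an assumption. It also estimates a tenant's batch cap as `max_share × ceiling`; Reducto's exact rounding is not documented, so flooring is another assumption. `tenant_settings` and `approx_tenant_cap` are hypothetical names.

```python
import math
from typing import Optional


def tenant_settings(tenant_id: str, max_share: Optional[float] = None) -> dict:
    """Build a `settings` object carrying `tenant_throttling`.

    Omitting max_share lets Reducto apply its documented default (0.5).
    """
    if not tenant_id:
        raise ValueError("tenant_id must be a non-empty string")
    throttle: dict = {"tenant_id": tenant_id}
    if max_share is not None:
        # Documented range: above 0, up to 1 (1 treated as inclusive here).
        if not 0 < max_share <= 1:
            raise ValueError("max_share must be in (0, 1]")
        throttle["max_share"] = max_share
    return {"tenant_throttling": throttle}


def approx_tenant_cap(ceiling: int, max_share: float = 0.5) -> int:
    """Rough per-tenant batch cap for a given account ceiling (flooring assumed)."""
    return math.floor(ceiling * max_share)


body = {
    "input": "https://example.com/document.pdf",
    "settings": tenant_settings("backfill_tenant", max_share=0.2),
}
print(body["settings"])
# {'tenant_throttling': {'tenant_id': 'backfill_tenant', 'max_share': 0.2}}
print(approx_tenant_cap(200, 0.2))  # 40
```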

## Sync vs Async Under Throttle

Requests above your ceiling do not fail. They queue until a slot frees, then run. The wait surfaces differently depending on endpoint type:

* **Async** (`/parse_async`, `/extract_async`, `/split_async`, `/edit_async`). The job is accepted immediately, a `job_id` returned, and the work queued. Latency shows up between submission and webhook delivery, never as a 4xx.
* **Sync** (`/parse`, `/extract`, `/split`, `/edit`). The HTTP request blocks until a slot opens and the job completes. Total response time includes queue wait. The edge has a 15-minute (900s) hard timeout, so a sustained burst on sync endpoints risks the HTTP connection timing out before the job finishes.
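If you do call a sync endpoint, a rough budget check against the 900-second edge timeout can flag requests likely to be cut off. The queue-wait and processing estimates must come from your own measurements, and the 80% safety margin is an arbitrary assumption; `fits_sync_budget` is a hypothetical helper.

```python
EDGE_TIMEOUT_S = 900  # documented hard timeout for sync endpoints


def fits_sync_budget(
    expected_queue_wait_s: float,
    expected_processing_s: float,
    safety_margin: float = 0.8,
) -> bool:
    """True if queue wait plus processing stays under a fraction of the timeout.

    Assumption: safety_margin = 0.8 leaves room for variance; tune it.
    """
    total = expected_queue_wait_s + expected_processing_s
    return total < EDGE_TIMEOUT_S * safety_margin


print(fits_sync_budget(30, 120))   # True: 150 s against a 720 s budget
print(fits_sync_budget(600, 200))  # False: 800 s exceeds the 720 s budget
```

When the check fails, the async variant of the same endpoint avoids the timeout entirely, because queue wait only delays webhook delivery.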

For bursty workloads, use async with [webhooks](/workflows/async-overview) so your client doesn't hold HTTP connections open during queue wait.

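```python theme={null}
import asyncio

from reducto import AsyncReducto


async def submit_burst(files: list[str]) -> list[str]:
    """Submit many parse jobs asynchronously and return their job IDs.

    Work above your concurrency ceiling is queued server-side, so each call
    returns once the job is accepted; results arrive via webhook.
    """
    # The context manager closes the client's HTTP connections on exit.
    async with AsyncReducto() as client:
        # Very large lists can still hit the per-second rate limits (429).
        jobs = await asyncio.gather(*[
            client.parse.run_job(input=f, async_={"webhook": {"mode": "svix"}})
            for f in files
        ])
    return [job.job_id for job in jobs]


# Example: job_ids = asyncio.run(submit_burst(["https://example.com/document.pdf"]))
```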

## Related

<CardGroup cols={2}>
  <Card title="Rate Limits" icon="gauge" href="/reference/rate-limits">
    Per-second request caps at the API edge.
  </Card>

  <Card title="Async Processing" icon="clock" href="/workflows/async-overview">
    Submit jobs and receive results via webhook.
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/workflows/batch-processing">
    Patterns for processing many documents.
  </Card>
</CardGroup>
