Metadata-Version: 2.4
Name: asgi-runway
Version: 0.3.1
Summary: Exposes requests-in-queue metrics for ASGI/FastAPI servers to enable accurate autoscaling
License: MIT
Keywords: asgi,autoscaling,fastapi,kubernetes,metrics,prometheus,uvicorn
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: FastAPI
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: HTTP Servers
Requires-Python: >=3.9
Requires-Dist: prometheus-client>=0.17
Requires-Dist: starlette>=0.20
Provides-Extra: dev
Requires-Dist: anyio[trio]>=3.6; extra == 'dev'
Requires-Dist: fastapi>=0.95; extra == 'dev'
Requires-Dist: httpx>=0.24; extra == 'dev'
Requires-Dist: locust>=2.20; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: uvicorn>=0.20; extra == 'dev'
Provides-Extra: fastapi
Requires-Dist: fastapi>=0.95; extra == 'fastapi'
Description-Content-Type: text/markdown

# asgi-runway

**Requests-in-queue metrics for ASGI/FastAPI** — the right signal for pod autoscaling.

---

## Why not CPU / Memory / RPS?

| Signal | Problem |
|---|---|
| CPU / Memory | Reactive. The pod is already overloaded before the metric crosses the threshold. |
| Requests Per Second (RPS) | Misleading. 10 req/s @ 50 ms latency = 0.5 in-flight. 10 req/s @ 5 s latency = 50 in-flight. Same RPS, wildly different load. |
| **Requests In Flight** ✓ | Directly measures backlog. Based on **Little's Law** (L = λW): combines throughput *and* latency into one number. Scale up when this exceeds your pod's target concurrency. |

## Installation

```bash
pip install asgi-runway
```

## Quick start

```python
from fastapi import FastAPI
from asgi_runway import RunwayMiddleware, metrics_router

app = FastAPI()
app.add_middleware(RunwayMiddleware)
app.include_router(metrics_router)  # exposes GET /metrics
```

Or use the one-liner:

```python
from asgi_runway import setup
setup(app)
```

## Metrics exposed

| Metric | Type | Description |
|---|---|---|
| `runway_requests_in_flight` | Gauge | **Primary autoscaling signal.** Requests currently being processed. |
| `runway_requests_in_flight_by_route` | Gauge (labelled) | Per-route-group breakdown (opt-in). |
| `runway_requests_total` | Counter | Total requests by `method` + `status`. |
| `runway_request_duration_seconds` | Histogram | Latency by `method`. |

## Per-route granularity

```python
from asgi_runway import setup

setup(
    app,
    route_groups=[
        (r"^/api/infer", "inference"),   # heavy GPU work
        (r"^/api/embed", "embedding"),   # lighter work
    ],
)
```

This populates `runway_requests_in_flight_by_route{route="inference"}` so you
can scale inference and embedding deployments independently.
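
One plausible way a pattern list like this could resolve a request path to a label is first-match-wins with `re.search`. The sketch below is an illustration only: `resolve_route_group` is a hypothetical helper, not part of asgi-runway's API, and the library's actual matching rules may differ.

```python
import re
from typing import Optional, Sequence, Tuple

# Same shape as the route_groups argument above: (regex, label) pairs.
ROUTE_GROUPS: Sequence[Tuple[str, str]] = [
    (r"^/api/infer", "inference"),
    (r"^/api/embed", "embedding"),
]

# Compile once so the per-request cost is only a few regex searches.
_COMPILED = [(re.compile(pattern), label) for pattern, label in ROUTE_GROUPS]


def resolve_route_group(path: str) -> Optional[str]:
    """Return the label of the first pattern that matches, or None."""
    for pattern, label in _COMPILED:
        if pattern.search(path):
            return label
    return None  # ungrouped paths would only count toward the global gauge
```

Under this assumption, pattern order matters: put the most specific patterns first.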

## Excluding paths

Health checks and the metrics endpoint itself are excluded by default
(`/metrics`, `/healthz`, `/health`). Override with:

```python
app.add_middleware(RunwayMiddleware, exclude_paths=["/ping", "/metrics"])
```

## Finding your autoscaling threshold

The threshold is the maximum number of in-flight requests a single pod can handle before latency degrades. Setting it too high means requests pile up before scaling kicks in; too low means you over-provision.

### The formula (Little's Law)

```
threshold = target_RPS_per_pod × target_p95_latency_in_seconds
```

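Example: you want each pod to serve 50 req/s with p95 < 200 ms. Little's Law strictly uses mean latency, so plugging in the p95 target gives an upper bound on in-flight requests rather than the expected average: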
```
threshold = 50 × 0.2 = 10
```
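
The same arithmetic as a small helper, handy for trying out different targets. `littles_law_threshold` is a hypothetical name used for illustration, and rounding up to a whole number of requests is an assumption made here, not something the library prescribes.

```python
import math


def littles_law_threshold(target_rps_per_pod: float, target_latency_s: float) -> int:
    """In-flight requests implied by Little's Law (L = λW), rounded up."""
    in_flight = target_rps_per_pod * target_latency_s
    # Round to 9 decimals first so float noise cannot push the result
    # up to the next integer.
    return max(1, math.ceil(round(in_flight, 9)))


print(littles_law_threshold(50, 0.2))  # 50 req/s at 200 ms -> 10
print(littles_law_threshold(10, 5.0))  # 10 req/s at 5 s    -> 50
```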

### Finding it empirically with the sweep tool

The repo includes a sweep tool that steps through increasing concurrency levels, measures latency at each, and tells you where your server saturates:

```bash
pip install "asgi-runway[dev]"

# For async I/O workloads (DB queries, external API calls)
# Won't saturate — prints the Little's Law formula for your numbers
python -m examples.load_test --mode sweep

# For CPU-bound workloads (sync routes, heavy computation)
# Will saturate — prints the recommended threshold directly
python -m examples.load_test --mode sweep --workload cpu --duration-ms 200
```

Example output for a CPU-bound endpoint:

```
  conc      p50      p95      p99    throughput  status
  ──────────────────────────────────────────────────────────────
       1   0.152s   0.152s   0.152s     6.6 req/s  ✓ ok
       2   0.167s   0.215s   0.215s     9.3 req/s  ⚠ degrading
       4   0.271s   0.340s   0.340s    11.8 req/s  ✗ saturated
       8   0.358s   0.584s   0.584s    13.7 req/s  ⚠ degrading
      16   0.901s   1.184s   1.184s    13.5 req/s  ⚠ degrading

  Saturation point : ~4 concurrent requests
  Recommended autoscaling threshold : 3  (75% of saturation)
```

Saturation is detected when p95 crosses 2× its baseline. The recommended threshold is 75% of the saturation point, which leaves headroom for new pods to start (often 30–60 s) before the existing ones hit the wall.
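
The detection rule can be sketched in a few lines. This illustrates the stated rule only (p95 above 2× baseline, threshold at 75% of that point) and is not the sweep tool's actual source. `find_saturation` and its parameters are hypothetical names, and taking the baseline from the lowest concurrency level is an assumption.

```python
import math
from typing import List, Optional, Tuple


def find_saturation(
    samples: List[Tuple[int, float]],  # (concurrency, p95 seconds), ascending
    factor: float = 2.0,
    headroom: float = 0.75,
) -> Optional[Tuple[int, int]]:
    """Return (saturation_concurrency, recommended_threshold), or None."""
    baseline_p95 = samples[0][1]  # assumed: p95 at the lowest concurrency level
    for concurrency, p95 in samples[1:]:
        if p95 > factor * baseline_p95:
            threshold = max(1, math.floor(concurrency * headroom))
            return concurrency, threshold
    return None  # p95 never crossed the limit: no saturation observed


# p95 values taken from the example output above.
sweep = [(1, 0.152), (2, 0.215), (4, 0.340), (8, 0.584), (16, 1.184)]
print(find_saturation(sweep))  # (4, 3)
```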

### By workload type

| Workload | How to find threshold |
|---|---|
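| **Async I/O** (DB, HTTP calls) | `target_RPS_per_pod × target_p95_latency_in_seconds`. Run the sweep to confirm there is no saturation. |
| **CPU-bound** (sync routes) | Run the sweep with `--workload cpu`. The thread pool size is an upper bound (`min(32, cpu_count + 4)` for Python's default `ThreadPoolExecutor`; 40 for the AnyIO pool that Starlette uses for sync routes), and the measured value is often lower, as in the example above (~4). |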
| **ML inference** (GPU) | Usually equals your batch size × pipeline depth. Measure with sweep against your real model endpoint. |

### The right KEDA query

Don't use raw `sum(runway_requests_in_flight)` — that scales based on total load, not per-pod load. Use average:

```
sum(runway_requests_in_flight) / count(up{job="your-app"})
```

This means: "scale when the average pod is handling more than N requests", which is what you actually want.
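
To see how a per-pod average turns into a replica count, the sketch below applies the proportional rule documented for the Kubernetes HPA, `desired = ceil(current × value / target)`. It assumes the autoscaler compares the query result directly against the threshold. `desired_replicas` is a hypothetical helper, and real autoscalers add tolerances and stabilization windows that are ignored here.

```python
import math


def desired_replicas(current_replicas: int, avg_in_flight: float, threshold: float) -> int:
    """Proportional scaling: grow or shrink by the ratio of load to target."""
    return max(1, math.ceil(current_replicas * avg_in_flight / threshold))


# 4 pods with 60 requests in flight in total -> 15 per pod on average.
total_in_flight, pods = 60, 4
print(desired_replicas(pods, total_in_flight / pods, threshold=10))  # 6
```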

## Kubernetes autoscaling

### KEDA (recommended)

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-api-scaledobject
spec:
  scaleTargetRef:
    name: my-api
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
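    - type: prometheus
      # The query below already returns a per-pod average, so compare it
      # directly. KEDA's default metricType (AverageValue) would divide the
      # result by the replica count a second time.
      metricType: Value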
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: runway_requests_in_flight
        # Scale when the average pod exceeds 10 in-flight requests.
        # Use the per-pod average, not raw sum — see "Finding your threshold" above.
        query: sum(runway_requests_in_flight) / count(up{job="my-api"})
        threshold: "10"
```

### Kubernetes HPA with custom metrics

```yaml
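apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa          # example name
spec:
  scaleTargetRef:           # example target; point this at your own workload
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: External        # needs an external metrics adapter, e.g. prometheus-adapter
      external:
        metric:
          name: runway_requests_in_flight
        target:
          type: AverageValue
          averageValue: "10"   # target 10 in-flight per pod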

## Decoupling the metrics server

When the application is overloaded, its event loop may be saturated, causing
`/metrics` scrape requests to time out precisely when you need them most —
right before the autoscaler would fire.

The solution is to serve metrics from a server that does not share the
application's event loop. asgi-runway offers two options depending on your
deployment.

### Option A — Embedded metrics thread (plain Docker / EC2 / single container)

Pass `metrics_port` to `setup()`. A `ThreadingHTTPServer` starts in a
background daemon thread — no uvicorn, no asyncio, fully independent:

```python
from asgi_runway import setup

setup(app, metrics_port=9091, include_metrics_route=False)
```

- Prometheus scrapes port `9091`. App traffic goes to port `8000`.
- The metrics thread is isolated from the event loop, so it cannot be
  blocked by in-flight application requests.
- Works for both single-process uvicorn and multiprocess (gunicorn + uvicorn
  workers). Only the multiprocess case needs a shared
  `PROMETHEUS_MULTIPROC_DIR`.

```
┌─────────────────────────────────────────────────┐
│  Single container                               │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  uvicorn (port 8000) │  ← app traffic        │
│  │  asyncio event loop  │                       │
│  └──────────────────────┘                       │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  metrics thread      │  ← Prometheus scrapes │
│  │  (port 9091)         │    this port          │
│  │  ThreadingHTTPServer │                       │
│  └──────────────────────┘                       │
└─────────────────────────────────────────────────┘
```

### Option B — Sidecar process (Docker Compose / Kubernetes / ECS)

Run the exporter as a separate container alongside the app. Both containers
share `PROMETHEUS_MULTIPROC_DIR` as a mounted volume. The exporter reads the
metric files and serves them on its own port — no code in the app server.

```bash
# Requires PROMETHEUS_MULTIPROC_DIR to be set and shared
python -m asgi_runway.exporter --port 9091
```

```
┌────────────────────────────────────────────────────────────┐
│  Pod / task                                                │
│                                                            │
│  ┌─────────────────────┐   ┌──────────────────────────┐   │
│  │  uvicorn (port 8000)│   │  runway-exporter (9091)  │   │
│  │  RunwayMiddleware   │   │  python -m               │   │
│  │  writes metric files│   │  asgi_runway.exporter    │   │
│  │  to shared volume ──┼───┼──► reads metric files    │   │
│  └─────────────────────┘   └──────────────────────────┘   │
│           │                           │                    │
│      app traffic                 Prometheus scrapes        │
└────────────────────────────────────────────────────────────┘
```

**Docker Compose:**

```yaml
version: "3"
services:
  app:
    image: my-api
    ports:
      - "8000:8000"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: uvicorn app:app --host 0.0.0.0 --port 8000

  runway-exporter:
    image: my-api          # same image, different entrypoint
    ports:
      - "9091:9091"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: python -m asgi_runway.exporter --port 9091

volumes:
  prom_data:
```

**Kubernetes sidecar container:**

```yaml
containers:
  - name: app
    image: my-api
    ports:
      - containerPort: 8000
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

  - name: runway-exporter
    image: my-api
    command: ["python", "-m", "asgi_runway.exporter", "--port", "9091"]
    ports:
      - containerPort: 9091
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

volumes:
  - name: prom-dir
    emptyDir: {}
```

> **Which option to use?**
> - Single container (EC2, plain Docker): use **Option A** (`metrics_port`).
> - Multiple containers (Docker Compose, Kubernetes, ECS): use **Option B**
>   (sidecar) with a shared volume, so that gunicorn workers across the pod
>   are all aggregated by the exporter.

## Multi-process mode (gunicorn + uvicorn workers)

Set the env var before starting the server — `prometheus_client` handles the rest:

```bash
export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc
mkdir -p "$PROMETHEUS_MULTIPROC_DIR"
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker
```

`runway_requests_in_flight` will be automatically summed across all workers
(`multiprocess_mode="livesum"`).

## How it works

`RunwayMiddleware` is a raw ASGI middleware (not `BaseHTTPMiddleware`, which
has known streaming issues). It wraps every request:

```
request arrives → REQUESTS_IN_FLIGHT.inc()
       ↓
  app processes
       ↓
response sent → REQUESTS_IN_FLIGHT.dec()
             → REQUESTS_TOTAL.inc()
             → REQUEST_DURATION_SECONDS.observe()
```

The `try/finally` block ensures the gauge is decremented even if the handler
raises an exception.
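
The pattern looks roughly like the sketch below. It is a simplified illustration, not the library's source: `InFlightMiddleware` is a hypothetical class, and plain attributes stand in for the Prometheus gauge, counter, and histogram.

```python
import asyncio
import time


class InFlightMiddleware:
    """Raw ASGI middleware that tracks requests currently being handled."""

    def __init__(self, app):
        self.app = app
        self.in_flight = 0   # stand-in for the in-flight gauge
        self.completed = 0   # stand-in for the request counter
        self.durations = []  # stand-in for the latency histogram

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return
        self.in_flight += 1
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            # Runs on success, on exceptions, and on cancellation.
            self.in_flight -= 1
            self.completed += 1
            self.durations.append(time.perf_counter() - start)


async def failing_app(scope, receive, send):
    raise RuntimeError("handler blew up")


async def demo() -> int:
    middleware = InFlightMiddleware(failing_app)
    try:
        await middleware({"type": "http"}, None, None)
    except RuntimeError:
        pass  # the exception still propagates to the server
    return middleware.in_flight


print(asyncio.run(demo()))  # 0: the gauge was decremented despite the error
```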

> **Note:** `runway_requests_in_flight` only measures requests that have
> entered the middleware. Requests dropped by the OS TCP backlog or rejected
> by uvicorn's `--limit-concurrency` are invisible to it. See
> [docs/request-limits.md](docs/request-limits.md) for the full picture,
> including recommended production values for all three layers.
