Metadata-Version: 2.4
Name: aiperf
Version: 0.5.0
Summary: AIPerf is a package for performance testing of AI models
Author-email: "NVIDIA Inc." <sw-dl-dynamo@nvidia.com>
License: Apache-2.0
License-File: ATTRIBUTIONS.md
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: aiofiles~=24.1.0
Requires-Dist: aiohttp~=3.13.3
Requires-Dist: async-timeout>=4.0.0; python_version < '3.11'
Requires-Dist: billiard>=4.2.0
Requires-Dist: cyclopts<5,>=4
Requires-Dist: dash-bootstrap-components~=2.0.0
Requires-Dist: dash~=3.1.0
Requires-Dist: ffmpeg-python~=0.2.0
Requires-Dist: jinja2~=3.1.5
Requires-Dist: jmespath~=1.0.1
Requires-Dist: kaleido~=1.2.0
Requires-Dist: matplotlib>=3.10.0
Requires-Dist: msgspec<1.0.0,>=0.19.0
Requires-Dist: numpy~=1.26.4
Requires-Dist: orjson~=3.10.18
Requires-Dist: pandas~=2.3.3
Requires-Dist: pillow~=11.1.0
Requires-Dist: plotly~=6.4.0
Requires-Dist: prometheus-client~=0.23.1
Requires-Dist: psutil~=7.0.0
Requires-Dist: pyarrow>=18.0.0
Requires-Dist: pydantic-settings<3.0.0,>=2.10.0
Requires-Dist: pydantic<3.0.0,>=2.10.0
Requires-Dist: pyzmq~=26.4.0
Requires-Dist: rich~=14.1.0
Requires-Dist: ruamel-yaml~=0.18.12
Requires-Dist: seaborn~=0.13.2
Requires-Dist: setproctitle~=1.3.6
Requires-Dist: soundfile~=0.13.1
Requires-Dist: textual~=5.3.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: transformers>=4.56.0
Requires-Dist: uvloop~=0.21.0; platform_system != 'Windows'
Provides-Extra: dev
Requires-Dist: black>=25.1.0; extra == 'dev'
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: looptime>=0.5; extra == 'dev'
Requires-Dist: mkinit>=1.1.0; extra == 'dev'
Requires-Dist: pre-commit>=4.2.0; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest-xdist>=3.8.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.0.0; extra == 'dev'
Requires-Dist: scipy>=1.13.0; extra == 'dev'
Requires-Dist: trustme>=1.0.0; extra == 'dev'
Description-Content-Type: text/markdown

<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# AIPerf

[![PyPI version](https://img.shields.io/pypi/v/AIPerf)](https://pypi.org/project/aiperf/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Codecov](https://codecov.io/gh/ai-dynamo/aiperf/graph/badge.svg)](https://codecov.io/gh/ai-dynamo/aiperf)
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/aiperf)


**[Architecture](docs/architecture.md)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Migrating from GenAI-Perf](docs/migrating.md)** | **[CLI Options](docs/cli_options.md)**


AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution.
It reports detailed metrics in a command-line display and in extensive benchmark performance reports.

AIPerf provides multiprocess support out of the box for a single scalable solution.


<!--
======================
Features
======================
-->

<img width="1724" height="670" alt="AIPerf UI Dashboard" src="https://github.com/user-attachments/assets/7eb40867-b1c1-4ebe-bd57-7619f2154bba" />

## Features

- Scalable via multiprocess support
- Modular design for easy user modification
- Several benchmarking modes:
  - concurrency
  - request-rate
  - [request-rate with a maximum concurrency](docs/tutorials/request-rate-concurrency.md)
  - [trace replay](docs/benchmark_modes/trace_replay.md)
- [Public dataset support](docs/benchmark_datasets.md)

<br/>

## Tutorials & Advanced Features

### Getting Started
- **[Basic Tutorial](docs/tutorial.md)** - Learn the fundamentals with Dynamo and vLLM examples

### Load Control & Timing

| Feature | Description | Use Cases |
|---------|-------------|-----------|
| **[Request Rate with Max Concurrency](docs/tutorials/request-rate-concurrency.md)** | Dual control of request timing and concurrent connection ceiling (Poisson or constant modes) | Testing API rate/concurrency limits, avoiding thundering herd, realistic client simulation |
| **[Arrival Patterns](docs/tutorials/arrival-patterns.md)** | Configure traffic patterns (constant, Poisson, gamma) with tunable burstiness | Realistic traffic simulation, stress testing, vLLM-compatible benchmarks |
| **[Prefill Concurrency](docs/tutorials/prefill-concurrency.md)** | Limit concurrent prefill operations to prevent memory exhaustion with long-context workloads | Long-context benchmarking, OOM prevention, memory-safe stress testing |
| **[Gradual Ramping](docs/tutorials/ramping.md)** | Smooth ramp-up of concurrency and request rate over time | Capacity discovery, avoiding cold-start spikes, server warm-up |
| **[Warmup Phase](docs/tutorials/warmup.md)** | Configure pre-benchmark warmup to eliminate cold-start effects | Accurate measurements, JIT warm-up, cache priming |
| **[User-Centric Timing](docs/tutorials/user-centric-timing.md)** | Per-user rate limiting with precise timing for KV cache benchmarking | KV cache effectiveness, multi-user simulation, cache TTL testing |
| **[Request Cancellation](docs/tutorials/request-cancellation.md)** | Test timeout behavior and service resilience | SLA validation, cancellation modeling |
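
The arrival patterns listed above differ only in how the gaps between requests are drawn. The following is a minimal sketch of that idea, not AIPerf's implementation: `inter_arrival_times` is a hypothetical helper, and the gamma parameterization (shape equal to `burstiness`, scale chosen to keep the mean gap at `1/rate`) is an assumption made for illustration.

```python
import random


def inter_arrival_times(rate, n, pattern="poisson", burstiness=1.0, seed=None):
    """Return n gaps (in seconds) between requests, averaging 1/rate.

    Hypothetical helper for illustration only; not part of AIPerf.
    """
    if rate <= 0 or burstiness <= 0:
        raise ValueError("rate and burstiness must be positive")
    rng = random.Random(seed)
    mean_gap = 1.0 / rate
    if pattern == "constant":
        # Evenly spaced requests.
        return [mean_gap] * n
    if pattern == "poisson":
        # A Poisson process has exponentially distributed gaps.
        return [rng.expovariate(rate) for _ in range(n)]
    if pattern == "gamma":
        # Assumed parameterization: shape = burstiness, mean stays 1/rate.
        # burstiness < 1 is burstier than Poisson, > 1 is smoother.
        scale = mean_gap / burstiness
        return [rng.gammavariate(burstiness, scale) for _ in range(n)]
    raise ValueError(f"unknown pattern: {pattern}")


gaps = inter_arrival_times(rate=10.0, n=5, pattern="gamma", burstiness=0.5, seed=0)
```

With `burstiness=1.0` the gamma case reduces to the Poisson case, which is a convenient sanity check when comparing patterns.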

### Workloads & Data

| Feature | Description | Use Cases |
|---------|-------------|-----------|
| **[Trace Benchmarking](docs/tutorials/trace-benchmarking.md)** | Deterministic workload replay with custom datasets | Regression testing, A/B testing |
| **[Custom Prompt Benchmarking](docs/tutorials/custom-prompt-benchmarking.md)** | Send each prompt from your file exactly as-is, without sampling or generation | Regression testing, A/B testing, debugging specific prompts |
| **[Fixed Schedule](docs/tutorials/fixed-schedule.md)** | Precise timestamp-based request execution | Traffic replay, temporal analysis, burst testing |
| **[Time-based Benchmarking](docs/tutorials/time-based-benchmarking.md)** | Duration-based testing with grace period control | Stability testing, sustained performance |
| **[Sequence Distributions](docs/tutorials/sequence-distributions.md)** | Mixed input/output sequence length (ISL/OSL) pairings | Benchmarking mixed use cases |
| **[Random Number Generation & Reproducibility](docs/reproducibility.md)** | Deterministic dataset generation with `--random-seed` | Debugging, regression testing, controlled experiments |
| **[Template Endpoint](docs/tutorials/template-endpoint.md)** | Benchmark custom APIs with flexible Jinja2 request templates | Custom API formats, rapid prototyping, non-standard endpoints |
| **[SGLang Image Generation](docs/tutorials/sglang-image-generation.md)** | Benchmark image generation APIs using SGLang with FLUX.1-dev model | Image generation testing, text-to-image benchmarking, extracting generated images |

### Analysis & Monitoring

| Feature | Description | Use Cases |
|---------|-------------|-----------|
| **[Timeslice Metrics](docs/tutorials/timeslices.md)** | Split a benchmark into timeslices and calculate metrics for each one | Load pattern impact, detecting warm-up effects, performance degradation analysis |
| **[Goodput](docs/tutorials/goodput.md)** | Throughput of requests meeting user-defined SLOs | SLO validation, capacity planning, runtime/model comparisons |
| **[HTTP Trace Metrics](docs/tutorials/http-trace-metrics.md)** | Detailed HTTP request lifecycle timing (DNS, TCP/TLS, TTFB) following k6 and HAR conventions | Connection debugging, latency breakdown, transport-layer analysis |
| **[Profile Exports](docs/tutorials/working-with-profile-exports.md)** | Parse and analyze `profile_export.jsonl` with Pydantic models, custom metrics, and async processing | Custom analysis, data pipelines, post-processing |
| **[Visualization & Plotting](docs/tutorials/plot.md)** | Generate PNG visualizations with automatic mode detection (single-run analysis or multi-run comparison) | Parameter sweep analysis, performance debugging, model comparison |
| **[GPU Telemetry](docs/tutorials/gpu-telemetry.md)** | Real-time GPU metrics collection via DCGM (power, utilization, memory, temperature, etc.) | Performance optimization, resource monitoring, multi-node telemetry |
| **[Server Metrics](docs/server_metrics/server-metrics.md)** | Collect Prometheus-compatible server metrics during benchmarking | Performance optimization, resource monitoring, multi-node telemetry |
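
To make the goodput definition in the table above concrete, here is a small sketch that counts only the requests meeting every SLO and divides by the benchmark duration. The function name, the metric names (`ttft_ms`, `latency_ms`), and the record layout are illustrative assumptions; they are not AIPerf's export schema or API.

```python
def goodput(requests, slos, duration_s):
    """Requests per second that satisfy every SLO threshold.

    requests: list of dicts mapping metric name -> measured value.
    slos: dict mapping metric name -> maximum allowed value.
    The metric names and data layout are assumptions for illustration.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1
        for r in requests
        # A request missing a metric is treated as failing that SLO.
        if all(name in r and r[name] <= limit for name, limit in slos.items())
    )
    return good / duration_s


samples = [
    {"ttft_ms": 18.0, "latency_ms": 480.0},
    {"ttft_ms": 95.0, "latency_ms": 510.0},  # misses the TTFT limit
    {"ttft_ms": 20.0, "latency_ms": 760.0},  # misses the latency limit
    {"ttft_ms": 16.0, "latency_ms": 450.0},
]
result = goodput(samples, {"ttft_ms": 50.0, "latency_ms": 600.0}, duration_s=2.0)
print(result)  # 1.0 (two good requests over two seconds)
```

Plain request throughput for the same data would be 2.0 requests/sec, so the gap between the two numbers shows how much of the served traffic violated an SLO.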

### Quick Navigation
```bash
# Basic profiling
aiperf profile --model Qwen/Qwen3-0.6B --url localhost:8000 --endpoint-type chat

# Request timeout testing
aiperf profile --request-timeout-seconds 30.0 [other options...]

# Trace-based benchmarking
aiperf profile --input-file trace.jsonl --custom-dataset-type single_turn [other options...]

# Fixed schedule execution
aiperf profile --input-file schedule.jsonl --fixed-schedule --fixed-schedule-auto-offset [other options...]

# Time-based benchmarking
aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other options...]
```

<br/>

## Supported APIs

- OpenAI chat completions
- OpenAI completions
- OpenAI embeddings
- OpenAI audio: request throughput and latency
- OpenAI images: request throughput and latency
- NIM rankings

<br/>


<!--
======================
INSTALLATION
======================
-->

## Installation
```bash
pip install aiperf
```

<br/>

<!--
======================
QUICK START
======================
-->

## Quick Start

### Basic Usage

Run a simple benchmark against a model:

```bash
aiperf profile \
  --model your_model_name \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming
```

### Example with Custom Configuration

```bash
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --concurrency 10 \
  --request-count 100 \
  --streaming
```

Example output:
<div align="center">

```
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃                               Metric ┃       avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│             Time to First Token (ms) │     18.26 │  11.22 │ 106.32 │  68.82 │  27.76 │  16.62 │ 12.07 │
│            Time to Second Token (ms) │     11.40 │   0.02 │  85.91 │  34.54 │  12.59 │  11.65 │  7.01 │
│                 Request Latency (ms) │    487.30 │ 267.07 │ 769.57 │ 715.99 │ 580.83 │ 536.17 │ 79.60 │
│             Inter Token Latency (ms) │     11.23 │   8.80 │  13.17 │  12.48 │  11.73 │  11.37 │  0.45 │
│     Output Token Throughput Per User │     89.23 │  75.93 │ 113.60 │ 102.28 │  90.91 │  90.29 │  3.70 │
│                    (tokens/sec/user) │           │        │        │        │        │        │       │
│      Output Sequence Length (tokens) │     42.83 │  24.00 │  65.00 │  64.00 │  52.00 │  47.00 │  7.21 │
│       Input Sequence Length (tokens) │     10.00 │  10.00 │  10.00 │  10.00 │  10.00 │  10.00 │  0.00 │
│ Output Token Throughput (tokens/sec) │ 10,944.03 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│    Request Throughput (requests/sec) │    255.54 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│             Request Count (requests) │    711.00 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
└──────────────────────────────────────┴───────────┴────────┴────────┴────────┴────────┴────────┴───────┘
```
</div>


## Known Issues

- Output sequence length constraints (`--output-tokens-mean`) cannot be guaranteed unless you pass `ignore_eos` and/or `min_tokens` via `--extra-inputs` to an inference server that supports them.
- Very high concurrency settings (typically above 15,000) may lead to port exhaustion on some systems, causing connection failures during benchmarking. If this happens, consider adjusting system limits or reducing concurrency.
- Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. If AIPerf appears to freeze during initialization, terminate the process and check your configuration settings.
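
For the port exhaustion issue above, a rough pre-flight estimate can help when choosing a concurrency value. The sketch below is not part of AIPerf. It assumes a Linux client, one local ephemeral port per open connection to a single server address, and the usual Linux default range of 32768 to 60999 as a fallback. Sockets lingering in `TIME_WAIT` also hold ports, so real headroom is typically lower than this estimate.

```python
from pathlib import Path

# Linux exposes the ephemeral port range here; other systems will use the fallback.
LINUX_PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def read_port_range(default=(32768, 60999)):
    """Read the ephemeral port range, falling back to the common Linux default."""
    try:
        low, high = LINUX_PORT_RANGE_FILE.read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        return default


def port_headroom(concurrency, port_range):
    """Ephemeral ports left if each connection holds exactly one local port."""
    low, high = port_range
    return (high - low + 1) - concurrency


headroom = port_headroom(15000, read_port_range())
if headroom < 0:
    print("Concurrency exceeds the ephemeral port range; expect connection failures.")
else:
    print(f"Estimated spare ephemeral ports: {headroom}")
```

A small or negative result suggests widening the port range or lowering `--concurrency` before starting a long run.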

