Metadata-Version: 2.4
Name: stressllm
Version: 0.1.1
Summary: Find the breaking point of your local LLM hardware.
Author: Vignesh
License: MIT
Project-URL: Homepage, https://github.com/iam-vignesh/stressllm
Project-URL: Repository, https://github.com/iam-vignesh/stressllm
Project-URL: Issues, https://github.com/iam-vignesh/stressllm/issues
Keywords: llm,benchmark,stress-test,gpu,ollama,performance
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: System :: Benchmark
Classifier: Topic :: System :: Hardware
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ollama>=0.1.0
Requires-Dist: typer[all]>=0.12.0
Requires-Dist: rich>=13.0.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: gpu
Requires-Dist: nvidia-ml-py>=12.0.0; extra == "gpu"
Provides-Extra: gguf
Requires-Dist: llama-cpp-python>=0.2.0; extra == "gguf"
Provides-Extra: all
Requires-Dist: nvidia-ml-py>=12.0.0; extra == "all"
Requires-Dist: llama-cpp-python>=0.2.0; extra == "all"
Provides-Extra: dev
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Dynamic: license-file

# ⚡ stressllm

**Find the breaking point of your local LLM hardware.**

stressllm is a CLI benchmarking tool that finds the "Performance Cliff" of your local setup. It progressively grows the context window and measures tokens-per-second, latency, VRAM usage, GPU temperature, and RAM pressure — then tells you exactly where your hardware gives up.

## Quick Start

```bash
pip install stressllm

# Stress test a model via Ollama
stressllm run gemma2 --depth 3

# Check your hardware and dependencies
stressllm info
```

## Prerequisites

| Requirement | Required? | Notes |
|-------------|-----------|-------|
| Python 3.9+ | Yes | |
| [Ollama](https://ollama.com/download) | Yes (for `run`) | Must be running: `ollama serve` |
| NVIDIA GPU + drivers | Optional | Enables VRAM and temperature monitoring |
| llama-cpp-python | Optional | Only needed for `check` command |

stressllm checks for Ollama on startup and will tell you exactly what's missing if something isn't right.

## Installation

```bash
# Basic install (Ollama stress testing)
pip install stressllm

# With GPU monitoring (quotes stop shells such as zsh from expanding the brackets)
pip install "stressllm[gpu]"

# With direct .gguf file analysis
pip install "stressllm[gguf]"

# Everything
pip install "stressllm[all]"
```

For development:
```bash
git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all,dev]"
```

## Usage

### `stressllm run` — Stress test via Ollama

```bash
stressllm run gemma2 --depth 3
```

Progressively fills the context window (2k → 8k → 32k → ...) and measures performance at each step.

| Option | Default | Description |
|--------|---------|-------------|
| `--depth` | 3 | Context steps (1–5). Higher = larger contexts tested. |
| `--timeout` | 300 | Max seconds per context step. 0 = no limit. |
| `--verbose` | off | Show detected hardware and dependency info before the test. |
| `--json` | off | Output results as JSON for scripting and CI. |
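
For scripting, the `--json` output can be captured from Python. The sketch below assumes the command prints a single JSON document to stdout. The result schema is not documented here, so the helper (`run_json`, a hypothetical name) only parses the output and makes no assumptions about its keys.

```python
import json
import subprocess


def run_json(cmd, timeout=None):
    """Run a command and parse its stdout as a single JSON document.

    Raises CalledProcessError on a non-zero exit code and
    json.JSONDecodeError if stdout is not valid JSON.
    """
    proc = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(proc.stdout)


# Example call (requires stressllm and a running Ollama server):
# results = run_json(["stressllm", "run", "gemma2", "--depth", "1", "--json"])
# print(json.dumps(results, indent=2))
```

In CI, a non-zero exit code or unparseable output surfaces as an exception, which is usually enough to fail the job.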


Example output:
```
╭─────────────────────────────────────────────────────╮
│  ⚡ stressllm — Stress Testing: gemma2              │
│  NVIDIA RTX 4090 · 24GB VRAM · 64GB RAM             │
╰─────────────────────────────────────────────────────╯

 Context   TPS     TTFT      VRAM     GPU Temp   RAM     Status
 ───────   ─────   ──────    ──────   ────────   ─────   ──────
 2k        45.2    120ms     34.2%    52°C       41%     ✅ Smooth
 8k        38.7    340ms     58.1%    61°C       43%     ✅ Smooth
 32k       12.1    1.4s      89.3%    74°C       52%     ⚠️  Slowing
 128k      2.3     8.2s      97.8%    82°C       68%     💀 Cliff

╭─────────────────────────────────────────────────────╮
│  Verdict: gemma2 runs well up to 8k context.        │
│  Performance cliff detected at 32k.                 │
╰─────────────────────────────────────────────────────╯
```

### `stressllm check` — Direct .gguf analysis

```bash
stressllm check ./models/gemma-2b-q4.gguf --n-gpu -1
```

Loads a .gguf file directly into memory (no Ollama needed) and benchmarks it.

| Option | Default | Description |
|--------|---------|-------------|
| `--n-gpu` | -1 | GPU layers to offload (-1 = all). |
| `--depth` | 3 | Context steps (1–5). |

Requires `llama-cpp-python`: `pip install "stressllm[gguf]"`

### `stressllm info` — Hardware & dependency check

```bash
stressllm info
```

Shows detected GPU, RAM, CPU cores, dependency status (Ollama, pynvml, llama-cpp-python), depth level reference, and status legend. Useful for debugging and issue reports.
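
A similar dependency check can be reproduced outside the tool. This is a minimal sketch, not stressllm's own implementation. It assumes the usual import names for each distribution (`ollama`, `pynvml` for `nvidia-ml-py`, `llama_cpp` for `llama-cpp-python`), and `dependency_status` is a hypothetical helper.

```python
import importlib.util

# Distribution name -> importable module name (assumed standard names).
OPTIONAL_MODULES = {
    "ollama": "ollama",
    "nvidia-ml-py": "pynvml",
    "llama-cpp-python": "llama_cpp",
}


def dependency_status(modules=OPTIONAL_MODULES):
    """Return {distribution: True/False} without importing the modules."""
    return {
        dist: importlib.util.find_spec(module) is not None
        for dist, module in modules.items()
    }


for dist, installed in dependency_status().items():
    print(f"{dist}: {'installed' if installed else 'missing'}")
```

This only tells you whether a package is importable. It does not confirm that the Ollama server is running or that NVIDIA drivers are present.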

### `stressllm models` — List available models

```bash
stressllm models
```

Lists all models pulled in Ollama with their size and a ready-to-copy run command for each one.

## Known Limitations

- **TPS measures generation speed.** The model generates 32 tokens at each context step to measure real-world output speed. TTFT (time to first token) measures how fast the model processes your input context.
- **High depths are slow.** Depth 4 (128k) and depth 5 (512k) can take several minutes per step. Start with `--depth 1` or `--depth 2` to verify things work before going deeper. Each step has a default timeout of 5 minutes — use `--timeout 120` to shorten it or `--timeout 0` for no limit.
- **Ctrl+C works during tests.** If a step is taking too long, press Ctrl+C to stop and see partial results for steps already completed.
- **GPU metrics are NVIDIA-only.** AMD and Apple Silicon GPUs won't report VRAM or temperature. The tool still works in CPU-only mode with RAM and CPU% metrics.
- **Model names must be exact.** Use the full name including the tag — `gemma:2b`, not `gemma`. Run `stressllm models` to see exact names available on your machine.
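
The generation-speed figure in the first item above can be approximated from the timing fields Ollama includes in a completed response: `eval_count` (generated tokens), `eval_duration` and `prompt_eval_duration` (both in nanoseconds). The sketch below works on a plain dict with those fields. Whether stressllm derives TPS and TTFT this way is an assumption, and prompt evaluation time is only a proxy for TTFT.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def generation_tps(response):
    """Tokens generated per second, from Ollama-style timing fields."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) * NANOSECONDS_PER_SECOND / duration_ns


def prompt_seconds(response):
    """Seconds spent evaluating the prompt (a rough stand-in for TTFT)."""
    return response.get("prompt_eval_duration", 0) / NANOSECONDS_PER_SECOND


# Illustrative values only, not measured results.
sample = {
    "eval_count": 32,
    "eval_duration": 800_000_000,
    "prompt_eval_duration": 340_000_000,
}
print(generation_tps(sample), prompt_seconds(sample))
```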

## How It Works

stressllm forces the model to allocate progressively larger KV caches by setting `num_ctx` on each Ollama request. To fill each context window, it builds prompts from a pool of 1000 common English words, each of which is roughly one token:

| Depth | Context Steps Tested |
|-------|---------------------|
| 1 | 2k |
| 2 | 2k → 8k |
| 3 | 2k → 8k → 32k |
| 4 | 2k → 8k → 32k → 128k |
| 5 | 2k → 8k → 32k → 128k → 512k |
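
The table follows a simple pattern: each step is four times the previous one. The sketch below reproduces that schedule and builds a filler prompt from a word pool. It assumes "2k" means 2048 tokens and that one word is about one token. The names `context_steps` and `filler_prompt` are hypothetical, and the three-word pool stands in for the real 1000-word list.

```python
import random

BASE_CONTEXT = 2048  # assumption: "2k" means 2048 tokens
GROWTH_FACTOR = 4    # each step in the table is 4x the previous one


def context_steps(depth):
    """Return the context sizes tested for a given depth (1 to 5)."""
    if not 1 <= depth <= 5:
        raise ValueError("depth must be between 1 and 5")
    return [BASE_CONTEXT * GROWTH_FACTOR ** i for i in range(depth)]


def filler_prompt(word_pool, target_tokens, seed=0):
    """Build a prompt of `target_tokens` words drawn from the pool."""
    rng = random.Random(seed)  # seeded so repeated runs use the same prompt
    return " ".join(rng.choices(word_pool, k=target_tokens))


print(context_steps(3))
print(filler_prompt(["the", "of", "and"], 8))
```

Because real tokenizers split some words into several tokens, a prompt built this way only approximates the target size.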

At each step, it measures tokens-per-second (TPS), time-to-first-token (TTFT), and hardware telemetry. The "Performance Cliff" is the context size where TPS drops below usable thresholds:

- **TPS > 15** → ✅ Smooth
- **TPS 5–15** → ⚠️ Slowing
- **TPS < 5** → 💀 Cliff
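
The thresholds above map directly onto a small function. This is a sketch of the rule as stated, with the boundary values 5 and 15 treated as "Slowing"; `classify_tps` is a hypothetical name.

```python
def classify_tps(tps):
    """Map a tokens-per-second reading to a status label."""
    if tps > 15:
        return "Smooth"
    if tps >= 5:
        return "Slowing"
    return "Cliff"


# TPS values taken from the example output earlier in this README.
for context, tps in [("2k", 45.2), ("8k", 38.7), ("32k", 12.1), ("128k", 2.3)]:
    print(context, classify_tps(tps))
```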

## FAQ

**What if I don't have a GPU?**
stressllm works fine in CPU-only mode. GPU columns are replaced with CPU% and the verdict adapts accordingly.
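
The fallback can be pictured as a telemetry reader that fills in GPU fields only when NVML is usable. This is a sketch under assumptions, not the tool's code. It uses `psutil` and `pynvml` (from `nvidia-ml-py`) if they are importable, reads GPU index 0 only, and leaves missing values as `None`.

```python
def read_telemetry():
    """Collect RAM/CPU usage, plus VRAM and GPU temperature when available."""
    sample = {
        "ram_percent": None,
        "cpu_percent": None,
        "vram_percent": None,
        "gpu_temp_c": None,
    }

    try:
        import psutil
        sample["ram_percent"] = psutil.virtual_memory().percent
        sample["cpu_percent"] = psutil.cpu_percent(interval=0.1)
    except ImportError:
        pass  # psutil not installed: leave RAM/CPU fields empty

    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        return sample  # no pynvml or no NVIDIA driver: CPU-only result

    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
        sample["vram_percent"] = 100.0 * memory.used / memory.total
        sample["gpu_temp_c"] = pynvml.nvmlDeviceGetTemperature(
            handle, pynvml.NVML_TEMPERATURE_GPU
        )
    except pynvml.NVMLError:
        pass  # GPU query failed: keep the GPU fields empty
    finally:
        pynvml.nvmlShutdown()
    return sample


print(read_telemetry())
```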

**What models work?**
Any model available in Ollama. Run `ollama list` to see what you have pulled.

**How accurate is this?**
The synthetic prompts stress the KV cache but don't perfectly replicate real workloads. Treat the results as a ceiling: real-world performance may vary with prompt complexity.

**Why do I get different results on back-to-back runs?**
Normal. Results can vary ±20% between runs due to thermal throttling, background system load, Ollama's KV cache state, and VRAM fragmentation. If a context size flips between "Slowing" and "Cliff" across runs, that's your borderline — treat it as the edge of what your hardware can handle.
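
One way to spot a borderline context size is to repeat the run a few times and compare the TPS readings. The helper below is a hypothetical sketch: it summarizes a list of readings for one context size and flags the case where they fall on both sides of the 5 or 15 TPS thresholds.

```python
import statistics


def summarize_runs(tps_samples):
    """Summarize repeated TPS readings for a single context size."""
    low, high = min(tps_samples), max(tps_samples)
    mean = statistics.mean(tps_samples)
    spread = (high - low) / mean * 100 if mean else 0.0
    # Borderline: readings land on both sides of a status threshold.
    borderline = (low < 5 <= high) or (low <= 15 < high)
    return {
        "mean_tps": mean,
        "min_tps": low,
        "max_tps": high,
        "spread_percent": spread,
        "borderline": borderline,
    }


# Illustrative readings only, not measured results.
print(summarize_runs([4.6, 5.3, 5.1]))
```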

**Why isn't Ollama detected even though it's running?**
Make sure it's serving on the default port: `http://localhost:11434`. Check with `curl http://localhost:11434/api/tags`.
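
The same check can be scripted with the standard library. This is a minimal sketch using the URL above; `ollama_reachable` is a hypothetical helper, and it only confirms that the endpoint answers with HTTP 200.

```python
import urllib.error
import urllib.request


def ollama_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if GET {base_url}/api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


print("Ollama reachable:", ollama_reachable())
```

If this returns `False` while Ollama is running, the server is probably bound to a different host or port.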

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full guide. Quick version:

```bash
git clone https://github.com/iam-vignesh/stressllm
cd stressllm
pip install -e ".[all,dev]"

# Verify
stressllm info

# Run tests and checks
pytest
ruff check src/
# bandit is not part of the dev extra; install it first with: pip install bandit
bandit -r src/
```

Issues and PRs welcome. Please keep the code simple — this is a CLI tool, not a framework.

## License

MIT
