Metadata-Version: 2.4
Name: slurmq
Version: 0.0.2
Summary: Slurm GPU quota monitoring and management
Keywords: slurm,hpc,gpu,quota,monitoring
Author: Dedalus Labs
Author-email: Dedalus Labs <oss@dedaluslabs.ai>
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: System :: Clustering
Classifier: Topic :: System :: Monitoring
Requires-Dist: typer>=0.12
Requires-Dist: rich>=13.0
Requires-Dist: textual>=0.50
Requires-Dist: pydantic>=2.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: tomli-w>=1.0
Requires-Dist: pandas>=2.0
Requires-Dist: tabulate>=0.9
Requires-Dist: platformdirs>=4.0
Requires-Dist: pytest>=8.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0 ; extra == 'dev'
Requires-Dist: ruff>=0.4 ; extra == 'dev'
Requires-Dist: ty ; extra == 'dev'
Requires-Dist: mkdocs>=1.6 ; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5 ; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25 ; extra == 'docs'
Requires-Dist: mkdocs-llmstxt>=0.2 ; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.0 ; extra == 'docs'
Requires-Python: >=3.11
Project-URL: Documentation, https://dedalus-labs.github.io/slurmq
Project-URL: Homepage, https://github.com/dedalus-labs/slurmq
Project-URL: Issues, https://github.com/dedalus-labs/slurmq/issues
Project-URL: Repository, https://github.com/dedalus-labs/slurmq.git
Provides-Extra: dev
Provides-Extra: docs
Description-Content-Type: text/markdown

# slurmq

GPU quota management for Slurm clusters.

```console
$ slurmq check

╭──────────────────── GPU Quota Report ────────────────────╮
│                                                          │
│   User:     dedalus                                      │
│   QoS:      medium                                       │
│   Cluster:  Stella HPC                                   │
│                                                          │
│   ████████████████████░░░░░░░░░░ 68.5%                   │
│                                                          │
│   Used:      342.5 GPU-hours                             │
│   Remaining: 157.5 GPU-hours                             │
│   Quota:     500 GPU-hours (rolling 30 days)             │
│                                                          │
╰──────────────────────────────────────────────────────────╯
```

## Install

```bash
uv tool install slurmq
```

## Setup

```bash
slurmq config init       # interactive wizard
slurmq config show       # verify settings
slurmq config validate   # check syntax before deploy
```

Config resolution order:

1. `SLURMQ_CONFIG` env var
2. `~/.config/slurmq/config.toml` (user)
3. `/etc/slurmq/config.toml` (system-wide)

```toml
default_cluster = "stella"

[clusters.stella]
name = "Stella HPC"
account = "research"
qos = ["low", "medium"]
quota_limit = 500        # GPU-hours
rolling_window_days = 30
```
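
To make `quota_limit` and `rolling_window_days` concrete, here is a rough sketch of the arithmetic behind a rolling-window quota. The `Job` record and `quota_summary` function are hypothetical, and counting a job by its end time is an assumption; slurmq may attribute usage differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    """Hypothetical job record; not slurmq's data model."""
    end_time: datetime
    gpus: int
    elapsed_hours: float


def quota_summary(
    jobs: list[Job],
    now: datetime,
    quota_limit: float = 500.0,
    rolling_window_days: int = 30,
) -> dict[str, float]:
    """Sum GPU-hours for jobs inside the rolling window and compare to the limit."""
    cutoff = now - timedelta(days=rolling_window_days)
    # Assumption: a job counts toward the window if it ended after the cutoff.
    used = sum(j.gpus * j.elapsed_hours for j in jobs if j.end_time >= cutoff)
    return {
        "used": used,
        "remaining": max(quota_limit - used, 0.0),
        "percent": 100.0 * used / quota_limit,
    }
```

With 342.5 GPU-hours inside the window and a limit of 500, this gives 157.5 remaining and 68.5% used, matching the sample report at the top of this README.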

## Commands

### check

```bash
slurmq check                  # current user
slurmq check --user alice     # specific user
slurmq check --cluster other  # different cluster
slurmq check --forecast       # usage projection
slurmq --json check           # machine-readable
slurmq --quiet check          # silent on success (for scripts)
```

### efficiency

Analyze job resource efficiency (like `seff`).

```bash
slurmq efficiency 12345
```

Flags jobs with low efficiency: CPU below 30% or memory below 20%.
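
As a rough illustration of how such thresholds might be applied, the sketch below uses `seff`-style definitions: CPU efficiency as CPU time used over cores times wall time, and memory efficiency as peak memory over requested memory. The function name and parameters are hypothetical, and slurmq's own formulas are assumed, not confirmed.

```python
def efficiency_flags(
    cpu_seconds_used: float,
    cores: int,
    elapsed_seconds: float,
    max_rss_mb: float,
    requested_mem_mb: float,
    cpu_threshold: float = 30.0,  # percent, from the README
    mem_threshold: float = 20.0,  # percent, from the README
) -> tuple[float, float, list[str]]:
    """Return (cpu_eff, mem_eff, flags) using seff-style ratios (assumed)."""
    cpu_eff = 100.0 * cpu_seconds_used / (cores * elapsed_seconds)
    mem_eff = 100.0 * max_rss_mb / requested_mem_mb
    flags = []
    if cpu_eff < cpu_threshold:
        flags.append("cpu")
    if mem_eff < mem_threshold:
        flags.append("memory")
    return cpu_eff, mem_eff, flags
```

For example, a one-hour job on 8 cores that used one core-hour of CPU time has 12.5% CPU efficiency and would be flagged.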

### report

Generate usage reports (admin).

```bash
slurmq report                          # table view
slurmq report --format csv -o out.csv
```

### monitor

Real-time monitoring with optional enforcement (admin).

```bash
slurmq monitor                # live dashboard, 30s refresh
slurmq monitor --interval 10
slurmq monitor --once         # single check, for cron
slurmq monitor --enforce      # cancel jobs over quota
```

### stats

Cluster-wide analytics with month-over-month comparison.

```bash
slurmq stats                          # GPU utilization + wait times
slurmq stats --days 14                # custom period
slurmq stats --no-compare             # skip MoM comparison
slurmq stats -p gpu -p gpu-large      # specific partitions
slurmq stats --small-threshold 25     # custom job size threshold
slurmq --json stats                   # machine-readable
```

Shows:

- GPU utilization by partition/QoS
- Wait time analysis (median, % jobs waiting > 6h)
- Small vs large job breakdown
- Month-over-month trends

## Enforcement

Cancel jobs automatically when users exceed quota.

```toml
[enforcement]
enabled = true
dry_run = true            # preview mode
grace_period_hours = 24   # warn before cancel
exempt_users = ["admin"]
exempt_job_prefixes = ["checkpoint_"]
```

Run with `slurmq monitor --enforce`. Keep `dry_run = true` while previewing, then set it to `false` when you are ready to cancel jobs for real.

Grace period: users who exceed quota get a warning window (`grace_period_hours`) before their jobs are cancelled.

## Job States

Problematic states are highlighted:

| State | Meaning       |
| ----- | ------------- |
| `OOM` | Out of Memory |
| `TO`  | Timeout       |
| `NF`  | Node Failure  |
| `F`   | Failed        |
| `PR`  | Preempted     |

## Scripting

```bash
# check quota status
if slurmq --json check | jq -e '.status == "exceeded"' > /dev/null; then
  echo "Quota exceeded"
fi

# crontab entry (not a shell command): enforce every 5 minutes in quiet mode
*/5 * * * * slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1
```
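
The same quota check can be done from Python. This sketch relies only on the `status` field and the `"exceeded"` value used in the `jq` example above; the rest of the JSON schema is not assumed, and both function names are hypothetical.

```python
import json
import subprocess


def quota_exceeded(report_json: str) -> bool:
    """Return True if the JSON report's `status` field equals "exceeded"."""
    report = json.loads(report_json)
    return report.get("status") == "exceeded"


def check_quota() -> bool:
    """Run `slurmq --json check` and interpret its output (requires slurmq on PATH)."""
    result = subprocess.run(
        ["slurmq", "--json", "check"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return quota_exceeded(result.stdout)
```

Keeping the parsing in `quota_exceeded` separate from the subprocess call makes it easy to test against a saved report.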

## Documentation

**Online:** [dedalus-labs.github.io/slurmq](https://dedalus-labs.github.io/slurmq)

**For LLMs:** [llms.txt](https://dedalus-labs.github.io/slurmq/llms.txt) | [llms-full.txt](https://dedalus-labs.github.io/slurmq/llms-full.txt)

**Locally:**

```bash
uv sync --extra docs
uv run mkdocs serve
```

## Development

```bash
git clone https://github.com/dedalus-labs/slurmq.git && cd slurmq
uv sync --all-extras
uv run pytest
uv run ruff check
uv run ty check
```

## License

MIT
