Metadata-Version: 2.4
Name: jazari
Version: 0.1.0
Summary: The orchestration layer for modern Slurm clusters.
Author-email: Levent Ozbek <levent@jazari.run>
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: annotated-types==0.7.0
Requires-Dist: certifi==2025.11.12
Requires-Dist: charset-normalizer==3.4.4
Requires-Dist: click==8.3.1
Requires-Dist: gitdb==4.0.12
Requires-Dist: GitPython==3.1.45
Requires-Dist: idna==3.11
Requires-Dist: Jinja2==3.1.6
Requires-Dist: markdown-it-py==4.0.0
Requires-Dist: MarkupSafe==3.0.3
Requires-Dist: mdurl==0.1.2
Requires-Dist: packaging==25.0
Requires-Dist: platformdirs==4.5.0
Requires-Dist: pydantic==2.12.5
Requires-Dist: pydantic_core==2.41.5
Requires-Dist: Pygments==2.19.2
Requires-Dist: PyYAML==6.0.3
Requires-Dist: requests==2.32.5
Requires-Dist: rich==14.2.0
Requires-Dist: sentry-sdk==2.46.0
Requires-Dist: shellingham==1.5.4
Requires-Dist: smmap==5.0.2
Requires-Dist: typer==0.20.0
Requires-Dist: typing-inspection==0.4.2
Requires-Dist: typing_extensions==4.15.0
Requires-Dist: urllib3==2.5.0
Requires-Dist: wandb==0.17.0
Requires-Dist: huggingface_hub>=0.23.0
Dynamic: license-file

# 🐘 Jazari

**The orchestration layer for modern Slurm clusters.**

Jazari is a command-line tool that makes launching distributed AI/ML training jobs on high-performance computing (HPC) clusters as easy as running a script on your laptop.

It abstracts away the complexity of writing Slurm (`#SBATCH`) scripts, handling multi-node networking, and managing environment variables.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

## Why Jazari?

If you are a researcher using a university or national supercomputer (like Compute Canada/Digital Alliance), you know the pain:

* **The Script Nightmare:** Copy-pasting old bash scripts, accidentally leaving in wrong parameters, and debugging obscure `sbatch` errors.
* **Networking Headaches:** Manually figuring out how to get PyTorch DDP to find the master node's IP address across multiple machines.
* **Experiment Fragmentation:** Having a 4-node training run spawn 4 separate experiments in Weights & Biases.
* **Slurm Arcana:** Remembering obscure flags for accounts, time formats, and memory allocation.

**Jazari solves this.** You write your Python training script, and Jazari handles the rest.

---

## ✨ Key Features

* **🚀 One-Command Launch:** Go from a Python script to a running distributed job with a single CLI command.
* **🤖 Automatic Slurm Generation:** Uses robust Jinja2 templates to generate correct, safe `#SBATCH` scripts on the fly.
* **🧠 Zero-Config DDP Networking:** Automatically resolves master IP, ports, and world size for PyTorch Distributed Data Parallel.
* **📈 Seamless Weights & Biases:** Securely propagates your local API key and ensures multi-node jobs log as a single, clean run.
* **⚙️ User "Init" Profiles:** Save your default account, time limits, and preferences once and never type them again.
* **🎨 Beautiful CLI:** Modern, color-coded output with rich error reporting.
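
To make the DDP networking feature concrete, here is a minimal sketch of the arithmetic involved. It is not Jazari's actual implementation: the `ddp_env` helper, its parameters, and the choice of port `29500` are assumptions made for illustration. The variable names `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` are the ones PyTorch's `env://` initialization reads.

```python
from typing import Dict, List


def ddp_env(
    hostnames: List[str],
    gpus_per_node: int,
    node_rank: int,
    local_rank: int,
    master_port: int = 29500,  # assumed port, not necessarily what Jazari picks
) -> Dict[str, str]:
    """Hypothetical helper: build the env vars for one DDP process."""
    if not hostnames:
        raise ValueError("at least one hostname is required")
    # One process per GPU, so the world size is nodes x GPUs per node.
    world_size = len(hostnames) * gpus_per_node
    # Global rank: processes on earlier nodes come first.
    rank = node_rank * gpus_per_node + local_rank
    return {
        "MASTER_ADDR": hostnames[0],  # first node acts as the rendezvous host
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": str(local_rank),
    }


# Third GPU on the second of two 4-GPU nodes.
env = ddp_env(["node001", "node002"], gpus_per_node=4, node_rank=1, local_rank=2)
print(env["RANK"], env["WORLD_SIZE"])  # prints: 6 8
```

On a real cluster the hostnames would come from the Slurm allocation rather than a hard-coded list.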

---

## 🛠️ Installation

Jazari is designed to be installed in a virtual environment on your cluster's login node.

```bash
# 1. Clone the repository
git clone https://github.com/levoz92/jazari.git
cd jazari

# 2. Create and activate a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate

# 3. Install in editable mode
pip install -e .
```

Verify the installation:
```bash
jazari --help
```

---

## ⚡ Quick Start

### 1. One-Time Setup

On your cluster login node, run the init command to save your defaults (like your allocation account).

```bash
jazari init
```

### 2. Create a Python Script

Here is a minimal PyTorch DDP example (`train.py`):

```python
import os
import torch
import torch.distributed as dist

def main():
    # 1. Initialize Process Group (Jazari sets all necessary env vars)
    dist.init_process_group(backend="nccl")
    
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    print(f"👋 Hello from rank {rank} of {world_size} on node {os.uname().nodename}!")

    # 2. Your training loop here...

    # 3. Clean up
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

### 3. Launch It!

Run your script on 2 nodes with 4 GPUs per node (8 GPUs total).

```bash
jazari run -N 2 -G 4 --name "my-big-run" --track-wandb python train.py --batch-size 128
```

**That's it.** Jazari will generate the script, submit it to Slurm, and stream the output.

---

## 📖 Usage Reference

### `jazari init`
Interactively configure your default settings. These are saved to `~/.jazari/config.yaml`.
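
The sketch below shows one plausible way saved defaults could interact with command-line options. It assumes that an explicit CLI value wins over a saved config value, which in turn wins over the built-in default. That precedence, the `resolve_options` helper, and the key names are assumptions for illustration, not Jazari's confirmed behavior or config schema. The built-in values are taken from the `jazari run` option table.

```python
from typing import Any, Dict

# Built-in defaults as listed in the `jazari run` option table.
BUILTIN_DEFAULTS: Dict[str, Any] = {
    "nodes": 1,
    "gpus": 1,
    "cpus": 1,
    "time": "01:00:00",
    "name": "jazari_run",
    "track_wandb": False,
}


def resolve_options(cli: Dict[str, Any], config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge built-ins, saved config, and CLI values.

    A value of None means "not provided" and never overrides anything.
    """
    resolved = dict(BUILTIN_DEFAULTS)
    resolved.update({k: v for k, v in config.items() if v is not None})
    resolved.update({k: v for k, v in cli.items() if v is not None})
    return resolved


saved = {"account": "def-user", "gpus": 4}  # e.g. values stored by `jazari init`
options = resolve_options({"nodes": 2, "gpus": None}, saved)
print(options["nodes"], options["gpus"], options["account"])  # prints: 2 4 def-user
```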

### `jazari run [OPTIONS] COMMAND`

| Option | Shorthand | Description | Default |
| :--- | :--- | :--- | :--- |
| `--nodes` | `-N` | Number of compute nodes to request. | `1` (or config default) |
| `--gpus` | `-G` | Number of GPUs per node. | `1` (or config default) |
| `--cpus` | `-c` | Number of CPU cores per task. | `1` (or config default) |
| `--time` | `-t` | Time limit in Slurm's `D-HH:MM` format (e.g., `0-02:30` for 2.5 hours). | `01:00:00`, i.e. 1 hour in `HH:MM:SS` form (or config default) |
| `--account`| `-A` | Slurm account to charge (e.g., `def-user`). | Config default |
| `--name` | `-n` | Name for the job in Slurm and W&B. | `jazari_run` |
| `--track-wandb` | | Auto-configure Weights & Biases environment. | `False` (or config default) |
| `--dry-run` | | Print the generated `#SBATCH` script to stdout without submitting. | `False` |
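
As a rough picture of what `--dry-run` prints, the options above map naturally onto standard `sbatch` directives. The sketch below is an assumption-laden illustration, not Jazari's real template. Jazari uses Jinja2, while this example uses the standard library's `string.Template` to stay dependency-free. The `render_sbatch` helper and the bare `srun` launch line are hypothetical, and the real script also has to handle networking and environment setup.

```python
import shlex
from string import Template
from typing import List

# Hypothetical template; Jazari's actual Jinja2 template may differ.
SBATCH_TEMPLATE = Template("""\
#!/bin/bash
#SBATCH --job-name=$name
#SBATCH --nodes=$nodes
#SBATCH --gpus-per-node=$gpus
#SBATCH --cpus-per-task=$cpus
#SBATCH --time=$time
#SBATCH --account=$account

srun $command
""")


def render_sbatch(
    name: str,
    nodes: int,
    gpus: int,
    cpus: int,
    time: str,
    account: str,
    command: List[str],
) -> str:
    """Hypothetical helper: render a batch script from resolved options."""
    return SBATCH_TEMPLATE.substitute(
        name=shlex.quote(name),
        nodes=nodes,
        gpus=gpus,
        cpus=cpus,
        time=time,
        account=shlex.quote(account),
        # shlex.join quotes each argument so spaces survive the shell.
        command=shlex.join(command),
    )


print(render_sbatch("my-big-run", 2, 4, 1, "0-02:30", "def-user",
                    ["python", "train.py", "--batch-size", "128"]))
```

Quoting the job name, account, and command before substitution keeps a value containing spaces or shell metacharacters from changing the meaning of the generated script.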

---

## 📄 License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
