Metadata-Version: 2.4
Name: torchlitex
Version: 0.1.2
Summary: Tiny DDP training toolkit for quick-launch distributed training loops.
Author: Torchlitex Authors
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Provides-Extra: wandb
Requires-Dist: wandb>=0.16; extra == "wandb"

# torchlitex

Tiny DDP launcher + trainer that exists because PyTorch 2.x still thinks we enjoy 400 lines of torchrun boilerplate and cryptic NCCL errors. torchlitex trims that to ~20 lines and keeps `fork`-based spawning happy on Vast.

## Why not just torchrun?
- You like your code more than the 17 environment variables PyTorch asks you to memorize.
- `torchrun` still feels like a 2010 MPI cosplay.
- You want fork-based spawn that doesn’t randomly faceplant on Vast.
- You want a one-function launcher + trainer, not a CLI maze.

## Install
```bash
pip install -e .
# or with wandb logging
# (quoted so shells like zsh don't try to glob the brackets)
pip install -e ".[wandb]"
```

## Quick start
```python
from torchlitex.launcher import launch, DistributedConfig
from torchlitex.trainer import Trainer
from torch import nn, optim
import torch

def train_fn(rank, world_size, batch_size, epochs):
    # MyDataset and MyModel are placeholders: swap in your own Dataset and nn.Module.
    dataset = MyDataset(...)
    model = MyModel(...)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=3e-4)

    trainer = Trainer(
        model=model,
        dataset=dataset,
        loss_fn=loss_fn,
        optimizer=optimizer,
        grad_clip_norm=1.0,
        log_every=10,
    )
    trainer.ddp_train_loop(rank, world_size, batch_size=batch_size, epochs=epochs, ckpt_path="ckpt.pt")

if __name__ == "__main__":
    cfg = DistributedConfig(gpus=8)  # backend auto-switches nccl/gloo
    launch(train_fn, cfg, batch_size=64, epochs=20)
```

## Features
- Fork-first DDP launcher (no torchrun, no elastic).
- Auto backend: `nccl` when CUDA exists, `gloo` when you’re on a laptop/CI.
- Trainer: AMP toggle, grad clipping, gradient accumulation, microbatching, optional schedulers, eval hook, callbacks, and EMA support.
- Optional wandb init + logging on rank0 (install with `.[wandb]`).
- DistributedSampler + DataLoader defaults that just work.
- Checkpoint utilities that handle optimizer/scaler safely.
- Rank-aware logging that doesn’t spam.
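
To make the first two bullets concrete, here is a minimal sketch of the per-rank setup any fork-based launcher has to do before `torch.distributed.init_process_group` runs. This is not torchlitex's internal code: `pick_backend` and `rank_env` are hypothetical helper names, and the single-node `LOCAL_RANK == RANK` mapping and the default address/port are assumptions.

```python
def pick_backend(cuda_available: bool) -> str:
    """NCCL needs CUDA devices; Gloo also runs on CPU-only machines."""
    return "nccl" if cuda_available else "gloo"


def rank_env(rank: int, world_size: int,
             master_addr: str = "127.0.0.1",
             master_port: int = 29500) -> dict:
    """Build the variables read by torch.distributed's default env:// rendezvous.

    Hypothetical helper for illustration. Assumes a single node, so
    LOCAL_RANK is the same as RANK.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": str(rank),  # single-node assumption
    }
```

Inside each worker you would apply it with `os.environ.update(rank_env(rank, world_size))` and pass `pick_backend(torch.cuda.is_available())` as the `backend` argument to `init_process_group`.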

## Testing levels
- Level 1: CPU unit tests (no dist).
- Level 2: CPU DDP (`backend="gloo"`, `world_size=2`) to validate spawn/env/sampler.
- Level 3: Single GPU (`gpus=1`) for the end-to-end DDP path.
- Level 4: Real multi-GPU (same code, just crank `gpus`).
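
Level 2 is mostly about catching sharding mistakes, so it helps to know what "correct" looks like. The sketch below mirrors, in plain Python, the pad-then-stride scheme that `DistributedSampler` uses by default (`drop_last=False`): every rank gets the same number of indices, and together they cover the whole dataset. `shard_indices` is a hypothetical helper, not part of torchlitex, and it shuffles with `random.Random` instead of torch's generator, so the actual order will differ from PyTorch's.

```python
import math
import random


def shard_indices(n, rank, world_size, shuffle=True, seed=0, epoch=0):
    """Return the dataset indices one rank would see for one epoch."""
    if n == 0:
        return []
    indices = list(range(n))
    if shuffle:
        # Same seed on every rank, so all ranks agree on the permutation.
        random.Random(seed + epoch).shuffle(indices)
    per_rank = math.ceil(n / world_size)
    total = per_rank * world_size
    # Pad by repeating from the start so the length divides evenly.
    padded = (indices * math.ceil(total / n))[:total]
    # Rank r takes every world_size-th element starting at offset r.
    return padded[rank:total:world_size]
```

A Level 2 test can then assert the two invariants directly: `len(shard)` is identical on every rank, and the union of all shards equals `set(range(n))`.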

## PyTorch 2.x roast (lightly toasted)
- DDP config still feels like “choose your own adventure” but every page ends with NCCL complaining.
- torch.distributed docs read like a treasure map; the treasure is another flag.
- “Just use torchrun” is 2020’s “have you tried turning it off and on again?”

torchlitex keeps the good bits of torch 2.x (SDPA, compile, etc.) and sidesteps the distributed busywork. Use it, ship models, spend less time appeasing the NCCL spirits.
