Metadata-Version: 2.4
Name: trainkeeper
Version: 0.2.3
Summary: Minimal-decision tools for reproducible, debuggable training experiments.
Author: Mohamed Salem
License: Apache-2.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Requires-Dist: numpy>=1.23
Requires-Dist: pandas>=1.5
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: vision
Requires-Dist: torchvision; extra == "vision"
Provides-Extra: nlp
Requires-Dist: datasets; extra == "nlp"
Requires-Dist: transformers; extra == "nlp"
Requires-Dist: torch; extra == "nlp"
Provides-Extra: tabular
Requires-Dist: scikit-learn; extra == "tabular"
Requires-Dist: openml; extra == "tabular"
Provides-Extra: wandb
Requires-Dist: wandb>=0.16; extra == "wandb"
Provides-Extra: mlflow
Requires-Dist: mlflow>=2.9; extra == "mlflow"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: mkdocs>=1.5; extra == "dev"
Requires-Dist: mkdocs-material>=9.0; extra == "dev"
Provides-Extra: bench
Requires-Dist: matplotlib>=3.7; extra == "bench"
Dynamic: license-file

![TrainKeeper logo](https://i.ibb.co/fz6bhjYd/20a1c6f0-e776-402d-b1d5-d0706928e9e8.png)

<p align="center">
  <img src="https://img.shields.io/pypi/v/trainkeeper?color=blue&label=PyPI&logo=pypi" />
  <img src="https://img.shields.io/github/license/mosh3eb/TrainKeeper?color=green" />
  <img src="https://img.shields.io/github/stars/mosh3eb/TrainKeeper?style=social" />
</p>

# TrainKeeper  
### Training-Time System Guardrails for Reliable AI

TrainKeeper is a **training-time reliability framework** for machine learning systems.
It adds lightweight guardrails around existing training code, without replacing your stack, to make experiments:

- reproducible
- debuggable
- data-safe
- stable during training
- verifiable at the system level

TrainKeeper focuses on the part of the workflow that is easiest to overlook:  
👉 what happens *inside* the training loop.

---

## 🚨 Why TrainKeeper exists

Many critical ML training failures are **silent**:

- non-deterministic experiments  
- unnoticed data corruption or drift  
- exploding / vanishing gradients  
- NaN loss propagation  
- broken resumes and unreproducible results  

TrainKeeper turns training into a **controlled system** rather than a script.

It does this by providing:

- experiment control  
- data integrity checks  
- training-time instrumentation  
- automatic failure capture  
- and system-level validation scenarios

---

## 📦 Install

```bash
pip install trainkeeper
```

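Optional extras (quoted so that shells such as zsh do not try to expand the brackets):

```bash
pip install "trainkeeper[torch]"
pip install "trainkeeper[wandb]"
pip install "trainkeeper[mlflow]"
```

The package metadata also declares `vision`, `nlp`, `tabular`, `bench`, and `dev` extras.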

## ⚡ Quick start

```python
from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    print("TrainKeeper is running.")
    # your normal training loop

if __name__ == "__main__":
    train()
```

Each run automatically produces:

- `experiment.yaml`, `run.json`
- `system.json`, `env.txt`
- `seeds.json`, `run.sh`
- checkpoints and failure reports

No pipeline rewrite. No framework lock-in.
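
To make the artifact list above more concrete, here is a minimal standard-library sketch of the kind of information that files such as `seeds.json` and `system.json` could hold. The helper names (`seed_everything`, `capture_system`, `write_run_artifacts`) and the JSON layout are hypothetical illustrations, not TrainKeeper's actual API or file format.

```python
import json
import platform
import random
import sys
from pathlib import Path


def seed_everything(seed: int) -> dict:
    """Seed the stdlib RNG and return a record of what was seeded.

    Hypothetical helper: a real setup would also seed NumPy, PyTorch, etc.
    """
    random.seed(seed)
    return {"python_random": seed}


def capture_system() -> dict:
    """Collect basic interpreter and OS facts (assumed layout)."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "executable": sys.executable,
    }


def write_run_artifacts(run_dir: Path, seed: int) -> None:
    """Write seed and system records as JSON files inside run_dir."""
    run_dir.mkdir(parents=True, exist_ok=True)
    seeds = seed_everything(seed)
    (run_dir / "seeds.json").write_text(json.dumps(seeds, indent=2))
    (run_dir / "system.json").write_text(json.dumps(capture_system(), indent=2))
```

Calling `write_run_artifacts(Path("runs/demo"), seed=123)` before training would leave two small JSON files that a later run can read back to re-seed itself and compare environments.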

---

## 🧠 Core runtime modules

| Module | Purpose |
|--------|---------|
| `experiment` | reproducible runs, environment capture, replay |
| `datacheck` | schema enforcement, drift detection, data profiling |
| `debugger` | training hooks, instability detection, failure snapshots |
| `trainutils` | deterministic dataloaders, mixed precision, checkpoints |
| `monitor` | runtime metrics and behavior tracking |
| `pkg` | export helpers (ONNX, TorchScript, packaging) |
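
As a rough illustration of what "schema enforcement" in the `datacheck` row means, the sketch below validates a batch of records against an expected column-to-type mapping and flags NaN values. It is plain Python written for this README; `check_schema` is a hypothetical function and does not reflect the real `datacheck` interface.

```python
import math
from typing import Any


def check_schema(rows: list[dict[str, Any]], schema: dict[str, type]) -> list[str]:
    """Return a human-readable list of schema problems (empty if the data is clean).

    Hypothetical helper: `schema` maps column names to the expected Python type.
    """
    problems = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            if column not in row:
                problems.append(f"row {i}: missing column '{column}'")
                continue
            value = row[column]
            if not isinstance(value, expected):
                problems.append(
                    f"row {i}: '{column}' expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
            elif isinstance(value, float) and math.isnan(value):
                # NaN passes the type check, so it has to be caught explicitly.
                problems.append(f"row {i}: '{column}' is NaN")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a batch at once, which is usually what you want before a long training run.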

---

## 🖥 CLI

```bash
tk init
tk run -- python train.py
tk replay <exp-id> -- python train.py
tk compare <exp-a> <exp-b>
tk repro-summary <runs-dir>
tk doctor
```

The CLI makes TrainKeeper usable as a command-line tool as well as a library.

---

## 🧪 System validation (what makes TrainKeeper different)

TrainKeeper is more than a runtime framework.  
It is validated by a multi-scenario reliability suite that lives in the GitHub repository:

**Scenario 1 — Reproducibility Lab**  
Deterministic execution, resume behavior, experiment traceability.

**Scenario 2 — Data Corruption Lab**  
Schema violations, NaNs, label shift, silent distribution drift.

**Scenario 3 — Training Robustness Lab**  
Exploding gradients, NaN loss, optimizer instability, bad batch capture.

These scenarios are orchestrated by a system hardening layer that produces:

- unified summaries
- failure matrices
- cross-scenario system reports

**In this sense, TrainKeeper tests itself.**

Note what ships where:

- The PyPI package contains the runtime framework only.
- The scenarios and system tests are available only in the repository.
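
The failure modes named in Scenario 3 (NaN loss, sudden instability) can be detected with very little code. The following is a minimal sketch, assuming a positive scalar loss per step; `LossGuard`, its parameters, and the failure-record layout are hypothetical and are not taken from the TrainKeeper `debugger` module.

```python
import math


class LossGuard:
    """Flag non-finite losses and sudden spikes relative to a running average.

    Hypothetical helper; assumes losses are positive scalars.
    """

    def __init__(self, spike_factor: float = 10.0, momentum: float = 0.9):
        self.spike_factor = spike_factor
        self.momentum = momentum
        self.running = None   # exponential moving average of healthy losses
        self.failures = []    # one record per rejected step

    def check(self, step: int, loss: float) -> bool:
        """Return True if the loss looks healthy, else record a failure."""
        if not math.isfinite(loss):
            self.failures.append({"step": step, "reason": "non-finite loss"})
            return False
        if self.running is not None and loss > self.spike_factor * self.running:
            self.failures.append({"step": step, "reason": "loss spike"})
            return False
        # Only healthy losses update the running average.
        if self.running is None:
            self.running = loss
        else:
            self.running = self.momentum * self.running + (1 - self.momentum) * loss
        return True
```

In a training loop you would call `guard.check(step, loss_value)` after each forward pass and, on `False`, save the offending batch before deciding whether to skip the step or stop the run.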

---

## 🏗 Architecture

TrainKeeper inserts a guardrail layer between your training code and the system.

```text
User Training Code
        ↓
TrainKeeper Runtime (experiment, datacheck, debugger, trainutils)
        ↓
Structured Artifacts & Reports
        ↓
System Validation Layer (scenarios + system tests)
```

(Full architecture diagram is available in the GitHub repository.)
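
One way to picture the "guardrail layer" idea is a decorator that wraps an unmodified training function and always leaves a structured report behind, whether the run succeeds or fails. This is only a sketch under that assumption: `guarded` and the report fields are hypothetical, and TrainKeeper's own `run_reproducible` may work quite differently.

```python
import functools
import json
import time
import traceback
from pathlib import Path


def guarded(run_dir: Path):
    """Decorator factory that writes run.json after the wrapped function exits.

    Hypothetical illustration of a guardrail layer, not TrainKeeper's API.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            run_dir.mkdir(parents=True, exist_ok=True)
            report = {"function": fn.__name__, "started": time.time(), "status": "ok"}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                # Record the failure, then re-raise so callers still see it.
                report["status"] = "failed"
                report["error"] = repr(exc)
                report["traceback"] = traceback.format_exc()
                raise
            finally:
                report["finished"] = time.time()
                (run_dir / "run.json").write_text(json.dumps(report, indent=2))
        return wrapper
    return decorator
```

Because the report is written in a `finally` block, a crashed run still produces an artifact that downstream tooling can inspect, which is the property the diagram's "Structured Artifacts & Reports" stage relies on.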

---

## 🎓 Typical use cases

- research reproducibility & experiment audits
- training-time debugging
- data integrity enforcement
- reliability testing for ML systems
- controlled failure experiments
- AI systems research platforms

---

## 🔗 Project links

- **GitHub**: https://github.com/mosh3eb/TrainKeeper
- **Issues & roadmap**: https://github.com/mosh3eb/TrainKeeper/issues
