Metadata-Version: 2.4
Name: arc-training
Version: 4.0.0
Summary: Automatic Recovery Controller - Auto-detect and recover from neural network training failures
Home-page: https://github.com/aryankaushik/arc-training
Author: Aryan Kaushik
Author-email: Aryan Kaushik <a.kaushik0908@gmail.com>
License: AGPL-3.0-or-later
Project-URL: Homepage, https://github.com/aryankaushik/arc-training
Project-URL: Documentation, https://arc-training.readthedocs.io
Project-URL: Repository, https://github.com/aryankaushik/arc-training
Project-URL: Changelog, https://github.com/aryankaushik/arc-training/blob/main/CHANGELOG.md
Keywords: deep-learning,pytorch,training,fault-tolerance,nan-recovery,checkpointing,machine-learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: torch>=1.9.0
Requires-Dist: numpy>=1.19.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Provides-Extra: lightning
Requires-Dist: pytorch-lightning>=1.5.0; extra == "lightning"
Provides-Extra: full
Requires-Dist: scipy>=1.7.0; extra == "full"
Requires-Dist: tqdm>=4.0.0; extra == "full"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# ARC - Automatic Recovery Controller

[![PyPI version](https://badge.fury.io/py/arc-training.svg)](https://badge.fury.io/py/arc-training)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL_v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
[![Tests](https://github.com/aryankaushik/arc-training/actions/workflows/tests.yml/badge.svg)](https://github.com/aryankaushik/arc-training/actions)

**Auto-detect and recover from neural network training failures.**

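ARC automatically detects NaN/Inf losses, loss and gradient explosions, and OOM errors, then recovers training without losing progress. It also detects silent failures such as accuracy collapse and mode collapse, which are currently detection-only.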

## Key Results

| Metric                         | ARC v4.0                        |
| ------------------------------ | ------------------------------- |
| **Recovery Rate**              | 100% (160/160 induced failures) |
| **Overhead**                   | 27% (ARC Lite), 44% (ARC Full)  |
| **Recoveries, ARC vs torchft** | 3/3 vs 1/3                      |
| **Max Model Size Tested**      | 1.5B params                     |

## 🚀 Quick Start

### Installation

```bash
pip install arc-training
```

### Basic Usage (3 lines!)

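```python
from arc import WeightRollback

# `model`, `optimizer`, and `dataloader` are your existing PyTorch objects
arc = WeightRollback(model, optimizer)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous batch
    loss = model(batch)

    # Let ARC inspect the loss before the update is applied
    action = arc.step(loss)

    # Skip the update for this batch if ARC rolled back
    if not action.rolled_back:
        loss.backward()
        optimizer.step()
```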
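
For intuition about what a rollback step involves, here is a minimal, stdlib-only sketch of the general snapshot-and-restore technique. It is not ARC's implementation: `SnapshotRollback`, `TinyState`, and the `snapshot_every` parameter are hypothetical names, and the only assumption about the tracked objects is that they expose `state_dict()` and `load_state_dict()`, as PyTorch modules and optimizers do.

```python
import copy
import math


class SnapshotRollback:
    """Hypothetical helper: keep a last-known-good snapshot, restore it on a bad loss."""

    def __init__(self, *stateful, snapshot_every=1):
        self.stateful = stateful              # objects with state_dict()/load_state_dict()
        self.snapshot_every = snapshot_every  # take a snapshot every N healthy steps
        self.healthy_steps = 0
        self._snapshots = [copy.deepcopy(obj.state_dict()) for obj in stateful]

    def step(self, loss_value):
        """Return True if a rollback happened and the update should be skipped."""
        if not math.isfinite(loss_value):
            # Restore every tracked object to the last known-good state.
            for obj, snap in zip(self.stateful, self._snapshots):
                obj.load_state_dict(copy.deepcopy(snap))
            return True
        self.healthy_steps += 1
        if self.healthy_steps % self.snapshot_every == 0:
            self._snapshots = [copy.deepcopy(obj.state_dict()) for obj in self.stateful]
        return False


class TinyState:
    """Stand-in for a model or optimizer, used only to make the sketch runnable."""

    def __init__(self, weight):
        self.weight = weight

    def state_dict(self):
        return {"weight": self.weight}

    def load_state_dict(self, state):
        self.weight = state["weight"]


model = TinyState(weight=1.0)
guard = SnapshotRollback(model)

model.weight = 2.0                                 # a healthy update
rolled_back = guard.step(0.5)                      # finite loss: snapshot weight=2.0
model.weight = float("nan")                        # a corrupted update
rolled_back_after_nan = guard.step(float("inf"))   # non-finite loss: restore snapshot
```

With real PyTorch objects, the loss tensor would be converted with `loss.item()` before being passed to `step`. Deep-copying full state on every step is the simplest choice but also the most expensive one, which is why the sketch exposes a snapshot interval.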

### Lightning Integration

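```python
# Requires the optional extra: pip install "arc-training[lightning]"
import pytorch_lightning as pl
from arc.integrations import ARCCallback

trainer = pl.Trainer(
    callbacks=[ARCCallback()]
)
```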

## 📊 Configurations

| Config       | Overhead | Use Case                |
| ------------ | -------- | ----------------------- |
| **ARC Lite** | 27%      | Production training     |
| ARC Full     | 44%      | Debugging unstable runs |
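
This README does not state how the overhead figures were measured. One common definition is the relative increase in average wall-clock time per training step, which could be computed along these lines. The helpers `relative_overhead` and `time_steps` are hypothetical and are not part of ARC.

```python
import time


def relative_overhead(baseline_seconds, guarded_seconds):
    """Relative slowdown as a percentage of the baseline step time."""
    return 100.0 * (guarded_seconds - baseline_seconds) / baseline_seconds


def time_steps(step_fn, n_steps=100):
    """Average wall-clock seconds per call of step_fn."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps


# Illustrative numbers only: a step that takes 1.00 s unguarded and 1.27 s guarded.
overhead_pct = relative_overhead(baseline_seconds=1.00, guarded_seconds=1.27)
```

When timing GPU work, call `torch.cuda.synchronize()` before reading the clock so that asynchronous kernels are included in the measurement.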

## 🛡️ What ARC Handles

| Failure Type       | Detection | Recovery | Status         |
| ------------------ | --------- | -------- | -------------- |
| NaN/Inf Loss       | ✅        | ✅       | Validated      |
| Loss Explosion     | ✅        | ✅       | Validated      |
| Gradient Explosion | ✅        | ✅       | Validated      |
| OOM (all stages)   | ✅        | ✅       | Validated      |
| Accuracy Collapse  | ✅        | ⚠️       | Detection only |
| Mode Collapse      | ✅        | ⚠️       | Detection only |
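
As a rough illustration of the first two rows, the sketch below flags non-finite losses and losses that jump far above a running average. It shows one generic way to detect these failures and makes no claim about ARC's internals: `LossMonitor`, `explosion_factor`, and `ema_decay` are hypothetical names with assumed default values, and the check assumes positive loss values.

```python
import math


class LossMonitor:
    """Hypothetical detector: classify each loss as 'ok', 'non_finite', or 'explosion'."""

    def __init__(self, explosion_factor=10.0, ema_decay=0.9):
        self.explosion_factor = explosion_factor  # assumed threshold multiplier
        self.ema_decay = ema_decay                # assumed smoothing factor
        self.ema = None                           # running average of healthy losses

    def check(self, loss_value):
        if not math.isfinite(loss_value):
            return "non_finite"
        if self.ema is not None and loss_value > self.explosion_factor * self.ema:
            return "explosion"
        # Only healthy losses update the running average.
        if self.ema is None:
            self.ema = loss_value
        else:
            self.ema = self.ema_decay * self.ema + (1 - self.ema_decay) * loss_value
        return "ok"


monitor = LossMonitor()
statuses = [monitor.check(x) for x in [2.0, 1.8, 1.5, 40.0, float("nan"), 1.4]]
# statuses == ["ok", "ok", "ok", "explosion", "non_finite", "ok"]
```

Excluding flagged values from the running average keeps a single spike from raising the threshold for the steps that follow.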

## 📈 Benchmarks

```
Recovery Rate: 100% (160/160 induced failures)
Statistical Significance: p < 0.001
Models Tested: CNN, ViT, Transformer, Diffusion (up to 1.5B params)
```

## 🔗 Links

- [Documentation](https://arc-training.readthedocs.io)
- [Paper](link-to-paper)
- [Efficiency Report](ARC_EFFICIENCY_REPORT.md)

## 📜 Citation

```bibtex
@software{arc2026,
  title={ARC: Automatic Recovery Controller for Neural Network Training},
  author={Kaushik, Aryan},
  year={2026},
  url={https://github.com/aryankaushik/arc-training}
}
```

## 📄 License

Licensed under AGPL-3.0-or-later. See [LICENSE](LICENSE) for details.

Copyright (c) 2026 Aryan Kaushik. All rights reserved.
