Metadata-Version: 2.4
Name: arc-training
Version: 4.0.1
Summary: Autonomous Recovery Controller - Auto-detect and recover from neural network training failures
Home-page: https://github.com/aryankaushik/arc-training
Author: Aryan Kaushik
Author-email: Aryan Kaushik <a.kaushik0908@gmail.com>
License: AGPL-3.0-or-later
Project-URL: Homepage, https://github.com/aryankaushik/arc-training
Project-URL: Documentation, https://arc-training.readthedocs.io
Project-URL: Repository, https://github.com/aryankaushik/arc-training
Project-URL: Changelog, https://github.com/aryankaushik/arc-training/blob/main/CHANGELOG.md
Keywords: deep-learning,pytorch,training,fault-tolerance,nan-recovery,checkpointing,machine-learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: torch>=1.9.0
Requires-Dist: numpy>=1.19.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Provides-Extra: lightning
Requires-Dist: pytorch-lightning>=1.5.0; extra == "lightning"
Provides-Extra: full
Requires-Dist: scipy>=1.7.0; extra == "full"
Requires-Dist: tqdm>=4.0.0; extra == "full"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# ARC - Autonomous Recovery Controller

[![PyPI version](https://badge.fury.io/py/arc-training.svg)](https://badge.fury.io/py/arc-training)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL_v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)

**A real-time fault-tolerance framework for neural network training.**

ARC monitors training signals (gradients, loss curvature, and Fisher Information) to predict failures and recover from them before they crash your run. Failure prediction uses a Mamba-based state-space model combined with Evidential Deep Learning, so each prediction carries an uncertainty estimate.

## Key Results

| Metric            | ARC v4.0                                                 |
| ----------------- | -------------------------------------------------------- |
| **Recovery Rate** | 100% on core failure types (NaN, Inf, explosion)         |
| **Overhead**      | ~35% (small models), higher on larger models             |
| **vs torchft**    | 3/3 vs 1/3 recoveries                                    |
| **Models Tested** | YOLOv11, DINOv2, Llama-Style, SD-UNet (up to 33M params) |

## Quick Start

### Installation

```bash
pip install arc-training
```

### Basic Usage

```python
from arc import Arc

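# `model`, `optimizer`, and `dataloader` are your existing PyTorch objects.
# Wrap your model and optimizer
controller = Arc(model, optimizer)

# Training loop
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch)

    # ARC monitors, predicts, and recovers automatically
    action = controller.step(loss)

    # Skip the update when ARC has rolled back to an earlier state
    if not action.rolled_back:
        loss.backward()
        optimizer.step()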
```

### Lightning Integration

```python
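# Requires the optional extra: pip install "arc-training[lightning]"
import pytorch_lightning as pl

from arc import ArcCallback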

trainer = pl.Trainer(
    callbacks=[ArcCallback()]
)
```

## Configurations

| Config       | Overhead | Use Case                |
| ------------ | -------- | ----------------------- |
| **ARC Lite** | Lower    | Production training     |
| ARC Full     | Higher   | Debugging unstable runs |

## Failure Coverage

| Failure Type       | Detection | Recovery | Status         |
| ------------------ | --------- | -------- | -------------- |
| NaN/Inf Loss       | Yes       | Yes      | Validated      |
| Loss Explosion     | Yes       | Yes      | Validated      |
| Gradient Explosion | Yes       | Yes      | Validated      |
| OOM (all stages)   | Yes       | Yes      | Validated      |
| Accuracy Collapse  | Yes       | Partial  | Detection only |
| Mode Collapse      | Yes       | Partial  | Detection only |
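
To make the NaN/Inf and loss-explosion rows concrete, here is a minimal, library-independent sketch of loss-based detection with checkpoint rollback. It is not ARC's implementation: `LossGuard`, its `explosion_factor` threshold, and the dict-based training state are hypothetical names chosen for illustration, and ARC's predictive model is replaced by a simple running-average rule that assumes a positive loss.

```python
import copy
import math


class LossGuard:
    """Toy detector for NaN/Inf losses and sudden loss explosions.

    Hypothetical helper for illustration only; not part of the ARC API.
    """

    def __init__(self, explosion_factor=10.0, momentum=0.9):
        self.explosion_factor = explosion_factor  # assumed threshold
        self.momentum = momentum
        self.running_loss = None  # moving average of healthy losses
        self.checkpoint = None    # last known-good copy of the state

    def check(self, loss):
        """Return 'ok', 'nan_inf', or 'explosion' for a scalar loss."""
        if not math.isfinite(loss):
            return "nan_inf"
        if (self.running_loss is not None
                and loss > self.explosion_factor * self.running_loss):
            return "explosion"
        return "ok"

    def step(self, loss, state):
        """Checkpoint healthy steps and roll back on failure.

        Returns a (state_to_use, rolled_back) tuple.
        """
        if self.check(loss) != "ok":
            if self.checkpoint is None:
                raise RuntimeError("failure before any checkpoint was saved")
            # Hand back a copy so the stored checkpoint stays untouched.
            return copy.deepcopy(self.checkpoint), True
        # Healthy step: update the running average, then save a checkpoint.
        if self.running_loss is None:
            self.running_loss = loss
        else:
            self.running_loss = (self.momentum * self.running_loss
                                 + (1 - self.momentum) * loss)
        self.checkpoint = copy.deepcopy(state)
        return state, False
```

In a real PyTorch loop, `state` would typically hold copies of `model.state_dict()` and `optimizer.state_dict()`, and a rollback would restore them with `load_state_dict` before training continues.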

## Benchmarks

```text
Core Failure Recovery: 100% (NaN, Inf, explosion across all tests)
Modern Model Recovery: 5/8 induced failures recovered
torchft Comparison:    ARC 3/3 vs torchft 1/3
Models Tested:         YOLOv11, DINOv2-Small, Llama-Style, SD-UNet
```

## Links

- [Efficiency Report](ARC_EFFICIENCY_REPORT.md)

## Citation

```bibtex
@software{arc2026,
  title={ARC: Autonomous Recovery Controller for Neural Network Training},
  author={Kaushik, Aryan},
  year={2026},
  url={https://github.com/a-kaushik2209/ARC}
}
```

## License

Licensed under AGPL-3.0-or-later. See [LICENSE](LICENSE) for details.

Copyright (c) 2026 Aryan Kaushik. All rights reserved.
