Metadata-Version: 2.3
Name: nvidia-resiliency-ext
Version: 0.3.0
Summary: NVIDIA Resiliency Package
License: Apache 2.0
Author: NVIDIA Corporation
Requires-Python: >=3.10
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: defusedxml
Requires-Dist: nvidia-ml-py (>=12.570.86)
Requires-Dist: packaging
Requires-Dist: psutil (>=6.0.0)
Requires-Dist: pynvml (>=12.0.0)
Requires-Dist: pyyaml
Requires-Dist: torch (>=2.3.0)
Project-URL: Repository, https://github.com/NVIDIA/nvidia-resiliency-ext
Description-Content-Type: text/markdown

# NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.

## Core Components and Capabilities

- **Fault Tolerance**
  - Detecting hung ranks.
  - Restarting training in-job, without reallocating SLURM nodes.

- **In-Process Restarting**
  - Detecting failures and enabling quick recovery.

- **Async Checkpointing**
  - Providing an efficient framework for asynchronous checkpointing.

- **Local Checkpointing**
  - Providing an efficient framework for local checkpointing.

- **Straggler Detection**
  - Monitoring GPU and CPU performance of ranks.
  - Identifying slower ranks that may impede overall training efficiency.

- **PyTorch Lightning Callbacks**
  - Facilitating seamless NVRx integration with PyTorch Lightning.

## Installation

### From source
- `git clone https://github.com/NVIDIA/nvidia-resiliency-ext`
- `cd nvidia-resiliency-ext`
- `pip install .`


### From PyPI wheel
- `pip install nvidia-resiliency-ext`
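
After either install method, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata with the standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical and is not part of NVRx.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "nvidia-resiliency-ext") -> str | None:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not an NVRx API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"nvidia-resiliency-ext: {found or 'not installed'}")
```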

### Platform Support

| Category             | Supported Versions / Requirements                                                  |
|----------------------|------------------------------------------------------------------------------------|
| Architecture         | x86_64, arm64                                                                      |
| Operating System     | Ubuntu 22.04, 24.04                                                                |
| Python Version       | >= 3.10, < 3.13                                                                    |
| PyTorch Version      | >= 2.3.1 (in-job, checkpointing); 2.5.1 and 2.6.0 (in-process)                     |
| CUDA & CUDA Toolkit  | >= 12.5 (12.8 required for GPU health check)                                       |
| NVML Driver          | >= 535 (570 required for GPU health check)                                         |
| NCCL Version         | >= 2.21.5 (in-job, checkpointing); >= 2.21.5 and <= 2.22.3, or 2.26.2 (in-process) |
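
The Python and PyTorch rows of the table can be checked before a job starts. The sketch below is illustrative only: the function names are hypothetical and not part of NVRx, and it assumes the in-process entry lists exact PyTorch versions (2.5.1 and 2.6.0) rather than a range.

```python
from __future__ import annotations

import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.5.1+cu124' into (2, 5, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def python_supported(info: tuple[int, ...] = tuple(sys.version_info[:2])) -> bool:
    """Table row 'Python Version': >= 3.10, < 3.13."""
    return (3, 10) <= tuple(info[:2]) < (3, 13)


def torch_supported(torch_version: str, inprocess: bool = False) -> bool:
    """Table row 'PyTorch Version'.

    Assumption: the in-process entry means exactly 2.5.1 or 2.6.0.
    """
    v = parse_version(torch_version)
    if inprocess:
        return v in {(2, 5, 1), (2, 6, 0)}
    return v >= (2, 3, 1)


if __name__ == "__main__":
    print("Python supported:", python_supported())
```

With PyTorch installed, `torch_supported(torch.__version__)` applies the same check to the running build.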

## Usage

For detailed documentation and usage information for each component, see the `./docs` directory.

