Metadata-Version: 2.4
Name: hpc-launcher
Version: 1.0.3
Summary: LBANN Launcher utilities for distributed jobs on HPC clusters
Home-page: https://github.com/LBANN/HPC-launcher
Author: Lawrence Livermore National Laboratory
Author-email: lbann@llnl.gov
License: Apache-2.0
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: psutil
Requires-Dist: nvidia-ml-py
Provides-Extra: torch
Requires-Dist: torch; extra == "torch"
Requires-Dist: numpy; extra == "torch"
Provides-Extra: mpi
Requires-Dist: mpi4py>=3.1.4; extra == "mpi"
Requires-Dist: mpi_rdv; extra == "mpi"
Provides-Extra: testing
Requires-Dist: pytest; extra == "testing"
Provides-Extra: e2e-testing
Requires-Dist: accelerate; extra == "e2e-testing"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# HPC-launcher Repository

The HPC launcher repository contains a set of helpful scripts and
Python bindings for launching PyTorch (torchrun), LBANN 2.0
(PyTorch-core), or generic scripts on multiple leadership-class HPC
systems.  There are optimized routines for FLUX, SLURM, and LSF
launchers.  Additionally, there are optimized environments for systems
at known compute centers.  Currently there are supported systems at:
 - LLNL Livermore Computing (LC)
 - LBL NERSC (Pending)
 - ORNL OLCF (Pending)
 - RIKEN (Pending)

There are two main entry points into HPC-Launcher from the cli:
`launch` and `torchrun-hpc`.  `torchrun-hpc` is intended as a
replacement for `torchrun`, while `launch` is a generic interface for
launching parallel jobs.

## Installation

To install the package, install released versions from PyPI run:
```bash
pip install hpc-launcher
```

Or install directly from GitHub:
```bash
pip install git+https://github.com/LBANN/HPC-launcher.git
```

## Example Usage

Using the launch command to execute a command in parallel
```
launch -N1 -n1 hostname
```

Using the torchrun-hpc command to execute a PyTorch Python file in parallel on two nodes and four processes per node (8 in total):
```
torchrun-hpc -N2 -n4 file.py [arguments to Python file]
```

Using HPC-Launcher within existing PyTorch code with explicity invoking it from the command line (CLI).  Within the top level Python file, import `hpc_launcher.torch` first to ensure that `torch` is configured per HPC-Launcher's specification.
```
import hpc_launcher.torch
```

## CLI options for HPC-Launcher `launch` and `torchrun-hpc` commands

- [`launch`](./launch_cli.md) - General purpose HPC job launcher
- [`torchrun-hpc`](./torchrun-hpc_cli.md) - PyTorch-specific distributed training launcher

# LBANN: Livermore Big Artificial Neural Network Toolkit

The Livermore Big Artificial Neural Network toolkit (LBANN) is an
open-source, HPC-centric, deep learning training framework that is
optimized to compose multiple levels of parallelism.

LBANN provides model-parallel acceleration through domain
decomposition to optimize for strong scaling of network training.  It
also allows for composition of model-parallelism with both data
parallelism and ensemble training methods for training large neural
networks with massive amounts of data.  LBANN is able to advantage of
tightly-coupled accelerators, low-latency high-bandwidth networking,
and high-bandwidth parallel file systems.

![Block Diagram of LBANN v2.x](LBANN_2.0_Block_Diagram.png "Block
Diagram of LBANN v2.x architecture")

LBANN v2.x is composed of a custom backend LBANN device that is used
to provide processor-centric optimizations such as copy-elision for
AMD MI300A APUs.  Additionally, it is composed of Python, C++, CUDA,
and ROCm custom kernels that extend PyTorch 2.4+.  Libraries such as
DGraph, DistConv, and CheckMate, implement key algorithms using the
PyTorch 2.x API.  Each of these libraries should be both composable as
well as fully separable.  The suite of LBANN 2.x optimizations are
found in the [LBANN](https://github.com/orgs/LBANN/repositories)
GitHub group.

## Publications

A list of publications, presentations and posters are shown
[here](https://lbann.readthedocs.io/en/latest/publications.html).

## Reporting issues
Issues, questions, and bugs can be raised on the [Github issue
tracker](https://github.com/LBANN/HPC-launcher/issues).
