Metadata-Version: 2.1
Name: torchelastic
Version: 0.1.0.dev1
Summary: PyTorch Elastic Training
Home-page: https://github.com/pytorch/elastic
Author: PyTorch Elastic Devs
Author-email: torchelastic@fb.com
License: BSD-3
Keywords: pytorch,machine learning,elastic,distributed
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: System :: Distributed Computing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: torch (>=1.3.0)
Requires-Dist: python-etcd (>=0.4.5)

[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](LICENSE)[![CircleCI](https://circleci.com/gh/pytorch/elastic.svg?style=svg&circle-token=9bea46e94adbe2f3e0fb2d4054b1b655f2e208c2)](https://circleci.com/gh/pytorch/elastic)

# PyTorch Elastic
PyTorch Elastic (torchelastic) is a framework that enables distributed training jobs to be executed
in a fault tolerant and elastic manner. It provides the primitives and interfaces
for you to write your distributed PyTorch job in such a way that it can be run on 
multiple machines with elasticity; that is, 
your distributed job is able to start as soon as *min* number of workers are 
present and allowed to grow up to *max* number of workers without being stopped
or restarted. 

## Use cases
###### Fault Tolerant Jobs
Jobs that run on infrastructure where nodes get replaced frequently, either due
to flaky hardware or by design. Or mission critical production grade jobs that
need to be run with resilience to failures.

###### Dynamic Capacity Management
Jobs that run on leased capacity that can be taken away at any time (e.g. AWS
spot instances) or shared pools where the pool size can change dynamically
based on demand.

## Quickstart
Use one of the included examples and get a job running by following our [Quickstart 
guide](aws/README.md).

## Requirements
torchelastic requires
* python3
* torch
* etcd

## Installation
```bash
pip install torchelastic
```

## How torchelastic works

torchelastic induces users to think about their PyTorch jobs as a `train_step` and `state`.
It provides the basic control loop that repetitively executes the user provided
`train_step` being aware of faults, exceptions, and cluster membership changes.

##### Train Step

The `train_step` is a *unit of work*, typically (although not necessarily) 
mapping to the processing of a *mini-batch* of training data. All `k` workers
in a distributed torchelastic job run the `train_step()`, each contributing
to the final output. What each worker does in a `train_step` and what input
data it consumes is a user-defined.

In the simplest use-case, torchelastic drives the execution of `train_step` until 
either:

1. the input data is exhausted
2. an unrecoverable failure condition is encountered
3. some other user-defined *end of job* criteria is met

Each `train_step` may not be fully independent since the computations performed
in the previous `train_step` can be used and/or updated in the next one.
    Information is carried across `train_step` calls using the `state` object.

##### State
The `state`, as the name implies, is an object that carries
persistent information throughout the lifetime of the job and is expected to be
updated on each `train_step`. For example, in a training job, one of the
information that the `state` carries is the model weights.
In practice it contains other (meta)data that must be persisted between 
`train_steps`, for instance, the offset or index of the data stream. The `state`
object is the only input parameter to the `train_step`.

##### Train Loop
The `train_step` is executed in a `train_loop` by torchelastic. The `train_loop`
is a fancy `while`-loop that enables the execution of the job with 
fault tolerance and elasticity. torchelastic works at `train_step` granularity,
hence when a fault occurs **during** a `train_step` the computations performed during
the failed `train_step` are considered lost and the state is restored to the previously
succeeded `train_step`.

##### Rendezvous
Torchelastic jobs define a `[min, max]` range of number of workers that it can 
run with. For instance, `[2, 10]` means that the job can start when 
**at least** two workers are present, and can be scaled **up to** ten workers
at runtime. 

Each time there is a change in membership in the set of workers, 
torchelastic runs a`rendezvous`, which serves the following purposes:

1. **barrier** - all nodes will block until `rendezvous` is complete before resuming execution.
2. **role assignment** - on each `rendezvous` each node is assigned a unique integer valued *rank* between `[0, n)` where `n` is the world size  (total number of workers).
3. **world size broadcast** - on each `rendezvous` all nodes receive the new `world_size`. 

The resource manager is free to add/remove instances from a torchelastic job as 
long as the total number of workers remain within `[2, 10]`. This is what we
refer to as **elasticity**. Additionally, in the event of a worker node failure,
as long as the failed node is replaced, torchelastic will detect this event as
a membership change and admit the new worker into the group, making the job
**fault-tolerant**.


For additional details refer to the [README](torchelastic/rendezvous/README.md)
in the rendezvous module.

## Usage
Please refer to the [usage documentation](USAGE.md) for details on how to write
and configure a torchelastic job.

## Join the torchelastic community
* Website: 
* Facebook page:
* Mailing list
* irc: 

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License
torchelastic is BSD licensed, as found in the [LICENSE](LICENSE) file.


