Metadata-Version: 2.1
Name: oobleck
Version: 0.1.1
Summary: A framework for efficient fault tolerance in large-scale distributed training with pipeline templates.
Author-email: Insu Jang <insujang@umich.edu>
Maintainer-email: Insu Jang <insujang@umich.edu>
License: Copyright (c) 2023 SymbioticLab, The University of Michigan
        
        Permission is hereby granted, free of charge, to any person obtaining a copy of
        this software and associated documentation files (the "Software"), to deal in
        the Software without restriction, including without limitation the rights to
        use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
        of the Software, and to permit persons to whom the Software is furnished to do
        so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: transformers>=4.36.0
Requires-Dist: colossalai==0.3.6
Requires-Dist: click
Requires-Dist: loguru
Requires-Dist: fabric
Requires-Dist: cornstarch
Requires-Dist: grpcio
Requires-Dist: pulp
Provides-Extra: dev
Requires-Dist: torch>=2.1.0; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: isort>=5.12; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-mock; extra == "dev"
Requires-Dist: pytest-grpc; extra == "dev"
Requires-Dist: grpcio-tools; extra == "dev"
Requires-Dist: datasets; extra == "dev"

<h1 align="center">Oobleck<br>
Resilient Distributed Training Framework</h1>

Oobleck is a large-model training framework with fast fault recovery support utilizing the concept of *pipeline templates*.

It is the first training framework that realizes:

- **Dynamic reconfiguration**: Oobleck can reconfigure the distributed training configuration after failures without restarting.
- **Pipeline template instantiation**: Oobleck pre-generates a set of pipeline templates, and then combines their instantiated pipelines to form a distributed execution plan. After failures, the same set of pipeline templates is reused and different pipelines are instantiated from it.
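
To make the idea of reusing one set of templates more concrete, here is a toy sketch in Python. It is not Oobleck's planner: it assumes each template is described only by the number of nodes it occupies, and it picks any combination of templates that exactly covers the available nodes, preferring larger templates. The function name `plan_pipelines` and the template sizes are hypothetical.

```python
from functools import lru_cache


def plan_pipelines(template_sizes, num_nodes):
    """Pick template sizes (with repetition) that sum to num_nodes.

    Toy illustration only: a template is reduced to its node count, and
    larger templates are tried first. Returns None if no combination fits.
    """
    sizes = sorted(set(template_sizes), reverse=True)

    @lru_cache(maxsize=None)
    def solve(remaining, start):
        if remaining == 0:
            return ()
        for i in range(start, len(sizes)):
            size = sizes[i]
            if size <= remaining:
                rest = solve(remaining - size, i)
                if rest is not None:
                    return (size,) + rest
        return None

    result = solve(num_nodes, 0)
    return None if result is None else list(result)


# Hypothetical templates spanning 2, 3, and 4 nodes.
templates = [2, 3, 4]

# Plan for 13 nodes, then re-plan with the same templates after one node fails.
plan_before = plan_pipelines(templates, 13)
plan_after = plan_pipelines(templates, 12)
print(plan_before, plan_after)
```

The point of the sketch is that the set of templates stays fixed while the node count changes; only the choice of which templates to instantiate is recomputed.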

## Getting Started

### Install

Use `pip` to install Oobleck:
```
pip install oobleck
```

Oobleck relies on [`cornstarch`](https://github.com/Symbioticlab/cornstarch) for pipeline templates and on [`Colossal-AI`](https://github.com/hpcaitech/ColossalAI) as the training backend.
Optionally, install [`apex`](https://github.com/nvidia/apex), [`xformers`](https://github.com/facebookresearch/xformers) and [`flash-attn`](https://github.com/Dao-AILab/flash-attention) to boost throughput (follow instructions in each README).

### Run

Please refer to [this README](./examples/README.md).

### Cluster Management

Oobleck provides a command line interface (CLI) that manages the cluster. Use `oobleck` to access the master agent:

```
$ oobleck --ip <master_ip> --port <master_port> <command> <command_options>
```
where the master port can be found in the `stdout` of the running master service:

```
| INFO     | __main__:serve:430 - Running master service on port 45145
```
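
If you capture the master's output to a file or a string, the port can be extracted programmatically. The following is a minimal sketch that assumes the log line keeps the exact wording shown above; the helper name `find_master_port` is hypothetical and not part of Oobleck.

```python
import re

# Assumes the message text matches the sample log line above.
PORT_PATTERN = re.compile(r"Running master service on port (\d+)")


def find_master_port(log_text):
    """Return the master port found in log_text, or None if absent."""
    match = PORT_PATTERN.search(log_text)
    return int(match.group(1)) if match else None


sample_log = "| INFO     | __main__:serve:430 - Running master service on port 45145"
master_port = find_master_port(sample_log)
print(master_port)
```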

Currently, you can list the agents and send a request to gracefully terminate an agent:

```
$ oobleck --ip <master_ip> --port <master_port> get_agent_list
=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
==============

$ oobleck --ip <master_ip> --port <master_port> kill_agent --agent_index 2
| INFO     | __main__:KillAgent:340 - Terminating agent 2 on node2:10000
```
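
For scripting around the CLI, the `get_agent_list` output can be parsed into structured records. This sketch assumes the output format is exactly as in the sample above; `AgentInfo` and `parse_agent_list` are hypothetical names, not Oobleck APIs.

```python
import re
from dataclasses import dataclass


@dataclass
class AgentInfo:
    index: int
    address: str
    status: str
    devices: list[int]


# Matches lines such as: [0] IP: node1:10000 Status: up (device indices: 0,1)
AGENT_LINE = re.compile(
    r"^\[(\d+)\] IP: (\S+) Status: (\w+) \(device indices: ([\d,]+)\)$"
)


def parse_agent_list(output):
    """Parse agent lines from the CLI output, skipping headers and footers."""
    agents = []
    for line in output.splitlines():
        match = AGENT_LINE.match(line.strip())
        if match:
            index, address, status, devices = match.groups()
            agents.append(
                AgentInfo(
                    index=int(index),
                    address=address,
                    status=status,
                    devices=[int(d) for d in devices.split(",")],
                )
            )
    return agents


sample_output = """=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
=============="""

agents = parse_agent_list(sample_output)
for agent in agents:
    print(agent)
```

Keeping the agent index alongside the address makes it easy to look up which node a given `--agent_index` refers to before terminating it.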

## Citation

```bibtex
@inproceedings{oobleck-sosp23,
    title     = {Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates},
    author    = {Jang, Insu and Yang, Zhenning and Zhang, Zhen and Jin, Xin and Chowdhury, Mosharaf},
    booktitle = {ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23)},
    year      = {2023},
}
```
