Metadata-Version: 2.4
Name: collective-trace
Version: 0.1.1
Summary: A monkey-patching tool for tracing collective operations
Home-page: https://github.com/yangrudan/collective_trace
Author: Cookie Yang
Author-email: yangrudan@zhejianglab.org
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Framework :: Pytest
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: torch>=1.10
Requires-Dist: numpy
Requires-Dist: pandas
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# collective_trace

[![Pylint](https://github.com/yangrudan/collective_trace/actions/workflows/pylint.yml/badge.svg)](https://github.com/yangrudan/collective_trace/actions/workflows/pylint.yml)

Trace collective operations for distributed training.

## 0x01 Develop

```bash
# Develop
git clone https://github.com/yangrudan/collective_trace.git
cd collective_trace
pip install -e .

cd ..
torchrun --nproc_per_node=4 -m collective_trace.tests.test_in_torch
torchrun --nproc_per_node=4 -m collective_trace.tests.test_in_cpu --sync_mode

# or
cd collective_trace
# /home/yang is the directory that contains the clone; replace it with your own path
PYTHONPATH=/home/yang torchrun --nproc_per_node=4 tests/test_in_torch.py
```

## 0x02 Usage

```bash
# Install
git clone https://github.com/yangrudan/collective_trace.git
cd collective_trace
pip install -e .
```

Manually update your training code so that tracing is enabled before the training framework is imported:

```python
import torch
import torch.distributed as dist

from collective_trace.collective_trace import trace_all_collectives

trace_all_collectives(trace_file='collective_trace.log')

import megatron  # Megatron now imports the already-patched functions
# Your training code here

```
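
The import order above matters because a name bound before patching keeps pointing at the original function. The following is a minimal sketch of that effect, not the actual implementation of `trace_all_collectives`: `fake_dist` is a hypothetical stand-in for `torch.distributed`, and `trace_call` is an illustrative helper, so the example runs without PyTorch or a process group.

```python
import functools
import time
import types

# Hypothetical stand-in for torch.distributed (assumption, for illustration only).
fake_dist = types.SimpleNamespace(all_reduce=lambda tensor: sum(tensor))


def trace_call(namespace, name, records):
    """Replace namespace.<name> with a wrapper that records each call's duration."""
    original = getattr(namespace, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            records.append((name, time.perf_counter() - start))

    setattr(namespace, name, wrapper)
    return original


records = []

# A reference taken BEFORE patching still points at the untraced function,
# much like `from torch.distributed import all_reduce` executed too early.
early_ref = fake_dist.all_reduce

trace_call(fake_dist, "all_reduce", records)

# A reference taken AFTER patching resolves to the traced wrapper.
late_ref = fake_dist.all_reduce

early_ref([1, 2, 3])  # bypasses the wrapper, nothing is recorded
late_ref([1, 2, 3])   # goes through the wrapper, one record is appended
print(len(records))
```

Only the call made through `late_ref` is recorded, which is why the tracing call should run before any module that imports the collective functions by name.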

<!-- **Prototype**
![Example](docs/image1.png)

**version 0.0**
![Trace](docs/image2.png) -->

**version 0.1 Results**
![Results](docs/image3.png)

## 0x03 Util

```bash
cd utils
python parse_coll_info.py
```

![log](docs/image4.png)

> export PYTHONPATH=/home/yang:$PYTHONPATH  # set the environment variable
>
>
> import sys
> sys.path.insert(0, '/home/yang')  # put /home/yang at the front of the module search path
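
As an alternative to hard-coding `/home/yang`, the search path can be extended programmatically. This is a small sketch under stated assumptions: `prepend_search_path` is a hypothetical helper that is not part of this repository, and the current working directory is used only so the example runs anywhere.

```python
import sys
from pathlib import Path


def prepend_search_path(directory):
    """Put `directory` at the front of sys.path, avoiding duplicate entries."""
    resolved = str(Path(directory).resolve())
    if resolved in sys.path:
        sys.path.remove(resolved)
    sys.path.insert(0, resolved)
    return resolved


# In a real script this would typically be the directory that contains the
# collective_trace checkout, e.g. Path(__file__).resolve().parent.parent.
added = prepend_search_path(Path.cwd())
print(sys.path[0] == added)
```

Resolving the path first keeps the entry valid even if the script later changes its working directory.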
