Metadata-Version: 2.1
Name: engawa
Version: 0.1.5
Summary: 
Author: sobamchan
Author-email: oh.sore.sore.soutarou@gmail.com
Requires-Python: >=3.10.8,<4.0.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: click (>=8.1.3,<9.0.0)
Requires-Dist: datasets (>=2.8.0,<3.0.0)
Requires-Dist: nltk (>=3.8,<4.0)
Requires-Dist: pytorch-lightning (>=1.8.6,<2.0.0)
Requires-Dist: sentencepiece (>=0.1.97,<0.2.0)
Requires-Dist: sienna (>=0.1.5,<0.2.0)
Requires-Dist: tokenizers (>=0.13.2,<0.14.0)
Requires-Dist: transformers (>=4.25.1,<5.0.0)
Requires-Dist: typer (>=0.9.0,<0.10.0)
Requires-Dist: wandb (>=0.13.7,<0.14.0)
Description-Content-Type: text/markdown

# engawa

<img align="center" src="img/logo.jpg" width="200" height="200" />

**NOT YET FULLY TESTED**

A simple implementation for pre-training BART from scratch on your own corpus.


# Usage

engawa is pip-installable and provides CLI commands for training a tokenizer and pre-training BART.

## Installation

```bash
pip install engawa
```

## Build tokenizer

```bash
engawa train-tokenizer --data-path /path/to/train.txt --save-dir /path/to/save

# Check out the other options with
engawa train-tokenizer --help
```

## Pre-train BART

```bash
engawa train-model \
  --tokenizer-file /path/to/tokenizer.json \
  --train-file /path/to/train.txt \
  --val-file /path/to/val.txt \
  --default-root-dir /path/to/save/things

# Check out the other options with
engawa train-model --help
```
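
The `--train-file` and `--val-file` options above expect two separate text files. If you only have a single corpus file, a minimal sketch like the following could produce them. It is not part of engawa: the helper name `split_corpus`, the 5% validation ratio, and the assumption that the corpus holds one text per line are all illustrative, so adjust them to your data.

```python
import random
from pathlib import Path


def split_corpus(corpus_path, train_path, val_path, val_ratio=0.05, seed=0):
    """Shuffle the non-empty lines of a corpus and write train/validation files.

    Hypothetical helper, not part of engawa. Assumes one text per line.
    Returns the number of lines written to each file as (n_train, n_val).
    """
    text = Path(corpus_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]

    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(lines)

    # Hold out at least one line for validation when the corpus allows it.
    n_val = max(1, int(len(lines) * val_ratio)) if len(lines) > 1 else 0
    val_lines, train_lines = lines[:n_val], lines[n_val:]

    Path(train_path).write_text(
        "".join(line + "\n" for line in train_lines), encoding="utf-8"
    )
    Path(val_path).write_text(
        "".join(line + "\n" for line in val_lines), encoding="utf-8"
    )
    return len(train_lines), len(val_lines)
```

Calling `split_corpus("corpus.txt", "train.txt", "val.txt")` would write the two files, which can then be passed to `engawa train-model` via `--train-file` and `--val-file`.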

