Metadata-Version: 2.1
Name: catbird
Version: 0.0.2
Summary: Paraphrase generation Toolbox and Benchmark
Home-page: https://github.com/AfonsoSalgadoSousa/catbird
License: MIT
Keywords: nlp,paraphrase generation
Author: Afonso Sousa
Author-email: afonsousa2806@gmail.com
Requires-Python: >=3.8,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: addict (>=2.4.0,<3.0.0)
Requires-Dist: datasets (>=1.16.1,<2.0.0)
Requires-Dist: pytorch-ignite (>=0.4.7,<0.5.0)
Requires-Dist: sentencepiece (>=0.1.96,<0.2.0)
Requires-Dist: tensorboardX (>=2.4.1,<3.0.0)
Requires-Dist: torch (>=1.10.0,<2.0.0)
Requires-Dist: transformers (>=4.14.1,<5.0.0)
Project-URL: Documentation, https://github.com/AfonsoSalgadoSousa/catbird
Project-URL: Repository, https://github.com/AfonsoSalgadoSousa/catbird
Description-Content-Type: text/markdown

<div align="center">
    <p>
    <img src="resources/catbird_logo.svg" width="200"/>
    </p>

  [![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT)
</div>

`Catbird` is an open source paraphrase generation toolkit based on PyTorch.

## Quick Start

### Requirements and Installation
The project requires Python 3.8+ and PyTorch 1.10+.

### Install Catbird

**a. Clone the repository.**
```shell
git clone https://github.com/AfonsoSalgadoSousa/catbird.git
```
**b. Install dependencies.**
This project uses Poetry as its package manager, so make sure you have it installed. For more information, see [Poetry's official documentation](https://python-poetry.org/docs/).
To install the dependencies, run:
```shell
poetry install
```

## Dataset Preparation
Currently, Catbird supports only the [Quora Question Pairs dataset](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs). We recommend downloading and extracting the dataset somewhere outside the project directory and symlinking the dataset root to `$CATBIRD/data`, as shown below. If your folder structure differs, you may need to change the corresponding paths in the config files.

```text
catbird
├── catbird
├── tools
├── configs
├── data
│   ├── quora
│   │   ├── quora_duplicate_questions.tsv
```
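Before training, it can help to confirm that the (possibly symlinked) `data` directory matches the tree above. The following is a minimal sketch that assumes the project root is the directory being checked; `missing_dataset_files` and `EXPECTED_FILES` are hypothetical names used for illustration and are not part of Catbird.

```python
from pathlib import Path

# Relative paths taken from the directory tree above (illustrative constant).
EXPECTED_FILES = ("data/quora/quora_duplicate_questions.tsv",)


def missing_dataset_files(project_root="."):
    """Return the expected dataset files that are absent under project_root."""
    root = Path(project_root)
    # Path.is_file() follows symlinks, so a symlinked data/ directory is fine.
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("Dataset layout looks correct.")
```

If the list is non-empty, either fix the symlink or adjust the paths in the config files.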
We use the [HuggingFace Datasets library](https://huggingface.co/docs/datasets/) to load the datasets.
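To get a feel for the data independently of Catbird's own loading code, the TSV can be inspected with the Python standard library. This sketch is not how Catbird loads the dataset; it assumes the column names of the public Quora release (`question1`, `question2`, `is_duplicate`), and `read_paraphrase_pairs` is a hypothetical helper.

```python
import csv


def read_paraphrase_pairs(tsv_path):
    """Yield (question1, question2) for rows labelled as duplicates.

    Assumes the header of the public Quora release:
    id, qid1, qid2, question1, question2, is_duplicate.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Only pairs marked as duplicates are usable as paraphrases.
            if row.get("is_duplicate") == "1":
                yield row["question1"], row["question2"]
```

For example, `next(read_paraphrase_pairs("data/quora/quora_duplicate_questions.tsv"))` would return the first question pair labelled as a duplicate, provided the file is in place.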

### Train

```shell
poetry run python tools/train.py ${CONFIG_FILE} [optional arguments]
```

Example: train T5 on Quora Question Pairs (QQP).
```shell
poetry run python tools/train.py configs/t5_quora.yaml
```

## Contributors
* [Afonso Sousa][1] (afonsousa2806@gmail.com)

[1]: https://github.com/AfonsoSalgadoSousa
