Metadata-Version: 2.1
Name: fast-fit
Version: 0.1.2
Summary: Fast and effective approach for few shot with many classes
Home-page: https://github.com/IBM/fastfit
Author: Elron Bandel & Asaf Yehudai
Author-email: elron.bandel@ibm.com
License: Apache 2.0
Project-URL: Source, https://github.com/IBM/fastfit/
Keywords: text classification machine learning NLP
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: transformers
Provides-Extra: dev
Requires-Dist: check-manifest; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Provides-Extra: test
Requires-Dist: coverage; extra == "test"

## Running the Training Script

The package provides a command-line tool, `train_fastfit`, for training text classification models. Its parameters let you customize the training process.

### Prerequisites

Before running the training script, ensure you have Python installed along with our package and its dependencies. If you haven't already installed our package, you can do so using pip:

```bash
pip install fast-fit
```

### Usage

To run the training script with a custom configuration, use the `train_fastfit` command followed by the necessary arguments. The arguments are similar to the Hugging Face training arguments, with a few additions specific to fast-fit. The general syntax is:

```bash
train_fastfit --model_name_or_path [MODEL_NAME] --overwrite_output_dir [BOOLEAN] --report_to [REPORT_SETTING] --label_column_name [LABEL_COLUMN_NAME] --text_column_name [TEXT_COLUMN_NAME] --max_steps [MAX_STEPS] --dataloader_drop_last [BOOLEAN] --per_device_train_batch_size [BATCH_SIZE] --per_device_eval_batch_size [BATCH_SIZE] --evaluation_strategy [EVAL_STRATEGY] --num_repeats [NUM_REPEATS] --save_strategy [SAVE_STRATEGY] --proj_dim [PROJECTION_DIMENSION] --fp16 [BOOLEAN] --learning_rate [LEARNING_RATE] --optim [OPTIMIZER] --do_train [BOOLEAN] --do_eval [BOOLEAN] --max_text_length [MAX_TEXT_LENGTH] --output_dir [OUTPUT_DIR] --train_file [TRAIN_FILE] --validation_file [VALIDATION_FILE]
```

Replace the bracketed terms with your desired settings. Here's an explanation of each parameter:

- `model_name_or_path`: Identifier or path of the model (e.g., 'roberta-large').
- `overwrite_output_dir`: Whether to overwrite the output directory (`True` or `False`).
- `report_to`: Destination for logging or reporting (e.g., 'none').
- `label_column_name`: Column name for labels in your dataset.
- `text_column_name`: Column name for text data in your dataset.
- `max_steps`: Maximum number of training steps (e.g., 1500).
- `dataloader_drop_last`: Whether to drop the last incomplete batch (`True` or `False`).
- `per_device_train_batch_size`: Batch size per device during training.
- `per_device_eval_batch_size`: Batch size per device during evaluation.
- `evaluation_strategy`: Strategy for evaluation (e.g., 'no', 'steps').
- `num_repeats`: Number of noisy embedding repetitions.
- `save_strategy`: Model saving strategy (e.g., 'no', 'epoch').
- `proj_dim`: Projection dimension for model-specific projections (e.g., 128).
- `fp16`: Use of mixed precision training (`True` or `False`).
- `learning_rate`: Learning rate for optimizer (e.g., 1e-5).
- `optim`: Choice of optimizer (e.g., 'adamw_hf').
- `do_train`: Flag to run training (`True` or `False`).
- `do_eval`: Flag to run evaluation (`True` or `False`).
- `max_text_length`: Maximum text sequence length.
- `output_dir`: Output directory for model checkpoints and results.
- `train_file`: Path to training data file.
- `validation_file`: Path to validation data file.
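
When the same run has to be launched repeatedly, it can help to keep the parameters above in a dictionary and build the command line from it. The following is a minimal sketch, not part of the package: `build_train_command` is a hypothetical helper, and it assumes every option is passed as a `--name value` pair, as in the syntax shown above.

```python
import shlex

def build_train_command(params, executable="train_fastfit"):
    """Turn a mapping of option names to values into an argv list.

    Hypothetical helper: each entry becomes a `--name value` pair.
    """
    argv = [executable]
    for name, value in params.items():
        argv.extend([f"--{name}", str(value)])
    return argv

# A subset of the parameters described above.
params = {
    "model_name_or_path": "roberta-large",
    "max_steps": 1500,
    "per_device_train_batch_size": 8,
    "do_train": True,
    "output_dir": "./output",
}

argv = build_train_command(params)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually start the run, the list could be passed to `subprocess.run(argv, check=True)`; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.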

### Example Command

Here's an example of how to use the `train_fastfit` command with specific settings:

```bash
train_fastfit --model_name_or_path roberta-large --overwrite_output_dir True --report_to none --label_column_name label --text_column_name text --max_steps 1500 --dataloader_drop_last True --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --evaluation_strategy no --num_repeats 4 --save_strategy no --proj_dim 128 --learning_rate 1e-5 --optim adamw_hf --do_train True --do_eval True --max_text_length 256 --output_dir ./output --train_file $TRAIN_FILE --validation_file $DEV_FILE
```
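
The example command reads its data from `$TRAIN_FILE` and `$DEV_FILE` and looks up the columns named by `--text_column_name` and `--label_column_name`. This document does not say which file formats are accepted, so the sketch below rests on an assumption: that a CSV file with a header row works, as it commonly does for Hugging Face example scripts. The `write_split` helper and the two example rows are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

def write_split(path, rows, text_column="text", label_column="label"):
    """Write (text, label) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([text_column, label_column])
        writer.writerows(rows)

# Made-up examples, for illustration only.
train_rows = [
    ("My card has not arrived yet", "card_arrival"),
    ("How do I change my PIN?", "change_pin"),
]

# Write to a temporary directory so the sketch leaves no files behind in the project.
out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "train.csv"
write_split(train_path, train_rows)

print(train_path.read_text(encoding="utf-8"))
```

The header names must match the values given to `--text_column_name` and `--label_column_name` (`text` and `label` in the example command).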

### Output

Upon execution, `train_fastfit` will start the training process based on your parameters and output the results, including logs and model checkpoints, to the designated directory.

## Training with Python
You can also run training directly from Python:

```python
from datasets import load_dataset
from fastfit import FastFitTrainer, sample_dataset

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("mteb/banking77")
dataset["validation"] = dataset["test"]

# Downsample the training data for 5-shot training
dataset["train"] = sample_dataset(dataset["train"], label_column="label", num_samples=5)

trainer = FastFitTrainer(
    model_name_or_path="roberta-large",
    overwrite_output_dir=True,
    report_to="none",
    label_column_name="label_text",
    text_column_name="text",
    max_steps=1500,
    dataloader_drop_last=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="no",
    num_repeats=4,
    save_strategy="no",
    proj_dim=128,
    learning_rate=1e-5,
    optim="adamw_hf",
    max_text_length=256,
    output_dir="./output",
    dataset=dataset,
)

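# Train the model, then run evaluation and testing
model = trainer.train()
results = trainer.evaluate()
test_results = trainer.test()

# Save the trained model to the local "fast-fit" directory
model.save_pretrained("fast-fit")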
```
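
The training example downsamples the training split for 5-shot training, that is, a handful of examples per class. To make the idea concrete, here is a minimal pure-Python sketch of k-shot sampling over a list of dictionaries. It is not the implementation of `sample_dataset`; the function `sample_k_per_label` and its parameters are hypothetical and only illustrate the technique.

```python
import random
from collections import defaultdict

def sample_k_per_label(examples, label_key="label", k=5, seed=42):
    """Keep at most k randomly chosen examples for each label."""
    # Group the examples by their label.
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_key]].append(example)

    # Draw up to k examples from each group with a fixed seed.
    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        sampled.extend(rng.sample(group, min(k, len(group))))
    return sampled

# Toy data: 30 examples spread evenly over 3 labels.
toy = [{"text": f"example {i}", "label": i % 3} for i in range(30)]
few_shot = sample_k_per_label(toy, k=5)
print(len(few_shot))  # 15 (3 labels x 5 examples)
```

Fixing the seed keeps the sampled subset reproducible across runs, which matters when comparing few-shot results.
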
After training and saving the model, you can load it for inference:
```python
from fastfit import FastFit
from transformers import AutoTokenizer, pipeline

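# Load the saved model and the tokenizer of the base model used for training
model = FastFit.from_pretrained("fast-fit")
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
# Wrap both in a text-classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)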

print(classifier("I love this package!"))
```
