Metadata-Version: 2.4
Name: rapidfit
Version: 0.1.0
Summary: Build multi-task classifiers and augment classification datasets with ease
Author-email: Abu Bakr Soliman <bakrianoo@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/bakrianoo/RapidFit
Project-URL: Repository, https://github.com/bakrianoo/RapidFit
Keywords: machine-learning,transformers,multi-task-learning,classification,data-augmentation,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=2.0.0
Requires-Dist: json-repair>=0.55.0
Requires-Dist: rich>=14.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.57.3
Requires-Dist: datasets>=4.2.0
Requires-Dist: scikit-learn>=1.6.1
Requires-Dist: accelerate>=0.26.0
Dynamic: license-file

# RapidFit

Build multi-task classifiers and augment classification datasets with ease.

## Features

- **Multi-task Classification**: Train classifiers that handle multiple classification tasks simultaneously
- **Data Augmentation**: Expand and enhance your classification datasets using LLM-based generation
- **Flexible Saving**: Save data in JSON, JSONL, or CSV formats with incremental or batch saving

## Installation

```bash
pip install rapidfit
```

## Development Installation

```bash
pip install -e .
```

## Quick Start

### Data Augmentation

```python
from rapidfit import LLMAugmenter, SaveFormat

# Prepare seed data
seed_data = {
    "sentiment-analysis": [
        {"text": "I love this product!", "label": "positive"},
        {"text": "Terrible experience.", "label": "negative"},
        {"text": "It's okay, nothing special.", "label": "neutral"},
    ],
    "emotion-analysis": [
        {"text": "This makes me so happy!", "label": "joy"},
        {"text": "I can't believe they did this.", "label": "anger"},
        {"text": "I miss the old days.", "label": "sadness"},
    ],
}

# Initialize augmenter
augmenter = LLMAugmenter(
    api_key="your-openai-api-key",
    base_url=None,  # Optional: custom API endpoint
    model_id="gpt-4.1-mini",  # Optional: model to use
    max_samples_per_task=128,  # Optional: max samples per task
    batch_size=8,  # Optional: samples per generation batch
    max_temperature=0.9,  # Optional: max temperature for sampling
    save_path="./saved",  # Optional: output directory
    save_format=SaveFormat.JSON,  # Optional: json, jsonl, or csv
    save_separated=False,  # Optional: separate file per task
    save_incremental=True,  # Optional: save while generating
)

# Augment dataset
augmented_data = augmenter.augment(seed_data)
```
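
Before spending API calls, it can help to check that the seed data has the shape shown above: a mapping from task name to a list of `{"text", "label"}` samples. The helper below is a minimal sketch; `summarize_seed_data` is a hypothetical name and is not part of RapidFit.

```python
from collections import Counter


def summarize_seed_data(seed_data: dict) -> dict:
    """Check each sample has 'text' and 'label', then count labels per task.

    Hypothetical helper for illustration; not part of the RapidFit API.
    """
    summary = {}
    for task, samples in seed_data.items():
        for i, sample in enumerate(samples):
            missing = {"text", "label"} - sample.keys()
            if missing:
                raise ValueError(f"{task}[{i}] is missing keys: {sorted(missing)}")
        summary[task] = dict(Counter(s["label"] for s in samples))
    return summary


example = {
    "sentiment-analysis": [
        {"text": "I love this product!", "label": "positive"},
        {"text": "Terrible experience.", "label": "negative"},
    ],
}
print(summarize_seed_data(example))
```

A task with very few examples for one label is a reasonable candidate for more seed samples before augmentation.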

### Save Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| `save_path` | `"./saved"` | Directory for output files |
| `save_format` | `json` | Output format: `json`, `jsonl`, `csv` |
| `save_separated` | `False` | Create a separate file for each task |
| `save_incremental` | `True` | Save progressively during generation |
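
To give a feel for how the three formats differ, the sketch below serializes the same two samples as JSON, JSONL, and CSV using only the standard library. The exact layout RapidFit writes (for example, how task names are stored) is not described here, so treat the field layout as an assumption.

```python
import csv
import io
import json

samples = [
    {"text": "I love this product!", "label": "positive"},
    {"text": "Terrible experience.", "label": "negative"},
]


def to_json(rows: list[dict]) -> str:
    # One JSON document holding the whole list.
    return json.dumps(rows, indent=2)


def to_jsonl(rows: list[dict]) -> str:
    # One JSON object per line; easy to append to incrementally.
    return "\n".join(json.dumps(row) for row in rows)


def to_csv(rows: list[dict]) -> str:
    # Header row followed by one row per sample.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_jsonl(samples))
```

Line-oriented formats such as JSONL and CSV can be extended one record at a time, whereas a single JSON document has to be rewritten as a whole.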

### Custom Augmenter

Extend `BaseAugmenter` to create custom augmentation strategies:

```python
from rapidfit import BaseAugmenter, SeedData

class MyAugmenter(BaseAugmenter):
    def augment(self, seed_data: SeedData) -> SeedData:
        # Your augmentation logic here
        return seed_data
```
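
As an illustration of what the body of `augment` could do, the sketch below adds a lowercased copy of each sample. It works on plain dictionaries so it runs without RapidFit installed; `add_lowercase_variants` is a hypothetical helper that a subclass might call from its `augment` method.

```python
def add_lowercase_variants(seed_data: dict) -> dict:
    """Return a copy of seed_data with a lowercased variant of each sample.

    Hypothetical rule-based augmentation; variants that duplicate an
    existing text within the same task are skipped.
    """
    augmented = {}
    for task, samples in seed_data.items():
        result = list(samples)
        seen = {sample["text"] for sample in samples}
        for sample in samples:
            lowered = sample["text"].lower()
            if lowered not in seen:
                seen.add(lowered)
                result.append({"text": lowered, "label": sample["label"]})
        augmented[task] = result
    return augmented


seed = {
    "sentiment": [
        {"text": "Great!", "label": "positive"},
        {"text": "bad", "label": "negative"},
    ],
}
print(add_lowercase_variants(seed))
```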

## Running

### Install Dependencies

```bash
pip install -e .
```

### Run Augmentation

```bash
export OPENAI_API_KEY="your-api-key"
python examples/test_augmentation.py
```

Optional environment variables:
- `OPENAI_BASE_URL` - Custom API endpoint
- `OPENAI_MODEL_ID` - Model to use (default: `gpt-4.1-mini`)

Output is saved to the `./saved/` directory.
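
A script can turn these variables into constructor arguments along the following lines. This is a sketch, not the contents of `examples/test_augmentation.py`: the helper name `augmenter_kwargs_from_env` is hypothetical, and only the keyword names (`api_key`, `base_url`, `model_id`) come from the Quick Start above.

```python
import os


def augmenter_kwargs_from_env(env=None) -> dict:
    """Build LLMAugmenter keyword arguments from environment variables.

    Hypothetical helper; pass a dict as `env` to test without touching
    the real environment.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        # An empty or missing value falls back to the default endpoint.
        "base_url": env.get("OPENAI_BASE_URL") or None,
        "model_id": env.get("OPENAI_MODEL_ID", "gpt-4.1-mini"),
    }


print(augmenter_kwargs_from_env({"OPENAI_API_KEY": "your-api-key"}))
```

The result could then be passed as `LLMAugmenter(**kwargs)`, which keeps the API key out of source files.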

## Classification

### Training a Multihead Classifier

```python
from rapidfit import MultiheadClassifier

# Prepare data (or load from augmented files)
seed_data = {
    "sentiment": [
        {"text": "I love this!", "label": "positive"},
        {"text": "Terrible.", "label": "negative"},
    ],
    "emotion": [
        {"text": "So happy!", "label": "joy"},
        {"text": "I'm angry.", "label": "anger"},
    ],
}

# Create classifier with custom config
classifier = MultiheadClassifier({
    "model_name": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "batch_size": 16,
    "epochs": 10,
    "freeze_epochs": 3,
    "patience": 3,
})

# Train
classifier.train(seed_data)

# Save model
classifier.save("./model")

# Predict
predictions = classifier.predict(["Great product!"], task="sentiment")
print(predictions)  # e.g. [{"label": "positive", "confidence": 0.95}]

# Predict all tasks at once
all_preds = classifier.predict_all_tasks(["Great product!"])
```
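
Since each prediction carries a `confidence` value, a common follow-up step is to keep only confident predictions. The sketch below assumes `predict` returns one `{"label", "confidence"}` dictionary per input text, in input order, as the example output suggests; `filter_confident` is a hypothetical helper.

```python
def filter_confident(texts, predictions, threshold=0.8):
    """Pair each text with its label, dropping low-confidence predictions.

    Assumes predictions[i] belongs to texts[i].
    """
    return [
        (text, pred["label"])
        for text, pred in zip(texts, predictions)
        if pred["confidence"] >= threshold
    ]


texts = ["Great product!", "Hmm."]
# Hand-written predictions standing in for classifier.predict(texts, task="sentiment")
predictions = [
    {"label": "positive", "confidence": 0.95},
    {"label": "negative", "confidence": 0.55},
]
print(filter_confident(texts, predictions))
```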

### Load a Trained Model

```python
from rapidfit import MultiheadClassifier

classifier = MultiheadClassifier()
classifier.load("./model")

predictions = classifier.predict(["Test text"], task="sentiment")
```

### Training Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model_name` | `paraphrase-multilingual-MiniLM-L12-v2` | Hugging Face model used as the encoder |
| `batch_size` | `16` | Training batch size |
| `epochs` | `10` | Fine-tuning epochs |
| `freeze_epochs` | `3` | Epochs with frozen encoder |
| `learning_rate` | `2e-5` | Learning rate for fine-tuning |
| `patience` | `3` | Early stopping patience |
| `dropout_rate` | `0.2` | Dropout rate |
| `label_smoothing` | `0.1` | Label smoothing factor |
| `use_class_weights` | `True` | Handle class imbalance |
| `test_size` | `0.1` | Test split ratio |
| `val_size` | `0.1` | Validation split ratio |
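
Because the classifier accepts a partial config dictionary, any key left out presumably falls back to the default listed above. The sketch below shows one way such a merge could work; `DEFAULTS` and `resolve_config` are hypothetical names, and the values are copied from the table rather than from RapidFit's source.

```python
# Defaults copied from the table above (illustrative, not read from RapidFit).
DEFAULTS = {
    "model_name": "paraphrase-multilingual-MiniLM-L12-v2",
    "batch_size": 16,
    "epochs": 10,
    "freeze_epochs": 3,
    "learning_rate": 2e-5,
    "patience": 3,
    "dropout_rate": 0.2,
    "label_smoothing": 0.1,
    "use_class_weights": True,
    "test_size": 0.1,
    "val_size": 0.1,
}


def resolve_config(overrides=None) -> dict:
    """Merge user overrides onto the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


config = resolve_config({"epochs": 5, "patience": 2})
print(config["epochs"], config["batch_size"])
```

Rejecting unknown keys catches typos such as `"epoch"` early instead of silently training with the default.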

### Run Training Example

```bash
# First, generate augmented data
python examples/test_augmentation.py

# Then train classifier
python examples/train_classifier.py
```

### Custom Classifier

Extend `BaseClassifier` to create custom classification strategies:

```python
from rapidfit import BaseClassifier
from rapidfit.types import AugmentResult, Prediction, SeedData

class MyClassifier(BaseClassifier):
    def train(self, data: SeedData | AugmentResult) -> None:
        samples = self._resolve_data(data)
        # Training logic here

    def predict(self, texts: list[str], task: str) -> list[Prediction]:
        # Prediction logic here
        return [{"label": "class", "confidence": 0.95}]

    def save(self, path):
        # Save model
        pass

    def load(self, path):
        # Load model
        pass
```

### Available Classifier Types

| Type | Description |
|------|-------------|
| `MULTIHEAD` | Shared encoder with task-specific classification heads |

## License

MIT
