Metadata-Version: 2.4
Name: fiesta-trainer
Version: 0.1.0
Summary: Fine-tune intfloat/multilingual-e5-large-instruct with LoRA adapters for information-retrieval tasks.
License: MIT License
        
        Copyright (c) 2026 fiesta contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
        
Keywords: nlp,embeddings,information-retrieval,sentence-transformers,lora,peft,multilingual
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.2
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: peft>=0.10
Requires-Dist: transformers>=4.40
Requires-Dist: datasets>=2.18
Requires-Dist: scikit-learn>=1.4
Requires-Dist: pandas>=2.2
Requires-Dist: numpy>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Dynamic: license-file

# multilingual-e5-large-instruct

> Fine-tune [`intfloat/multilingual-e5-large-instruct`](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
> with **LoRA adapters** for your own information-retrieval tasks — in Python 3.14.

---

## Table of Contents

1. [Features](#features)
2. [Prerequisites](#prerequisites)
3. [Installation](#installation)
4. [Quick Start](#quick-start)
5. [Input Data Format](#input-data-format)
6. [API Reference](#api-reference)
7. [Configuration](#configuration)
8. [GPU Setup](#gpu-setup)
9. [Contributing](#contributing)
10. [License](#license)

---

## Features

- **Parameter-efficient fine-tuning** via [PEFT](https://github.com/huggingface/peft) LoRA adapters.
- **Built-in IR evaluation** using `InformationRetrievalEvaluator` (cosine accuracy@10).
- **Cross-platform** checkpoint cleanup with `shutil` (no shell commands).
- **Fully configurable** via dataclasses — no subclassing required.
- **Python 3.14** native type hints throughout.
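
The accuracy@10 metric mentioned above counts a query as a hit when at least one of its relevant documents appears among the 10 highest-scoring results. The following pure-Python sketch illustrates that definition on toy vectors; the helper names (`cosine`, `accuracy_at_k`) are hypothetical and are not part of this package or of `sentence-transformers`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def accuracy_at_k(
    query_vecs: dict[str, list[float]],
    corpus_vecs: dict[str, list[float]],
    relevant_docs: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for qid, qvec in query_vecs.items():
        # Rank every document id by similarity to the query, best first.
        ranked = sorted(
            corpus_vecs,
            key=lambda did: cosine(qvec, corpus_vecs[did]),
            reverse=True,
        )
        if relevant_docs.get(qid, set()) & set(ranked[:k]):
            hits += 1
    return hits / len(query_vecs) if query_vecs else 0.0


# Toy data (illustrative only): two queries, three documents.
query_vecs = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
corpus_vecs = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [-1.0, 0.0]}
relevant_docs = {"q1": {"d1"}, "q2": {"d3"}}

acc_at_1 = accuracy_at_k(query_vecs, corpus_vecs, relevant_docs, k=1)
acc_at_3 = accuracy_at_k(query_vecs, corpus_vecs, relevant_docs, k=3)
print(acc_at_1, acc_at_3)
```

With `k=1` only `q1` finds its relevant document at the top rank; with `k=3` the whole toy corpus is returned, so both queries count as hits.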

---

## Prerequisites

| Requirement | Minimum version |
|-------------|----------------|
| Python      | **3.10**        |
| PyTorch     | **2.2**         |
| CUDA (optional) | 11.8+       |

---

## Installation

### 1 — Install PyTorch (GPU recommended)

Follow the official guide to get the correct wheel for your CUDA version:
👉 <https://pytorch.org/get-started/locally/>

Example for **CUDA 12.1**:
```bash
pip install torch==2.2.0+cu121 --index-url https://download.pytorch.org/whl/cu121
```

CPU-only (no GPU):
```bash
pip install "torch>=2.2"
```

### 2 — Install the package

```bash
pip install fiesta-trainer
```

---

## Quick Start

```python
from fiesta import MultilingualE5LargeInstructPipeline

# Build the pipeline — model checkpoints go to ~/models/fiesta/en/my-kb.en/
pipeline = MultilingualE5LargeInstructPipeline(kb_id="my-kb", lang="en")

# Load your data (see Input Data Format below)
raw_data = [...]

# Train — returns a TrainResult with baseline + final metrics
result = pipeline.train(raw_data)

# Inspect improvements
print("Baseline :", result.baseline)
print("Final    :", result.final_metrics)
print("Delta    :", result.improvement())
```

---

## Input Data Format

Each element in the data list must follow this schema:

```python
{
    "chunk_text": str,          # The passage / chunk to be retrieved
    "docId":      str,          # Unique document identifier
    "questions": [
        {
            "augmented_questions": list[str],   # Paraphrased / augmented queries
            "noise_questions":     list[str],   # Negative / noise queries
        },
        # ... more question groups
    ]
}
```

**Minimal example:**

```python
raw_data = [
    {
        "chunk_text": "Paris is the capital of France.",
        "docId": "doc-001",
        "questions": [
            {
                "augmented_questions": [
                    "What is the capital of France?",
                    "Which city serves as France's capital?",
                ],
                "noise_questions": [
                    "Who painted the Mona Lisa?",
                ],
            }
        ],
    }
]
```
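
Checking records against the schema before training can surface malformed input early. The sketch below is one possible validator; `validate_record` is a hypothetical helper, not part of the package, and treating empty strings or an empty `questions` list as invalid is an assumption rather than a documented rule.

```python
from typing import Any


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems: list[str] = []
    for key in ("chunk_text", "docId"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")

    questions = record.get("questions")
    if not isinstance(questions, list) or not questions:
        problems.append("'questions' must be a non-empty list")
        return problems

    for i, group in enumerate(questions):
        if not isinstance(group, dict):
            problems.append(f"questions[{i}] must be a dict")
            continue
        for key in ("augmented_questions", "noise_questions"):
            value = group.get(key)
            if not isinstance(value, list) or not all(isinstance(q, str) for q in value):
                problems.append(f"questions[{i}]['{key}'] must be a list of strings")
    return problems


good = {
    "chunk_text": "Paris is the capital of France.",
    "docId": "doc-001",
    "questions": [
        {
            "augmented_questions": ["What is the capital of France?"],
            "noise_questions": ["Who painted the Mona Lisa?"],
        }
    ],
}
bad = {"chunk_text": "Paris is the capital of France.", "questions": [{}]}

good_problems = validate_record(good)
bad_problems = validate_record(bad)
print(good_problems)
print(bad_problems)
```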

---

## API Reference

### `MultilingualE5LargeInstructPipeline`

High-level orchestrator — the main entry point.

```python
MultilingualE5LargeInstructPipeline(
    kb_id:                 str,
    lang:                  str,
    base_dir:              str | Path | None = None,   # default: ~/models/
    lora_settings:         LoraSettings      | None = None,
    training_settings:     TrainingSettings  | None = None,
    preprocessing_settings: dict             | None = None,
)
```

#### `.train(data) -> TrainResult`

Runs preprocessing → fine-tuning → checkpoint cleanup.

---

### `MultilingualE5LargeInstructModelling`

Low-level fine-tuning class.

```python
MultilingualE5LargeInstructModelling(
    lora_settings:     LoraSettings     | None = None,
    training_settings: TrainingSettings | None = None,
)
```

#### `.train(train_dataset, test_dataset, evaluator_data, save_dir) -> TrainResult`

---

### `MultilingualE5LargeInstructPreProcessing`

Data preparation class.

```python
MultilingualE5LargeInstructPreProcessing(
    task_description:          str   | None = None,
    test_size_ratio:           float        = 0.1,
    n_test_samples_per_chunk:  int          = 10,
)
```

#### `.preprocess(data) -> tuple[Dataset, Dataset, dict]`
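
The layout of the returned `dict` is not specified here. One plausible shape, sketched below as an assumption, mirrors the `queries`, `corpus`, and `relevant_docs` arguments that `InformationRetrievalEvaluator` accepts. `build_evaluator_data` is a hypothetical helper; it ignores `noise_questions` and the train/test split for brevity.

```python
def build_evaluator_data(raw_data: list[dict]) -> dict[str, dict]:
    """Map raw records to the queries / corpus / relevant_docs layout."""
    queries: dict[str, str] = {}
    corpus: dict[str, str] = {}
    relevant_docs: dict[str, set[str]] = {}

    for record in raw_data:
        doc_id = record["docId"]
        corpus[doc_id] = record["chunk_text"]
        for group in record["questions"]:
            for question in group["augmented_questions"]:
                qid = f"q{len(queries)}"  # synthetic query id (assumption)
                queries[qid] = question
                relevant_docs[qid] = {doc_id}

    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


sample = [
    {
        "chunk_text": "Paris is the capital of France.",
        "docId": "doc-001",
        "questions": [
            {
                "augmented_questions": [
                    "What is the capital of France?",
                    "Which city serves as France's capital?",
                ],
                "noise_questions": ["Who painted the Mona Lisa?"],
            }
        ],
    }
]

evaluator_data = build_evaluator_data(sample)
print(evaluator_data["relevant_docs"])
```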

---

### `LoraSettings`

```python
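from dataclasses import dataclass, field

@dataclass
class LoraSettings:
    r:               int       = 16
    lora_alpha:      int       = 16
    lora_dropout:    float     = 0.0
    bias:            str       = "none"
    # A mutable default must go through default_factory in a dataclass.
    target_modules:  list[str] = field(
        default_factory=lambda: ["query", "key", "value", "dense"]
    )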
```

---

### `TrainingSettings`

```python
@dataclass
class TrainingSettings:
    max_steps:                   int   = 200
    per_device_train_batch_size: int   = 4
    per_device_eval_batch_size:  int   = 32
    learning_rate:               float = 1e-4
    lr_scheduler_type:           str   = "cosine"
    optim:                       str   = "adafactor"
    eval_steps:                  int   = 10
    fp16:                        bool  = True
    early_stopping_patience:     int   = 2
    mini_batch_size:             int   = 128
    evaluator_batch_size:        int   = 32
```

---

### `TrainResult`

```python
@dataclass
class TrainResult:
    baseline:       dict[str, Any]
    final_metrics:  dict[str, Any]

    def improvement(self) -> dict[str, float]: ...
```
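
The body of `improvement()` is elided above. A minimal sketch of one way it could work, assuming it returns the final value minus the baseline for every numeric metric present in both dictionaries, looks like this. `TrainResultSketch` is a stand-in written for illustration, not the package's class.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TrainResultSketch:
    baseline: dict[str, Any]
    final_metrics: dict[str, Any]

    def improvement(self) -> dict[str, float]:
        """Per-metric delta (final - baseline) for numeric metrics found in both dicts."""
        deltas: dict[str, float] = {}
        for key, before in self.baseline.items():
            after = self.final_metrics.get(key)
            if isinstance(before, (int, float)) and isinstance(after, (int, float)):
                deltas[key] = float(after) - float(before)
        return deltas


# Made-up numbers, for illustration only.
result = TrainResultSketch(
    baseline={"cosine_accuracy@10": 0.5, "note": "before fine-tuning"},
    final_metrics={"cosine_accuracy@10": 0.75},
)
delta = result.improvement()
print(delta)
```

Non-numeric entries such as `"note"` are skipped rather than raising, which keeps the method usable when the metric dictionaries carry extra metadata.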

---

## Configuration

### Custom save directory

```python
pipeline = MultilingualE5LargeInstructPipeline(
    kb_id="my-kb",
    lang="pt",
    base_dir="/mnt/storage/models",
)
```

### Custom LoRA settings

```python
from fiesta import LoraSettings, MultilingualE5LargeInstructPipeline

pipeline = MultilingualE5LargeInstructPipeline(
    kb_id="my-kb",
    lang="en",
    lora_settings=LoraSettings(r=32, lora_alpha=32, lora_dropout=0.05),
)
```

### Custom training hyper-parameters

```python
from fiesta import TrainingSettings, MultilingualE5LargeInstructPipeline

pipeline = MultilingualE5LargeInstructPipeline(
    kb_id="my-kb",
    lang="en",
    training_settings=TrainingSettings(max_steps=500, learning_rate=5e-5, fp16=False),
)
```

### Custom task description (preprocessing)

```python
from fiesta import MultilingualE5LargeInstructPipeline

pipeline = MultilingualE5LargeInstructPipeline(
    kb_id="legal-kb",
    lang="en",
    preprocessing_settings={
        "task_description": "Given a legal question, retrieve the relevant clause."
    },
)
```
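
For context, the base model's card describes prefixing each query with a one-sentence instruction in the form `Instruct: ...\nQuery: ...`, while passages are embedded without a prefix. The sketch below shows that template; whether this package applies `task_description` in exactly this way is an assumption, and `format_query` is a hypothetical helper.

```python
def format_query(task_description: str, query: str) -> str:
    """Build an instruction-prefixed query string in the E5-instruct style."""
    return f"Instruct: {task_description}\nQuery: {query}"


task = "Given a legal question, retrieve the relevant clause."
formatted = format_query(task, "Can the tenant sublet the apartment?")
print(formatted)
```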

---

## GPU Setup

This package benefits significantly from a CUDA-capable GPU.
When a GPU is detected, the model is automatically moved to it.

| Scenario              | Behaviour                             |
|-----------------------|---------------------------------------|
| CUDA GPU available    | `device="cuda"` (automatic)           |
| No GPU / CPU only     | `device="cpu"` (automatic, slower)    |

To check which device will be used:
```python
import torch
print("CUDA available:", torch.cuda.is_available())
```
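
The automatic selection in the table above can be approximated as follows. This is a sketch, not the package's code: `pick_device` is a hypothetical helper, and falling back to CPU when PyTorch is not importable is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to accelerate.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# fp16 mixed precision generally needs a GPU, so it may be worth
# passing fp16=False in TrainingSettings when this is False.
use_fp16 = device == "cuda"
print(device, use_fp16)
```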

For detailed PyTorch + CUDA install instructions:
👉 <https://pytorch.org/get-started/locally/>

---

## Contributing

```bash
git clone https://github.com/your-org/multilingual-e5-large-instruct.git
cd multilingual-e5-large-instruct
pip install -e ".[dev]"
pytest
```

Please open an issue before submitting a PR.

---

## License

MIT © fiesta contributors

