Metadata-Version: 2.1
Name: fedops-dataset
Version: 0.3.6
Summary: Local-first dataset toolkit for multimodal federated learning artifacts (partition/feature/simulation)
Author: FedOps Dataset Team
License: MIT
Keywords: federated-learning,multimodal,dataset,huggingface,fedops
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: huggingface-hub>=0.25.0
Requires-Dist: typing-extensions>=4.9.0

# fedops-dataset

`fedops-dataset` is a local-first dataset toolkit for multimodal federated learning (FedMS2-v8 style).

It helps you:
- fetch raw multimodal datasets
- validate dataset roots and expected files
- generate FL artifacts (partition, feature, simulation)
- load per-client records in Python for Simulation and Deployment workflows

Python requirement: `>=3.8`

## Who This Is For

- FL researchers working with multimodal datasets
- engineers running FedMS2-style experiments repeatedly
- teams that want reproducible `alpha / ps / pm` artifact generation

## What This Package Covers

1. Raw data bootstrap:
- `fedops-dataset fetch-raw`

2. Raw path validation:
- `fedops-dataset check-raw-datasets`

3. Artifact generation:
- `fedops-dataset create-v8`

4. Runtime loading API:
- `FedOpsLocalDataset`

## Supported Datasets

- `crema_d`
- `hateful_memes`
- `ptb-xl`

Default clients:
- `crema_d`: 40
- `hateful_memes`: 40
- `ptb-xl`: 20

## Install

```bash
pip install fedops-dataset
```

## 5-Minute Quickstart

### 1) Define paths

```bash
export REPO_ROOT=/path/to/fed-multimodal
export DATA_ROOT=$REPO_ROOT/fed_multimodal/data
export OUTPUT_DIR=$REPO_ROOT/fed_multimodal/output
```

### 2) Fetch raw data

```bash
# all supported datasets
fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"
```

Notes:
- By default, `hateful_memes` is fetched with git from the public repository:
  - `https://huggingface.co/datasets/neuralcatcher/hateful_memes`

### 3) Validate raw roots

```bash
fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"
```

### 4) Generate artifacts (example: hateful_memes)

```bash
# dry run first
fedops-dataset create-v8 \
  --dataset hateful_memes \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT" \
  --dry-run

# real run
fedops-dataset create-v8 \
  --dataset hateful_memes \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT"
```

### 5) Load client records in Python

```python
from fedops_dataset import FedOpsLocalDataset

ds = FedOpsLocalDataset(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)

print(ds.is_prepared())
client0 = ds.client_records(0, use_simulation=True)
print(len(client0))
```

## Parameter Semantics

- `alpha`: partition heterogeneity control
- `sample_missing_rate` (`ps`): sample-level missingness
- `modality_missing_rate` (`pm`): modality-level missingness

Token naming examples used in artifact filenames:
- `alpha=0.1` -> `alpha01`
- `alpha=5.0` -> `alpha50`
- `alpha=50` -> `alpha50`

So `5.0` and `50` intentionally resolve to the same alpha token.
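
The three examples above are consistent with simply dropping the decimal point from the value as written. A minimal sketch of that rule follows; `alpha_token` is a hypothetical helper written for illustration, not the package's own function, and the real implementation may format values outside the documented examples differently.

```python
from typing import Union


def alpha_token(alpha: Union[int, float, str]) -> str:
    """Hypothetical helper: build an alpha token by dropping the decimal point."""
    # str(0.1) == "0.1", str(5.0) == "5.0", str(50) == "50"
    digits = str(alpha).replace(".", "")
    return f"alpha{digits}"


# The three values documented above.
for value in (0.1, 5.0, 50):
    print(value, "->", alpha_token(value))
```

Under this rule `5.0` and `50` both reduce to the digits `50`, which matches the shared token described above.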

## CLI Guide

### `fetch-raw`

Use this to prepare raw datasets under your data root.

```bash
fedops-dataset fetch-raw --dataset all --data-root "$DATA_ROOT"
```

#### Hateful Memes fetch modes

1. Default public git mode:

```bash
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-fetch-method git \
  --hateful-memes-repo-id neuralcatcher/hateful_memes
```

2. HF snapshot mode (API-based):

```bash
export HF_TOKEN="<optional_token>"  # optional; replace the placeholder with your token
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-fetch-method hf-snapshot
```

3. Archive URL mode:

```bash
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-fetch-method archive \
  --hateful-memes-archive-url "https://<host>/hateful_memes.zip"
```

4. Manual prepared folder mode:

```bash
fedops-dataset fetch-raw \
  --dataset hateful_memes \
  --data-root "$DATA_ROOT" \
  --hateful-memes-source-dir /path/to/hateful_memes_source \
  --hateful-memes-mode symlink
```

### `check-raw-datasets`

```bash
fedops-dataset check-raw-datasets --data-root "$DATA_ROOT"
```

Use this before `create-v8` to catch path/file issues early.

### `create-v8`

Generates:
- partition JSON
- feature directories
- simulation JSON

```bash
fedops-dataset create-v8 \
  --dataset crema_d \
  --alpha 50 \
  --sample-missing-rate 0.2 \
  --modality-missing-rate 0.2 \
  --repo-root "$REPO_ROOT" \
  --data-root "$DATA_ROOT"
```

Optional controls:
- `--no-partition`
- `--no-features`
- `--no-simulation`
- `--num-clients <N>`
- `--force`
- `--dry-run`
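
When the same command has to be repeated over several `alpha / ps / pm` combinations, it may be convenient to build the invocations in Python. The sketch below only assembles argument lists from the flags shown above; `build_create_v8_command` is a hypothetical helper, and the grid values and paths are placeholders rather than recommended settings.

```python
import itertools
import shlex
from typing import List


def build_create_v8_command(
    dataset: str,
    alpha: float,
    ps: float,
    pm: float,
    repo_root: str,
    data_root: str,
    dry_run: bool = True,
) -> List[str]:
    """Hypothetical helper: assemble one `create-v8` invocation as an argv list."""
    cmd = [
        "fedops-dataset", "create-v8",
        "--dataset", dataset,
        "--alpha", str(alpha),
        "--sample-missing-rate", str(ps),
        "--modality-missing-rate", str(pm),
        "--repo-root", repo_root,
        "--data-root", data_root,
    ]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


# Placeholder grid and paths, for illustration only.
REPO_ROOT = "/path/to/fed-multimodal"
DATA_ROOT = "/path/to/fed-multimodal/fed_multimodal/data"
alphas = [0.1, 50]
missing_rates = [0.2, 0.5]

commands = [
    build_create_v8_command("crema_d", a, ps, pm, REPO_ROOT, DATA_ROOT)
    for a, ps, pm in itertools.product(alphas, missing_rates, missing_rates)
]

for cmd in commands:
    # Print only; to execute, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping `dry_run=True` as the default mirrors the quickstart's advice to do a dry run before the real run.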

## Python API Guide

### `FedOpsLocalDataset`

### Direct usage

```python
from fedops_dataset import FedOpsLocalDataset

ds = FedOpsLocalDataset(
    dataset="crema_d",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    repo_root="/path/to/fed-multimodal",
    data_root="/path/to/fed-multimodal/fed_multimodal/data",
)

if not ds.is_prepared():
    ds.prepare(dry_run=False)

partition = ds.load_partition()
simulation = ds.load_simulation()
records = ds.client_records(0, use_simulation=True)
```
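
To see how records are spread across clients, the `client_records` call shown above can be repeated per client. This is a sketch under two assumptions that the examples suggest but do not state: client ids run from `0` to `num_clients - 1`, and `client_records` returns something that supports `len()`. `records_per_client` is a hypothetical helper, not part of the package.

```python
from typing import Any, Dict

# Default client counts listed under "Supported Datasets".
DEFAULT_NUM_CLIENTS = {"crema_d": 40, "hateful_memes": 40, "ptb-xl": 20}


def records_per_client(ds: Any, num_clients: int) -> Dict[int, int]:
    """Hypothetical helper: map each client id to its number of records."""
    return {
        # Assumes 0-based, contiguous client ids.
        client_id: len(ds.client_records(client_id, use_simulation=True))
        for client_id in range(num_clients)
    }
```

With a prepared dataset object, `records_per_client(ds, DEFAULT_NUM_CLIENTS["crema_d"])` would return one count per client.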

### Runtime config usage (Flower style)

```python
from fedops_dataset import FedOpsLocalDataset

run_config = {
    "repo-root": "/path/to/fed-multimodal",
    "data-root": "/path/to/fed-multimodal/fed_multimodal/data",
}

# Simulation mode
node_config = {"partition-id": 0, "num-partitions": 40}

ds = FedOpsLocalDataset.from_runtime_config(
    dataset="hateful_memes",
    alpha=50,
    sample_missing_rate=0.2,
    modality_missing_rate=0.2,
    run_config=run_config,
    node_config=node_config,
)

mode = ds.node_mode(node_config)  # simulation
records = ds.client_records_from_node_config(node_config, use_simulation=True)
```

## Simulation vs Deployment

Simulation mode:
- detected when `node_config` contains both `partition-id` and `num-partitions`
- `partition-id` selects which client's records are loaded

Deployment mode:
- if `node_config` contains `data-path`, that path is used as the data root
- each node can therefore point to its own local storage
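
The rules above can be sketched as two small functions. Both are hypothetical stand-ins written for illustration, not the package's `node_mode` implementation. The sketch assumes that any `node_config` without both partition keys counts as deployment, and that the run-level `data-root` is the fallback when no `data-path` is given.

```python
from typing import Mapping, Optional


def detect_mode(node_config: Mapping[str, object]) -> str:
    """Hypothetical stand-in for the mode check described above."""
    # Simulation: both partition keys are present.
    if "partition-id" in node_config and "num-partitions" in node_config:
        return "simulation"
    # Assumption: everything else is treated as deployment.
    return "deployment"


def resolve_data_root(
    node_config: Mapping[str, object],
    run_config: Mapping[str, object],
) -> Optional[str]:
    """Hypothetical helper: choose the data root for one node."""
    # Deployment: a node-level "data-path" takes the place of the shared root.
    if "data-path" in node_config:
        return str(node_config["data-path"])
    value = run_config.get("data-root")
    return None if value is None else str(value)


run_config = {"data-root": "/path/to/fed-multimodal/fed_multimodal/data"}
print(detect_mode({"partition-id": 0, "num-partitions": 40}))
print(resolve_data_root({"data-path": "/mnt/node_a/data"}, run_config))
```

The path `/mnt/node_a/data` is a placeholder showing a node that points to its own storage.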

## Environment Variables (Optional)

```bash
export FEDOPS_REPO_ROOT=/path/to/fed-multimodal
export FEDOPS_OUTPUT_DIR=/path/to/fed-multimodal/fed_multimodal/output
export FEDOPS_DATA_ROOT=/path/to/fed-multimodal/fed_multimodal/data
export HATEFUL_MEMES_ROOT=/path/to/fed-multimodal/fed_multimodal/data/hateful_memes
```

You can use env vars, CLI args, or runtime config keys. No hardcoded path is required.
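
In your own scripts, one way to combine these sources is to prefer an explicitly passed value and fall back to the environment variable. The sketch below illustrates that pattern only; `resolve_path` is a hypothetical helper, and the precedence it applies is an assumption, not a documented rule of the package.

```python
import os
from typing import Mapping, Optional


def resolve_path(
    explicit: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical helper: prefer an explicit value, else fall back to an env var."""
    if explicit:
        return explicit
    # Returns None when the variable is not set.
    return env.get(env_name)


repo_root = resolve_path(None, "FEDOPS_REPO_ROOT")
data_root = resolve_path(None, "FEDOPS_DATA_ROOT")
print(repo_root, data_root)
```

The resolved values can then be passed as `repo_root` and `data_root` to `FedOpsLocalDataset`.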

## Troubleshooting

1. `partition file not found`:
- run `create-v8` first
- verify `alpha/ps/pm` values match existing artifact names

2. `hateful_memes` fetch fails in git mode:
- ensure `git` and `git-lfs` are installed
- use `hf-snapshot` mode as fallback

3. Raw dataset validation errors:
- run `check-raw-datasets` and follow printed hints

4. Alpha confusion (`5.0` vs `50`):
- both map to token `alpha50`
- this is intentional for compatibility with existing FedMS2 artifacts
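
For item 2, a quick way to confirm that `git` and `git-lfs` are on `PATH` before fetching is the standard library's `shutil.which`. This is a small convenience sketch, not part of the package; `missing_tools` is a hypothetical helper.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(["git", "git-lfs"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("git and git-lfs are available")
```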

## FAQ

1. Do I always need to pass `--hateful-memes-root`?
- No. By default it resolves to `<data-root>/hateful_memes`.

2. Can I use this package without Hugging Face uploads?
- Yes. The local-first workflow is the primary mode.

3. Is `FedOpsDatasetClient` still available?
- Yes. Use it if you also host artifacts in an HF dataset repo.

## Maintainer Release

```bash
cd fedops_dataset
python -m build
python -m twine check dist/*
python -m twine upload dist/*
```
