Metadata-Version: 2.3
Name: easy-dataset-share
Version: 0.2.1
Summary: CLI tool to responsibly share datasets by gzipping, canarying, and tracking provenance.
License: Other/Proprietary
Keywords: dataset,sharing,encryption,canary,robots,cli
Author: Edward Turner
Author-email: edward.turner01@outlook.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Dist: click (>=8.0,<9.0)
Requires-Dist: cryptography (>=41.0.0,<42.0.0)
Project-URL: Homepage, https://github.com/Responsible-Dataset-Sharing/easy-dataset-share
Project-URL: Repository, https://github.com/Responsible-Dataset-Sharing/easy-dataset-share
Description-Content-Type: text/markdown

# Easy Dataset Share

A CLI tool that helps AI researchers share datasets responsibly. Prevents evaluation contamination by making datasets easy for researchers to use but hard for automated scrapers to ingest.

## Features
- **Canary markers**: Unique identifiers to detect if your dataset was used for training
- **Hash verification**: Ensures dataset integrity through SHA256 hashing
- **Protection layers**: ZIP compression, optional encryption, robots.txt
- **Clean removal**: Remove all protection while preserving original data

## Installation

```bash
pip install easy-dataset-share
```

For development:
```bash
git clone <repo-url>
cd easy-dataset-share
pip install -e .
git config core.hooksPath .githooks
```

## Quick Start

### Protect a dataset
```bash
easy-dataset-share magic-protect-dir /path/to/dataset -p your-password
```

### Unprotect and clean
```bash
easy-dataset-share magic-unprotect-dir dataset.zip -p your-password --remove-canaries
```

### Verify integrity
```bash
easy-dataset-share hash /path/to/dataset
```

## Commands

**Protection**
- `magic-protect-dir` - Add canaries, robots.txt, zip and encrypt
- `magic-unprotect-dir` - Extract and optionally remove canaries

**Verification**
- `hash` - Hash directory contents (excludes canaries)
- `get-canary-string` - Get dataset's unique identifier

**Individual operations**
- `add-canary` / `remove-canary` - Manage canary files
- `add-robots` / `add-tos` - Add usage restrictions

## How it Works

1. **Hash** original dataset for integrity baseline
2. **Add** canary markers throughout the dataset
3. **Package** with robots.txt and optional encryption
4. **Verify** integrity when unprotecting (canaries removed, data unchanged)

## Example Workflow

```bash
# Protect
easy-dataset-share magic-protect-dir my_dataset -p secret123

# Share dataset.zip publicly

# Recipients unprotect and remove canaries
easy-dataset-share magic-unprotect-dir dataset.zip -p secret123 --remove-canaries
# Output shows: "📊 Dataset hash: abc123..." (matches original)
```

Use `-v` for verbose output to see hashing details and canary operations.
