Metadata-Version: 2.3
Name: easy-dataset-share
Version: 0.3.1
Summary: CLI tool to responsibly share datasets by gzipping, canarying, and tracking provenance.
License: Other/Proprietary
Keywords: dataset,sharing,encryption,canary,robots,cli
Author: Edward Turner
Author-email: edward.turner01@outlook.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Dist: click (>=8.0,<9.0)
Requires-Dist: cryptography (>=41.0.0,<42.0.0)
Requires-Dist: requests (>=2.28.0,<3.0.0)
Project-URL: Homepage, https://github.com/Responsible-Dataset-Sharing/easy-dataset-share
Project-URL: Repository, https://github.com/Responsible-Dataset-Sharing/easy-dataset-share
Description-Content-Type: text/markdown

# Easy Dataset Share

A CLI tool that helps AI researchers share datasets responsibly. It aims to reduce evaluation contamination by making datasets easy for researchers to use but hard for automated scrapers to ingest.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Other/Proprietary](https://img.shields.io/badge/License-Other%2FProprietary-red.svg)](LICENSE)
[![PyPI](https://img.shields.io/badge/PyPI-easy--dataset--share-blue.svg)](https://pypi.org/project/easy-dataset-share/)
[![GitHub](https://img.shields.io/badge/GitHub-Responsible%20Dataset%20Sharing-green.svg)](https://github.com/Responsible-Dataset-Sharing/easy-dataset-share)

## Features
- **Canary markers**: Unique identifiers that help detect whether your dataset was used for training
- **Hash verification**: Checks dataset integrity through SHA-256 hashing
- **Protection layers**: ZIP compression, optional encryption, robots.txt
- **Clean removal**: Remove all protection while preserving original data
- **Web hosting** (optional): Deploy a protected download site with CAPTCHA - see [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md)

## Installation

```bash
pip install easy-dataset-share
```

## Quick Start

### Protect a dataset
```bash
easy-dataset-share magic-protect-dir /path/to/dataset -p your-password
```

### Unprotect and clean
```bash
easy-dataset-share magic-unprotect-dir dataset.zip.enc -p your-password --remove-canaries
```

### Verify integrity
```bash
easy-dataset-share hash /path/to/dataset
```

## Options
- `-p, --password` - Password for encryption (optional)
- `-o, --output` - Output file path (default: `<dir>.zip` or `<dir>.zip.enc`)
- `-c, --num-canary-files` - Number of canary files to create (default: 1)
- `-e, --embed-canaries` - Embed canaries in existing files (default: create separate files)
- `-a, --allow-crawling` - Allow web crawling in robots.txt (default: disallow all)
- `-u, --user-agent` - User-agent to target in robots.txt (default: *)
- `-on, --organization-name` - Organization name for TOS (default: "Example Corp")
- `-dn, --dataset-name` - Dataset name for TOS (default: "Example Dataset")
- `-ce, --contact-email` - Contact email for TOS (default: "support@example.com")
- `--no-tos` - Skip adding terms of service file
- `--no-gitignore` - Skip adding directory to .gitignore (default: auto-add)
- `-v, --verbose` - Enable verbose output
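
To illustrate what the `--allow-crawling` and `--user-agent` options control, here is a minimal Python sketch of the two kinds of `robots.txt` rules they correspond to. The function names `build_robots_txt` and `can_fetch` are hypothetical and are not part of this tool; the file the CLI actually writes may contain more than this.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(user_agent: str = "*", allow_crawling: bool = False) -> str:
    """Return robots.txt text that allows or disallows all paths for one user-agent.

    Hypothetical helper for illustration; not the tool's own implementation.
    """
    # An empty Disallow value permits everything; "Disallow: /" blocks everything.
    rule = "Disallow:" if allow_crawling else "Disallow: /"
    return f"User-agent: {user_agent}\n{rule}\n"


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Check how a crawler that honors robots.txt would interpret the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Default behavior described above: disallow all crawlers.
    print(build_robots_txt())
```

Note that `robots.txt` is advisory: it only affects crawlers that choose to honor it, which is why it is combined with the other protection layers.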

## How it Works
1. **Hash** original dataset for integrity baseline
2. **Add** canary markers throughout the dataset
3. **Package** with robots.txt and optional encryption
4. **Verify** integrity when unprotecting (canaries removed, data unchanged)
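
The steps above can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the tool's actual implementation: the helper names (`hash_directory`, `add_canary`, `remove_canaries`), the `canary_` file prefix, and the exact hashing scheme are all hypothetical.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path

CANARY_PREFIX = "canary_"  # hypothetical naming convention for canary files


def hash_directory(root: Path) -> str:
    """SHA-256 over each file's relative path and contents, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between the path and the file contents
        digest.update(path.read_bytes())
    return digest.hexdigest()


def add_canary(root: Path) -> Path:
    """Write one separate canary file containing a unique identifier."""
    canary_id = uuid.uuid4().hex
    canary = root / f"{CANARY_PREFIX}{canary_id}.txt"
    canary.write_text(f"CANARY {canary_id}\n", encoding="utf-8")
    return canary


def remove_canaries(root: Path) -> int:
    """Delete every canary file and return how many were removed."""
    canaries = list(root.rglob(f"{CANARY_PREFIX}*.txt"))
    for path in canaries:
        path.unlink()
    return len(canaries)


def demo() -> bool:
    """Hash, add a canary, remove it, and confirm the hash matches the baseline."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data.csv").write_text("id,label\n1,cat\n", encoding="utf-8")
        baseline = hash_directory(root)   # step 1: integrity baseline
        add_canary(root)                  # step 2: add a canary marker
        protected = hash_directory(root)  # differs while the canary is present
        remove_canaries(root)             # step 4: remove canaries
        restored = hash_directory(root)   # should equal the baseline again
        return baseline != protected and baseline == restored


if __name__ == "__main__":
    print("round trip ok:", demo())
```

The point of the round trip is that the hash taken before protection serves as the reference: once canaries are removed, a matching hash indicates the original data is unchanged.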

## Example Workflow
```bash
# Protect
easy-dataset-share magic-protect-dir my_dataset -p secret123

# Share my_dataset.zip.enc publicly (the default output name when a password is set)

# Recipients unprotect and remove canaries
easy-dataset-share magic-unprotect-dir my_dataset.zip.enc -p secret123 --remove-canaries
# Output shows: "📊 Dataset hash: abc123..." (matches original)
```

Use `-v` for verbose output to see hashing details and canary operations.

## Hosting with Anti-Scraper Protection
For datasets hosted outside of Hugging Face, we **strongly recommend** using [Cloudflare Turnstile](https://developers.cloudflare.com/turnstile/get-started/) to add an additional layer of protection against automated AI scrapers.

**Why Cloudflare Turnstile?**
- **Human verification**: Requires user interaction to access downloads
- **Bot detection**: Advanced algorithms identify and block automated requests
- **Privacy-focused**: No tracking cookies or invasive data collection
- **Easy integration**: Simple JavaScript widget with server-side verification
- **Free tier available**: Generous limits for research datasets


This layered approach (dataset protection plus hosting protection) raises the barrier to automated data harvesting while keeping the dataset accessible to legitimate researchers.
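
As a rough sketch of the server-side verification step, the snippet below posts a Turnstile token to Cloudflare's `siteverify` endpoint and checks the `success` field of the JSON reply. The function names and the injectable `post` parameter are assumptions made for illustration; this is not code from this project or from [WEB_HOSTING_GUIDE.md](WEB_HOSTING_GUIDE.md), and a real deployment should follow the Cloudflare documentation.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def _post_form(url: str, fields: dict[str, str]) -> dict:
    """POST form-encoded fields and decode the JSON reply."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def turnstile_token_is_valid(
    secret_key: str,
    token: str,
    post: Callable[[str, dict[str, str]], dict] = _post_form,
) -> bool:
    """Return True only if the verification reply reports success.

    `secret_key` is the site's Turnstile secret; `token` is the value the
    widget submitted with the download request. `post` is injectable so the
    check can be exercised without network access (a hypothetical design choice).
    """
    if not token:
        return False  # no token submitted: treat the request as unverified
    result = post(SITEVERIFY_URL, {"secret": secret_key, "response": token})
    return bool(result.get("success"))
```

A download handler would call `turnstile_token_is_valid(...)` before serving the archive and reject the request when it returns `False`.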

# Maintenance + Development
This is intended to be a collaborative community project. Pull requests to improve this repo are welcome!

For development:
```bash
git clone https://github.com/Responsible-Dataset-Sharing/easy-dataset-share.git
cd easy-dataset-share
pip install -e .
git config core.hooksPath .githooks
```

## Current Maintainers
* Dipika Khullar <dkhullar98@gmail.com>
* Edward Turner <edward.turner01@outlook.com>
* Roy Rinberg <royrinberg@gmail.com>

