Metadata-Version: 2.4
Name: marco-dvcs
Version: 0.1.20
Summary: A minimal dataset versioning system for text data with a focus on reproducibility.
Home-page: https://github.com/Team-Marco-ACM/marco-package
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: Flask
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Marco Dataset Versioning System

A minimal dataset versioning system for text data with a strong focus on reproducibility and transparency. Treat your text datasets like code — immutable, versioned, reproducible, and explainable.

Marco acts as a lightweight Python library, meaning you can initialize it in *any* machine learning project folder to safely version and preprocess your datasets without altering your original files.

## 🚀 Installation (Linux / MacOS)

On modern Linux distributions (such as Arch Linux and Ubuntu 23.04+), the system Python is marked as externally managed (PEP 668), so `pip` refuses to install packages into it. Install Marco in a virtual environment to avoid conflicts with your system packages.

Follow these steps to safely install Marco into your ML project:

1. **Clone this repository** to your local machine:
   ```bash
   git clone https://github.com/your-username/marco.git
   cd marco
   ```

2. **Navigate to the ML project folder** where you want to train your model (e.g. your bag-of-words project):
   ```bash
   cd ~/projects/my-bag-of-words-model
   ```

3. **Create and activate a Python Virtual Environment**:
   ```bash
   # Create a virtual environment named 'venv'
   python3 -m venv venv
   
   # Activate it (You must do this every time you open a new terminal in this folder)
   source venv/bin/activate
   ```
   *(You should now see `(venv)` at the start of your terminal prompt!)*

4. **Install Marco**:
   ```bash
   # Point pip to the directory where you cloned the marco repository
   pip install -e /path/to/marco
   ```

---

## 🛠️ Usage Guide

Once `marco` is installed in your virtual environment, you have access to the full CLI!

### 1. Initialize a Repository
Initialize Marco tracking in your current directory. This creates a `.marco/` data versioning environment specific to that project.
```bash
marco init
```

### 2. Create an Immutable Version 
Upload a text, CSV, or TSV dataset to create an immutable version. Marco computes a SHA-256 hash over the raw data and the preprocessing configuration, so a change to either one produces a different hash.

**Interactive Mode:**
If you don't supply a configuration file, Marco will interactively guide you through building the preprocessing pipeline (lowercasing, tokenization, stopword removal, deduplication).
```bash
marco upload my_dataset.csv -t v1-raw
```

**Config Mode:**
```bash
marco upload my_dataset.csv -c my_config.json -t v1-processed
```
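
As a rough illustration of the hashing idea described in this section, the sketch below derives a single SHA-256 digest from the raw file bytes and the preprocessing configuration. The helper name `compute_version_id`, the config keys, and the way the two inputs are combined are assumptions made for this example; Marco's internal scheme may differ.

```python
import hashlib
import json


def compute_version_id(raw_bytes: bytes, config: dict) -> str:
    """Return a hex SHA-256 digest over the raw data and the config.

    Hypothetical helper for illustration; not part of Marco's documented API.
    """
    # Serialize the config deterministically so key order cannot change the hash.
    config_bytes = json.dumps(
        config, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")

    hasher = hashlib.sha256()
    # Length-prefix the raw data so the boundary between both inputs is unambiguous.
    hasher.update(len(raw_bytes).to_bytes(8, "big"))
    hasher.update(raw_bytes)
    hasher.update(config_bytes)
    return hasher.hexdigest()


raw = b"label\ttext\npos\thello world\n"
config_a = {"lowercase": True, "deduplicate": False}  # hypothetical config keys
config_b = {"deduplicate": False, "lowercase": True}  # same settings, other key order

id_a = compute_version_id(raw, config_a)
id_b = compute_version_id(raw, config_b)
id_c = compute_version_id(raw, {"lowercase": False, "deduplicate": False})
```

Here `id_a` and `id_b` are identical because the config is serialized with sorted keys, while `id_c` differs because one setting changed.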

### 3. List Versions
View all the versions you've created, along with their tags and timestamps.
```bash
marco list
```

### 4. Restore/Checkout Data
Extract the processed dataset from Marco's storage back into your active workspace to use for model training.
```bash
marco restore v1-processed -o ./training_data.tsv
```

### 5. Start the Interactive Dashboard (New!)
Marco includes a stunning, built-in Glassmorphism React UI to visualize your dataset evolution.
```bash
marco generate-web
```
*(This will automatically open your browser to `http://localhost:7654` and load your `.marco` repository).*

**Dashboard Features:**
- **Interactive Lineage Tree**: View the exact Git-style chronological history of your datasets in a visual tree.
- **Auto-Compare Mode**: Clicking any dataset instantly fetches its parent and calculates how the data changed (e.g. "+16% tokens", "-5% documents").
- **Red/Green DAG Tracking**: Process nodes in the graph get color-coded borders (red for token increases, green for token drops) so you can see at a glance how each step changed the token count.
- **Alias Tagging**: Your custom `-t` tags (like `v1-raw`) are shown as badges in the UI so you never lose track of hashes.

### 6. Export / Import Versions
Easily share dataset versions with teammates by packing them into `.tar.gz` files.
```bash
# Export version 'v1-raw' to the 'exports' folder
marco export v1-raw ./exports/

# Import an archive sent to you by a coworker
marco import ./exports/marco_version_e5e0b767.tar.gz
```

### 7. Delete Versions
Delete a dataset version to recover disk space. Marco updates the lineage (the `parents` history) of any descendant versions so that your Git-style history tree stays intact.
```bash
marco delete v1-raw
# or
marco rm e5e0b767
```
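
One way to keep the history tree connected after a delete is to reattach the children of the removed version to that version's own parents. The sketch below shows this on a plain in-memory mapping; the function name and the data model are assumptions for illustration, and Marco's actual manifest handling may work differently.

```python
from typing import Dict, List


def delete_version(parents: Dict[str, List[str]], target: str) -> Dict[str, List[str]]:
    """Remove `target` and reattach its children to its own parents.

    `parents` maps each version id to the list of its parent ids.
    Hypothetical in-memory model, not Marco's real storage layout.
    """
    if target not in parents:
        raise KeyError(f"unknown version: {target}")

    grandparents = parents[target]
    updated: Dict[str, List[str]] = {}
    for version, version_parents in parents.items():
        if version == target:
            continue  # drop the deleted version itself
        new_parents: List[str] = []
        for parent in version_parents:
            # Replace a reference to the deleted version with its parents.
            replacements = grandparents if parent == target else [parent]
            for replacement in replacements:
                if replacement not in new_parents:  # avoid duplicate edges
                    new_parents.append(replacement)
        updated[version] = new_parents
    return updated


history = {
    "a1": [],      # root version
    "b2": ["a1"],  # derived from a1
    "c3": ["b2"],  # derived from b2
}
after = delete_version(history, "b2")
```

After the call, `c3` points directly at `a1`, so the chain from root to leaf has no gap.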

---

## 🔬 Reproducibility Proof

Marco does not just store a dataset version; it lets you check that the version is **reproducible**. The `marco/reproducibility_proof/` module verifies that re-running the exact same pipeline on the same raw data produces a byte-identical result.

### 8. Verify a Single Version

Re-runs the full preprocessing pipeline from scratch and compares the freshly computed output hash to the stored hash in the manifest.

```bash
marco verify v1-raw
# or with a version hash
marco verify fe44032c
```

**Example output:**
```
🔬 Verifying version: fe44032c70164a71...

  Pipeline Step Verification:
    ✅  step_1 normalize_newlines
    ✅  step_2 lowercase

  Stored  output_hash: ed610672ea28e065...
  Recomputed   hash:   ed610672ea28e065...

  ✅ VERIFIED — output hash matches. Version is reproducible.
```

- On **PASS**: the manifest is updated with `"verified": true`, a timestamp, the current environment snapshot, and per-step hashes for future comparisons.
- On **FAIL**: a `verification_report.json` is written inside the version folder with a row-level delta showing exactly which rows differ. Exit code is `1`.
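
The environment snapshot mentioned above can be pictured as a small dictionary of version strings that is compared against the one stored at creation time. The following is a minimal sketch under that assumption; the function names and field names are hypothetical, and only standard-library calls are used.

```python
import platform
from importlib import metadata
from typing import Dict, List, Sequence


def environment_snapshot(packages: Sequence[str] = ("numpy", "pandas")) -> Dict[str, str]:
    """Collect interpreter, platform, and package versions."""
    snapshot = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            snapshot[f"{name}_version"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[f"{name}_version"] = "not installed"
    return snapshot


def environment_warnings(stored: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """List every field whose value differs between two snapshots."""
    warnings = []
    for key in sorted(set(stored) | set(current)):
        old, new = stored.get(key), current.get(key)
        if old != new:
            warnings.append(f"{key} changed: {old!r} → {new!r}")
    return warnings


current = environment_snapshot()
# Pretend the version was originally created under a different interpreter.
stored = dict(current, python_version="3.10")
messages = environment_warnings(stored, current)
```

Comparing a snapshot with itself yields no warnings; here `messages` holds a single entry for the changed `python_version` field.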

---

### 9. Verify All Versions at Once

Audit every version in the repository and get a clean summary table.

```bash
marco verify-all
```

**Example output:**
```
──────────────────────────────────────────────────────────────────────
  Marco Reproducibility Audit — 3 version(s) found
──────────────────────────────────────────────────────────────────────
🔬 Verifying version: a3f9c1d2...  ✅ VERIFIED
🔬 Verifying version: fe44032c...  ✅ VERIFIED
🔬 Verifying version: 83f77823...  ❌ FAIL
──────────────────────────────────────────────────────────────────────
  Summary: 2/3 versions passed.
──────────────────────────────────────────────────────────────────────
```
The command returns exit code `1` if any version fails, which makes it easy to use in CI/CD pipelines.
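
Because the audit result is carried by the exit code, a CI job can wrap the command in a few lines of Python. This is a sketch only: `run_audit` and `main` are hypothetical helper names, and the script assumes `marco` is on the `PATH` of the CI runner.

```python
import subprocess
import sys
from typing import List


def run_audit(command: List[str]) -> bool:
    """Run an audit command and report whether it exited with code 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    # Forward the output so the audit table shows up in the CI log.
    sys.stdout.write(completed.stdout)
    sys.stderr.write(completed.stderr)
    return completed.returncode == 0


def main() -> int:
    # A non-zero exit code from `marco verify-all` means at least one version failed.
    ok = run_audit(["marco", "verify-all"])
    return 0 if ok else 1
```

A CI script would finish with `sys.exit(main())` so that a failed audit also fails the job.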

---

### How the 7-Layer Reproducibility Engine Works

| # | Guarantee | How It Works |
|---|---|---|
| 1 | **Step-by-step hash verification** | Hashes the output after every individual DAG step. Pinpoints exactly which step introduced non-determinism (`✅ step_1`, `❌ step_2`). |
| 2 | **Byte-exact canonical enforcer** | Before hashing, data is serialized into a strictly defined canonical TSV format: Unix line endings only, fixed column order (`label`, `text`, `n_tokens`), UTF-8 encoding. Prevents false `FAIL` results caused by OS differences (Windows CRLF vs Linux LF). |
| 3 | **Row-level delta on FAIL** | When a hash mismatch occurs, computes exactly which rows were added or removed between the stored output and fresh recompute, saved in `verification_report.json`. |
| 4 | **Permanent audit trail** | Every verification run (PASS or FAIL) is appended to `.marco/verification_log.jsonl` with a timestamp, result, and full environment snapshot, giving you an append-only history of all checks. |
| 5 | **Environment fingerprinting** | Stores Python version, NumPy/Pandas versions, and platform info in the manifest at verification time. Warns on re-verification if the environment changed: `"Python 3.10 → 3.14 (may explain hash mismatch)"`. |
| 6 | **Stochastic operation detector** | Scans the pipeline DAG config for non-deterministic functions (e.g. `shuffle`, `random_sample`) that lack a `seed` parameter. Warns before verifying so expectations are set correctly. |
| 7 | **Manifest integrity update** | On PASS: sets `"verified": true` in `manifest.json` and persists per-step hashes. On FAIL: sets `"verified": false` and keeps `verification_report.json` alongside the version data. |
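
Layers 1 and 2 can be sketched together: serialize the rows into a fixed-order TSV with LF line endings, hash the result after each step, and compare the per-step hashes to find where two runs diverge. All names below are hypothetical, the two sample steps are simplified stand-ins, and the sketch assumes cells contain no tabs or newlines.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
COLUMNS = ("label", "text", "n_tokens")  # fixed column order from the table above


def canonical_tsv(rows: List[Row]) -> bytes:
    """Serialize rows as UTF-8 TSV with LF endings and a fixed column order."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        lines.append("\t".join(str(row[col]) for col in COLUMNS))
    # Build the bytes by hand so the OS never translates "\n" into "\r\n".
    return ("\n".join(lines) + "\n").encode("utf-8")


def run_with_step_hashes(
    rows: List[Row], steps: List[Callable[[List[Row]], List[Row]]]
) -> Tuple[List[Row], Dict[str, str]]:
    """Apply each step in order and record a hash of its canonical output."""
    step_hashes: Dict[str, str] = {}
    for index, step in enumerate(steps, start=1):
        rows = step(rows)
        step_hashes[f"step_{index}"] = hashlib.sha256(canonical_tsv(rows)).hexdigest()
    return rows, step_hashes


def first_divergent_step(stored: Dict[str, str], recomputed: Dict[str, str]) -> Optional[str]:
    """Return the first step whose hash differs, or None if all match."""
    for key, value in recomputed.items():
        if stored.get(key) != value:
            return key
    return None


def lowercase(rows: List[Row]) -> List[Row]:
    return [dict(r, text=str(r["text"]).lower()) for r in rows]


def count_tokens(rows: List[Row]) -> List[Row]:
    return [dict(r, n_tokens=len(str(r["text"]).split())) for r in rows]


data = [{"text": "Hello World", "label": "pos", "n_tokens": 0}]
final_rows, hashes = run_with_step_hashes(data, [lowercase, count_tokens])
```

Note that the input dictionary lists `text` before `label`, yet the serialized output always follows `COLUMNS`, so key order in memory does not affect the hash.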

---

### Verification Report (`verification_report.json`)

When a version fails, Marco writes a full diagnostic file at `.marco/versions/<version_id>/verification_report.json`:

```json
{
  "version_id": "fe44032c...",
  "verified_at": "2026-03-18T14:01:00Z",
  "result": "FAIL",
  "stored_hash": "ed610672...",
  "computed_hash": "a3f9c1d2...",
  "step_hashes": {
    "step_1": "aabbcc...",
    "step_2": "ddeeff..."
  },
  "stochastic_warnings": [],
  "environment_warnings": ["python_version changed: '3.10' → '3.14'"],
  "row_delta": {
    "count_stored": 1000,
    "count_computed": 998,
    "removed": [["pos", "hello world"], ["neg", "bad product"]],
    "added": []
  }
}
```
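
A `row_delta` block like the one above can be produced with a multiset difference, which also handles duplicate rows correctly. The sketch below is one possible approach, not necessarily the one Marco uses; the function name is hypothetical and the output keys simply mirror the report shown above.

```python
from collections import Counter
from typing import Dict, Sequence


def row_delta(
    stored: Sequence[Sequence[str]], computed: Sequence[Sequence[str]]
) -> Dict[str, object]:
    """Compare two row collections as multisets and report the difference."""
    stored_counts = Counter(tuple(row) for row in stored)
    computed_counts = Counter(tuple(row) for row in computed)
    # Counter subtraction keeps only rows with a positive remaining count.
    removed = stored_counts - computed_counts
    added = computed_counts - stored_counts
    return {
        "count_stored": len(stored),
        "count_computed": len(computed),
        "removed": [list(row) for row in sorted(removed.elements())],
        "added": [list(row) for row in sorted(added.elements())],
    }


stored_rows = [["pos", "hello world"], ["neg", "bad product"], ["pos", "great"]]
computed_rows = [["pos", "great"]]
delta = row_delta(stored_rows, computed_rows)
```

Sorting the `removed` and `added` lists keeps the report itself deterministic from run to run.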

---

## 🧠 Architecture Overview

Marco decouples logic from the file system. All core engine operations sit inside `marco/core/`, including:
- `locker.py`: File-based concurrency control using `.lock` files.
- `repository.py`: CRUD operations for dataset versions and `refs.json` tagging. Enforces **file immutability** by setting read-only permissions (`0o444`) on all version files immediately after creation.
- `preprocessor.py`: A robust Directed Acyclic Graph (DAG) preprocessing engine with deterministic topological execution.
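
To illustrate what deterministic topological execution can mean, here is a minimal sketch of Kahn's algorithm with alphabetical tie-breaking, so the same DAG always yields the same step order. The function name, the input format, and the tie-breaking rule are assumptions for this example; `preprocessor.py` may order steps differently.

```python
import heapq
from typing import Dict, List


def deterministic_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return a topological order of the steps; ties are broken alphabetically.

    `dependencies` maps each step name to the steps it depends on.
    """
    indegree = {step: 0 for step in dependencies}
    dependents: Dict[str, List[str]] = {step: [] for step in dependencies}
    for step, deps in dependencies.items():
        for dep in deps:
            if dep not in dependencies:
                raise KeyError(f"unknown dependency: {dep}")
            indegree[step] += 1
            dependents[dep].append(step)

    # A min-heap always hands out the alphabetically smallest ready step.
    ready = [step for step, degree in indegree.items() if degree == 0]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        step = heapq.heappop(ready)
        order.append(step)
        for child in dependents[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)

    if len(order) != len(dependencies):
        raise ValueError("pipeline contains a cycle")
    return order


pipeline = {
    "tokenize": ["lowercase"],
    "lowercase": ["normalize_newlines"],
    "normalize_newlines": [],
    "deduplicate": ["lowercase"],
    "remove_stopwords": ["tokenize"],
}
order = deterministic_order(pipeline)
```

Without a fixed tie-breaking rule, `deduplicate` and `tokenize` could run in either order, which is harmless for correctness but makes logs and per-step hashes harder to compare across runs.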

### Reproducibility Proof Engine (`marco/reproducibility_proof/`)
- `canonicalizer.py`: Enforces byte-exact canonical serialization (Unix line endings, fixed column order). Also detects stochastic/non-deterministic pipeline operations.
- `audit_log.py`: Appends every verification run (PASS or FAIL) to `.marco/verification_log.jsonl` with a full environment snapshot.
- `verify.py`: End-to-end reproducibility verification with step-by-step hashing, row-level delta reports, environment fingerprinting, and a `verify_all()` batch audit command.
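
The stochastic-operation check can be pictured as a scan over the pipeline steps that flags any random operation without a `seed`. The sketch below assumes a hypothetical step format (`op` plus optional `params`) and uses the two operation names mentioned earlier in this README; the real config schema may look different.

```python
from typing import Dict, List

# Example operation names taken from the reproducibility table in this README.
STOCHASTIC_OPS = {"shuffle", "random_sample"}


def stochastic_warnings(steps: List[Dict[str, object]]) -> List[str]:
    """Return one warning per stochastic step that lacks a `seed` parameter."""
    warnings = []
    for index, step in enumerate(steps, start=1):
        op = step.get("op")
        params = step.get("params") or {}
        if op in STOCHASTIC_OPS and "seed" not in params:
            warnings.append(
                f"step_{index} ({op}) has no seed; output may differ between runs"
            )
    return warnings


pipeline_steps = [
    {"op": "lowercase"},
    {"op": "shuffle"},  # flagged: stochastic and unseeded
    {"op": "random_sample", "params": {"n": 100, "seed": 42}},  # seeded, so fine
]
found = stochastic_warnings(pipeline_steps)
```

Only the unseeded `shuffle` step is reported; the seeded `random_sample` step passes the check.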

Have fun building safer machine learning pipelines!
