Metadata-Version: 2.4
Name: marco-dvcs
Version: 0.1.25
Summary: A minimal dataset versioning system for text data with a focus on reproducibility.
Home-page: https://github.com/Team-Marco-ACM/marco-package
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: Flask
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Marco Dataset Versioning System

A minimal dataset versioning system for text data with a strong focus on reproducibility and transparency. Treat your text datasets like code — immutable, versioned, reproducible, and explainable.

Marco acts as a lightweight Python library, meaning you can initialize it in *any* machine learning project folder to safely version and preprocess your datasets without altering your original files.

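## 🚀 Installation (Linux / macOS)

On modern Linux distributions (such as Arch Linux and Ubuntu 23.04+), the system Python is marked as externally managed (PEP 668), so `pip` refuses to install packages outside a virtual environment. Installing Marco in a virtual environment also prevents conflicts with your system packages.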

Follow these steps to safely install Marco into your ML project:

1. **Clone this repository** to your local machine:
   ```bash
   git clone https://github.com/Team-Marco-ACM/marco-package.git
   cd marco-package
   ```

2. **Navigate to the ML project folder** where you want to train your model (e.g. your bag-of-words project):
   ```bash
   cd ~/projects/my-bag-of-words-model
   ```

3. **Create and activate a Python Virtual Environment**:
   ```bash
   # Create a virtual environment named 'venv'
   python3 -m venv venv
   
   # Activate it (You must do this every time you open a new terminal in this folder)
   source venv/bin/activate
   ```
   *(You should now see `(venv)` at the start of your terminal prompt!)*

4. **Install Marco**:
   ```bash
   # Point pip to the directory where you cloned the marco repository
   pip install -e /path/to/marco
   ```

---

## 🛠️ Usage Guide

Once `marco` is installed in your virtual environment, you have access to the full CLI!

### 1. Initialize a Repository
Initialize Marco tracking in your current directory. This creates a `.marco/` data versioning environment specific to that project.
```bash
marco init
```

### 2. Create an Immutable Version
Upload a text/CSV/TSV dataset to create an immutable version. Marco computes a SHA-256 hash over the raw data and the preprocessing configuration, so a change to either one produces a different hash.

**Interactive Mode:**
If you don't supply a configuration file, Marco interactively guides you through building the preprocessing pipeline (lowercasing, tokenization, stopword removal, deduplication).
```bash
marco upload my_dataset.csv -t v1-raw
```

**Config Mode:**
```bash
marco upload my_dataset.csv -c my_config.json -t v1-processed
```
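
As a rough illustration of hashing raw data together with a preprocessing configuration, here is a minimal sketch. It is not Marco's actual implementation: the helper name `compute_version_id`, the config keys, and the length-prefix scheme are all assumptions made for the example.

```python
import hashlib
import json


def compute_version_id(raw_bytes: bytes, config: dict) -> str:
    """Hash raw data plus config (hypothetical helper, not Marco's API)."""
    # Serialize the config with sorted keys so key order cannot change the hash.
    config_bytes = json.dumps(
        config, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    digest = hashlib.sha256()
    # Assumption: length-prefix the raw data so the boundary between data
    # and config is unambiguous.
    digest.update(len(raw_bytes).to_bytes(8, "big"))
    digest.update(raw_bytes)
    digest.update(config_bytes)
    return digest.hexdigest()


raw = b"positive\tGreat product\nnegative\tTerrible experience\n"
config_a = {"lowercase": True, "dedupe": False}   # hypothetical config keys
config_b = {"dedupe": False, "lowercase": True}   # same settings, other key order
config_c = {"lowercase": False, "dedupe": False}  # one setting changed

id_a = compute_version_id(raw, config_a)
id_b = compute_version_id(raw, config_b)
id_c = compute_version_id(raw, config_c)
print(id_a[:8], id_a == id_b, id_a == id_c)
```

Sorting the keys before hashing means two configs with the same settings hash identically, while changing a single setting yields a different digest.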

### 3. List Versions
View all the versions you've created, along with their tags and timestamps.
```bash
marco list
```

### 4. Restore/Checkout Data
Extract the processed dataset from Marco's storage back into your active workspace to use for model training.
```bash
marco restore v1-processed -o ./training_data.tsv
```

### 5. Start the Interactive Dashboard (New!)
Marco includes a built-in React dashboard with a glassmorphism design for visualizing how your datasets evolve.
```bash
marco generate-web
```
*(This will automatically open your browser to `http://localhost:7654` and load your `.marco` repository).*

**Dashboard Features:**
- **Interactive Lineage Tree**: View the exact Git-style chronological history of your datasets in a visual tree.
- **Auto-Compare Mode**: Clicking any dataset instantly fetches its parent and calculates how the data changed (e.g. "+16% tokens", "-5% documents").
- **Red/Green DAG Tracking**: The graph draws process nodes with color-coded borders (red for token increases, green for token drops) so you can see at a glance where metrics diverge.
- **Alias Tagging**: Your custom `-t` tags (like `v1-raw`) are shown as badges in the UI so you never lose track of hashes.

### 6. Export / Import Versions
Easily share dataset versions with teammates by packing them into `.tar.gz` files.
```bash
# Export version 'v1-raw' to the 'exports' folder
marco export v1-raw ./exports/

# Import an archive sent to you by a coworker
marco import ./exports/marco_version_e5e0b767.tar.gz
```

### 7. Delete Versions
Delete a dataset version to recover disk space. Marco rewrites the lineage (the `parents` history) of any descendant versions so that your Git-style history tree stays connected.
```bash
marco delete v1-raw
# or
marco rm e5e0b767
```
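
One way to keep the history connected after a delete is to re-link each child of the deleted version to that version's own parents. The sketch below shows that idea on a plain dictionary. The `delete_version` function and the `{version: [parents]}` layout are assumptions for illustration, not Marco's internal data model.

```python
from typing import Dict, List


def delete_version(parents: Dict[str, List[str]], target: str) -> Dict[str, List[str]]:
    """Return a new parent map with `target` removed and its children re-linked."""
    inherited = parents.get(target, [])
    updated: Dict[str, List[str]] = {}
    for version, version_parents in parents.items():
        if version == target:
            continue  # drop the deleted version itself
        new_parents: List[str] = []
        for parent in version_parents:
            # Replace the deleted version with its own parents.
            replacements = inherited if parent == target else [parent]
            for candidate in replacements:
                if candidate not in new_parents:
                    new_parents.append(candidate)
        updated[version] = new_parents
    return updated


lineage = {"v1": [], "v2": ["v1"], "v3": ["v2"]}
after = delete_version(lineage, "v2")
print(after)  # {'v1': [], 'v3': ['v1']}
```

Deleting a root version simply leaves its children without parents, which makes them the new roots of the tree.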

### 8. Diff Versions — Unified `---/+++` Format
`marco diff` prints a line-by-line unified diff directly to the terminal by default — no files written, no clutter.

**Diff the raw source data (default):**
```bash
marco diff v1.0 v2.0
```

**Diff the post-pipeline preprocessed output:**
```bash
marco diff v1.0 v2.0 --target preprocessed
```

**Diff the DAG preprocessing configuration:**
```bash
marco diff v1.0 v2.0 --target config
```

**Control how many context lines surround each hunk (default: 3):**
```bash
marco diff v1.0 v2.0 --context 10
```

**Save a full Markdown metrics report to a folder (opt-in):**
```bash
marco diff v1.0 v2.0 --save ./reports
```

**Example terminal output:**
```diff
--- v1.0 (abc12345)/raw.txt
+++ v2.0 (def67890)/raw.txt
@@ -1,4 +1,5 @@
 positive  Great product, love it!
-negative  Terrible experience.
+negative  Poor experience, not recommended.
+negative  Worst purchase I ever made.
 positive  Highly recommend to everyone.
 neutral   It was okay, nothing special.
```
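
Output in this format can be produced with Python's standard `difflib` module. The sketch below is only an approximation of what `marco diff` does: the function name `unified_text_diff` and the file labels are assumptions, while `difflib.unified_diff` and its `n` (context lines) parameter are standard library features.

```python
import difflib


def unified_text_diff(old_text, new_text, old_label, new_label, context=3):
    """Return a unified diff of two texts as one string (empty if identical)."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=old_label, tofile=new_label, n=context
    )
    return "".join(diff)


v1 = (
    "positive\tGreat product, love it!\n"
    "negative\tTerrible experience.\n"
    "positive\tHighly recommend to everyone.\n"
    "neutral\tIt was okay, nothing special.\n"
)
v2 = (
    "positive\tGreat product, love it!\n"
    "negative\tPoor experience, not recommended.\n"
    "negative\tWorst purchase I ever made.\n"
    "positive\tHighly recommend to everyone.\n"
    "neutral\tIt was okay, nothing special.\n"
)

diff_text = unified_text_diff(v1, v2, "v1.0/raw.txt", "v2.0/raw.txt")
print(diff_text, end="")
```

The `context` argument plays the same role as the `--context` flag: it sets how many unchanged lines surround each hunk.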

---

### 9. KL Divergence & Token Analytics
Marco goes beyond tracking vocabulary size. It uses Kullback-Leibler (KL) divergence to measure how the token probability distribution shifts between two dataset versions. This shows what kind of change happened, for example whether common words disappeared or rare domain terms became dominant.

Compute this distribution delta by passing two version hashes or tags to the `token-analytics` command:
```bash
marco token-analytics v1-raw v2-processed
```
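
For readers who want to see the underlying calculation, here is a minimal sketch of KL divergence between two token distributions. It is not Marco's implementation: the function name `kl_divergence` and the add-alpha smoothing (used so that tokens missing from one version do not cause a division by zero) are assumptions chosen for the example.

```python
import math
from collections import Counter


def kl_divergence(tokens_p, tokens_q, alpha=1.0):
    """KL(P || Q) in nats: sum over tokens of P(w) * log(P(w) / Q(w))."""
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    vocab = sorted(set(counts_p) | set(counts_q))
    # Add-alpha smoothing gives every token in the joint vocabulary a
    # non-zero probability in both distributions.
    total_p = sum(counts_p.values()) + alpha * len(vocab)
    total_q = sum(counts_q.values()) + alpha * len(vocab)
    divergence = 0.0
    for token in vocab:
        p = (counts_p[token] + alpha) / total_p
        q = (counts_q[token] + alpha) / total_q
        divergence += p * math.log(p / q)
    return divergence


old_tokens = "the cat sat on the mat".split()
new_tokens = "the tensor sat on the gpu".split()

print(kl_divergence(old_tokens, old_tokens))  # identical versions give 0.0
print(round(kl_divergence(old_tokens, new_tokens), 4))
```

KL divergence is asymmetric, so swapping the two versions generally gives a different number. The order of the arguments therefore matters when comparing results.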

---

## 🔬 Reproducibility Proof

Marco does more than store a dataset version: it lets you check that the version is **reproducible**. The `marco/reproducibility_proof/` module verifies, by comparing hashes, that re-running the exact same pipeline on the same raw data produces a byte-identical result.

### 1. Verify a Single Version

Re-runs the full preprocessing pipeline from scratch and compares the freshly computed output hash to the stored hash in the manifest.

```bash
marco verify v1-raw
# or with a full hash
marco verify fe44032c
```

**Example output:**
```
🔬 Verifying version: fe44032c70164a71...

  Pipeline Step Verification:
    ✅  step_1 normalize_newlines
    ✅  step_2 lowercase

  Stored  output_hash: ed610672ea28e065...
  Recomputed   hash:   ed610672ea28e065...

  ✅ VERIFIED — output hash matches. Version is reproducible.
```

- On **PASS**: the manifest is updated with `"verified": true`, a timestamp, the current environment snapshot, and per-step hashes for future comparisons.
- On **FAIL**: a `verification_report.json` is written inside the version folder with a row-level delta showing exactly which rows differ. Exit code is `1`.

---

### 2. Verify All Versions at Once

Audit every version in the repository and get a clean summary table.

```bash
marco verify-all
```

**Example output:**
```
──────────────────────────────────────────────────────────────────────
  Marco Reproducibility Audit — 3 version(s) found
──────────────────────────────────────────────────────────────────────
🔬 Verifying version: a3f9c1d2...  ✅ VERIFIED
🔬 Verifying version: fe44032c...  ✅ VERIFIED
🔬 Verifying version: 83f77823...  ❌ FAIL
──────────────────────────────────────────────────────────────────────
  Summary: 2/3 versions passed.
──────────────────────────────────────────────────────────────────────
```
Returns exit code `1` if any version fails — making it CI/CD friendly.

---

### How the 7-Layer Reproducibility Engine Works

| # | Guarantee | How It Works |
|---|---|---|
| 1 | **Step-by-step hash verification** | Hashes the output after every individual DAG step. Pinpoints exactly which step introduced non-determinism (`✅ step_1`, `❌ step_2`). |
| 2 | **Byte-exact canonical enforcer** | Before hashing, data is serialized into a strictly defined canonical TSV format: Unix line endings only, fixed column order (`label`, `text`, `n_tokens`), UTF-8 encoding. Prevents false `FAIL` results caused by OS differences (Windows CRLF vs Linux LF). |
| 3 | **Row-level delta on FAIL** | When a hash mismatch occurs, computes exactly which rows were added or removed between the stored output and fresh recompute, saved in `verification_report.json`. |
| 4 | **Permanent audit trail** | Every verification run (PASS or FAIL) is appended to `.marco/verification_log.jsonl` with a timestamp, result, and full environment snapshot — a tamper-evident history. |
| 5 | **Environment fingerprinting** | Stores Python version, NumPy/Pandas versions, and platform info in the manifest at verification time. Warns on re-verification if the environment changed: `"Python 3.10 → 3.14 (may explain hash mismatch)"`. |
| 6 | **Stochastic operation detector** | Scans the pipeline DAG config for non-deterministic functions (e.g. `shuffle`, `random_sample`) that lack a `seed` parameter. Warns before verifying so expectations are set correctly. |
| 7 | **Manifest integrity update** | On PASS: sets `"verified": true` in `manifest.json` and persists per-step hashes. On FAIL: sets `"verified": false` and keeps `verification_report.json` alongside the version data. |
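
Guarantees 1 and 2 in the table can be sketched together: serialize rows into a canonical TSV byte string, then hash that string after every pipeline step. This is an illustrative sketch rather than the code in `canonicalizer.py`. The helper names and the choice to flatten embedded line breaks and tabs into spaces are assumptions.

```python
import hashlib

COLUMNS = ("label", "text", "n_tokens")  # fixed column order from the table above


def canonical_bytes(rows):
    """Serialize rows to TSV bytes: fixed columns, LF line endings, UTF-8."""
    lines = []
    for row in rows:
        fields = []
        for name in COLUMNS:
            value = str(row[name])
            # Assumption: embedded line breaks and tabs become spaces so
            # that every row occupies exactly one line.
            for char in ("\r\n", "\r", "\n", "\t"):
                value = value.replace(char, " ")
            fields.append(value)
        lines.append("\t".join(fields) + "\n")
    return "".join(lines).encode("utf-8")


def run_with_step_hashes(rows, steps):
    """Apply each (name, function) step in order, hashing the output after each."""
    step_hashes = {}
    for index, (name, func) in enumerate(steps, start=1):
        rows = func(rows)
        digest = hashlib.sha256(canonical_bytes(rows)).hexdigest()
        step_hashes[f"step_{index} {name}"] = digest
    return rows, step_hashes


def normalize_newlines(rows):
    return [{**row, "text": row["text"].replace("\r\n", "\n")} for row in rows]


def lowercase(rows):
    return [{**row, "text": row["text"].lower()} for row in rows]


data = [
    {"text": "Great Product", "label": "pos", "n_tokens": 2},  # keys in any order
    {"label": "neg", "n_tokens": 2, "text": "Bad\r\nProduct"},
]
steps = [("normalize_newlines", normalize_newlines), ("lowercase", lowercase)]
final_rows, hashes = run_with_step_hashes(data, steps)
print(canonical_bytes(final_rows))
for step, digest in hashes.items():
    print(step, digest[:16])
```

Comparing the per-step digests from two runs shows the first step at which the outputs diverge, which is how a mismatch can be narrowed down to a single step.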

---

### Verification Report (`verification_report.json`)

When a version fails, Marco writes a full diagnostic file at `.marco/versions/<version_id>/verification_report.json`:

```json
{
  "version_id": "fe44032c...",
  "verified_at": "2026-03-18T14:01:00Z",
  "result": "FAIL",
  "stored_hash": "ed610672...",
  "computed_hash": "a3f9c1d2...",
  "step_hashes": {
    "step_1": "aabbcc...",
    "step_2": "ddeeff..."
  },
  "stochastic_warnings": [],
  "environment_warnings": ["python_version changed: '3.10' → '3.14'"],
  "row_delta": {
    "count_stored": 1000,
    "count_computed": 998,
    "removed": [["pos", "hello world"], ["neg", "bad product"]],
    "added": []
  }
}
```
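
A `row_delta` block like the one above can be derived with a multiset difference, which also handles duplicate rows correctly. The sketch below assumes rows are `(label, text)` pairs and uses a hypothetical `row_delta` function. Marco's own report code may differ.

```python
from collections import Counter


def row_delta(stored_rows, computed_rows):
    """Report which rows were removed or added between two outputs."""
    stored = Counter(tuple(row) for row in stored_rows)
    computed = Counter(tuple(row) for row in computed_rows)
    # Counter subtraction keeps only positive counts, so duplicates count.
    removed = sorted((stored - computed).elements())
    added = sorted((computed - stored).elements())
    return {
        "count_stored": len(stored_rows),
        "count_computed": len(computed_rows),
        "removed": [list(row) for row in removed],
        "added": [list(row) for row in added],
    }


stored_output = [("pos", "hello world"), ("neg", "bad product"), ("pos", "great")]
fresh_output = [("pos", "great")]

delta = row_delta(stored_output, fresh_output)
print(delta)
```

Sorting the removed and added rows keeps the report itself deterministic, so two failing runs with the same cause produce identical reports.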

---

## 🧠 Architecture Overview

Marco decouples logic from the file system. All core engine operations sit inside `marco/core/`, including:
- `locker.py`: File-based concurrency control using `.lock` files.
- `repository.py`: CRUD operations for dataset versions and `refs.json` tagging. Enforces **file immutability** by setting read-only permissions (`0o444`) on all version files immediately after creation.
- `preprocessor.py`: A robust Directed Acyclic Graph (DAG) preprocessing engine with deterministic topological execution.
- `comparator.py`: Version comparison — Markdown metric reports (`compare_versions`) and classic `---`/`+++` unified diffs (`diff_versions_text`) using Python's stdlib `difflib`.
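
To illustrate what deterministic topological execution means for a preprocessing DAG, here is a small sketch based on Kahn's algorithm, with ties broken alphabetically so the order is the same on every run. The `topological_order` function and the `{step: [dependencies]}` layout are assumptions. The real `preprocessor.py` may represent its DAG differently.

```python
import heapq
from typing import Dict, List


def topological_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return step names so that every step comes after its dependencies."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    dependents: Dict[str, List[str]] = {step: [] for step in dependencies}
    for step, deps in remaining.items():
        for dep in deps:
            if dep not in dependencies:
                raise KeyError(f"unknown dependency {dep!r} for step {step!r}")
            dependents[dep].append(step)
    # A min-heap makes the choice among ready steps alphabetical, hence stable.
    ready = [step for step, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        step = heapq.heappop(ready)
        order.append(step)
        for child in dependents[step]:
            remaining[child].discard(step)
            if not remaining[child]:
                heapq.heappush(ready, child)
    if len(order) != len(dependencies):
        raise ValueError("cycle detected in pipeline DAG")
    return order


pipeline = {
    "tokenize": ["lowercase"],
    "lowercase": ["normalize_newlines"],
    "normalize_newlines": [],
    "dedupe": ["lowercase"],
}
print(topological_order(pipeline))
```

Without a fixed tie-breaking rule, `dedupe` and `tokenize` could run in either order, and any step with side effects on shared data would then make the output hash depend on that order.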

### Reproducibility Proof Engine (`marco/reproducibility_proof/`)
- `canonicalizer.py`: Enforces byte-exact canonical serialization (Unix line endings, fixed column order). Also detects stochastic/non-deterministic pipeline operations.
- `audit_log.py`: Appends every verification run (PASS or FAIL) to `.marco/verification_log.jsonl` with a full environment snapshot.
- `verify.py`: End-to-end reproducibility verification with step-by-step hashing, row-level delta reports, environment fingerprinting, and a `verify_all()` batch audit command.
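
The audit log and environment fingerprinting can be pictured with the sketch below, which appends one JSON object per line and compares two environment snapshots. All function names here are hypothetical, and the entry fields are modeled on the report shown earlier rather than taken from `audit_log.py`.

```python
import json
import platform
import tempfile
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def package_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_snapshot():
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "numpy": package_version("numpy"),
        "pandas": package_version("pandas"),
    }


def environment_warnings(stored, current):
    """List every snapshot key whose value changed between two runs."""
    return [
        f"{key} changed: {stored[key]!r} → {current.get(key)!r}"
        for key in sorted(stored)
        if stored[key] != current.get(key)
    ]


def append_audit_entry(log_path, version_id, result):
    """Append one verification record as a single JSON line."""
    entry = {
        "version_id": version_id,
        "verified_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
        "environment": environment_snapshot(),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / ".marco" / "verification_log.jsonl"
    append_audit_entry(log_path, "fe44032c", "PASS")
    append_audit_entry(log_path, "83f77823", "FAIL")
    lines = log_path.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines]

print([entry["result"] for entry in entries])
print(environment_warnings({"python_version": "3.10"}, {"python_version": "3.14"}))
```

Opening the file in append mode means earlier records are never rewritten, and one JSON object per line keeps the log easy to stream or filter with standard tools.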

Have fun building safer machine learning pipelines!
