Metadata-Version: 2.4
Name: marco-dvcs
Version: 0.1.67
Summary: A minimal dataset versioning system for text data with a focus on reproducibility.
Home-page: https://github.com/Team-Marco-ACM/marco-package
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: Flask
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Marco Dataset Versioning System (DVCS)

A minimal dataset versioning system for text data with a strong focus on reproducibility and transparency. Treat your text datasets like code — immutable, versioned, reproducible, and explainable.

Marco is a lightweight Python CLI tool. Initialize it inside *any* ML project folder to version and preprocess your datasets without altering the original files.

---

## 🚀 Installation

```bash
# Recommended: install inside a virtual environment
python -m venv venv
source venv/bin/activate        # Linux / macOS
venv\Scripts\activate           # Windows

pip install marco-dvcs
```

---

## 🗂️ Quick Command Reference

| Command | Description |
|---------|-------------|
| `marco init` | Initialize a Marco repository in the current directory |
| `marco upload <file>` | Create a versioned, preprocessed dataset snapshot |
| `marco list` | List all tracked versions |
| `marco lineage` | Print ASCII lineage tree |
| `marco diff <v1> <v2>` | Compare two versions (unified diff or metrics) |
| `marco restore <version>` | Restore processed data to your workspace |
| `marco attach-results <version> <file>` | Attach a training results file to an existing version |
| `marco export <version> <dest>` | Export a version to `.tar.gz` |
| `marco import <tarball>` | Import a version from a `.tar.gz` |
| `marco delete <version>` | Delete a version and heal the lineage |
| `marco token-analytics <v1> <v2>` | KL-divergence token shift analysis |
| `marco evaluate <version>` | Dataset structural health report |
| `marco drift <v1> <v2>` | Detect model performance degradation |
| `marco verify <version>` | Cryptographic reproducibility proof |
| `marco verify-all` | Audit every version in the repo |
| `marco generate-web` | Launch the interactive web UI |

---

## 🛠️ Usage Guide

### 1. Initialize a Repository

Initialize Marco tracking in your current project directory. This creates a `.marco/` folder containing the registry, chain, and lock files.

```bash
marco init
```

---

### 2. Create an Immutable Version (`upload`)

Upload a raw `.txt`, `.csv`, or `.tsv` dataset. Marco runs the preprocessing pipeline, computes a SHA-256 hash, and creates an immutable snapshot.

**Interactive Mode** — no config file needed; Marco prompts you step by step:
```bash
marco upload my_dataset.csv -t v1-raw
```

**Config Mode** — supply a `config.json` to skip the prompts:
```bash
marco upload my_dataset.csv -c config.json -t v1-processed
```

**With Training Results** — bundle a results file at upload time:
```bash
marco upload my_dataset.csv -t v1-trained -r ./results/training_results.txt
```

> **Note:** During upload, Marco auto-generates `raw_stats.json` (pre-pipeline statistics) and `post_stats.json` (post-pipeline reduction statistics) inside the version directory.

The config file format for a preprocessing pipeline (`config.json`):
```json
{
  "dag": {
    "step_1_normalize_newlines": { "func": "normalize_newlines", "params": {}, "depends_on": [] },
    "step_2_lowercase":          { "func": "lowercase",          "params": { "enabled": true }, "depends_on": ["step_1_normalize_newlines"] },
    "step_3_remove_stopwords":   { "func": "remove_stopwords",   "params": { "language": "english" }, "depends_on": ["step_2_lowercase"] },
    "step_4_filter_length":      { "func": "filter_length",      "params": { "min_tokens": 5, "max_tokens": 200 }, "depends_on": ["step_3_remove_stopwords"] },
    "step_5_deduplicate":        { "func": "deduplicate",        "params": { "method": "exact" }, "depends_on": ["step_4_filter_length"] }
  }
}
```
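To make the role of `depends_on` concrete, here is a minimal sketch of how an execution order can be derived from a config of this shape with a topological sort. It is an illustration only, not Marco's actual code: `execution_order` is a hypothetical helper, and it relies on the standard-library `graphlib` module, which needs Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Same shape as the config.json above, trimmed to the fields used here and
# deliberately listed out of order.
config = {
    "dag": {
        "step_2_lowercase": {"func": "lowercase", "depends_on": ["step_1_normalize_newlines"]},
        "step_1_normalize_newlines": {"func": "normalize_newlines", "depends_on": []},
        "step_3_remove_stopwords": {"func": "remove_stopwords", "depends_on": ["step_2_lowercase"]},
    }
}


def execution_order(config):
    """Hypothetical helper: list step names so each step follows its dependencies."""
    dag = config["dag"]
    graph = {name: set(step["depends_on"]) for name, step in dag.items()}
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"depends_on references undefined steps: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the steps form a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(config)
print(order)
```

Because the steps form a single chain, the order is fully determined by the dependencies, regardless of how the keys are listed in the file.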

Available pipeline functions:

| Function | Params | Description |
|----------|--------|-------------|
| `normalize_newlines` | — | Collapse `\r\n` and `\r` to `\n` |
| `lowercase` | `enabled: bool` | Convert all text to lowercase |
| `tokenize` | `method: "whitespace"\|"word"\|"char"` | Tokenize text (adds `n_tokens` column) |
| `remove_stopwords` | `language: "english"\|...` | Remove common stopwords |
| `filter_length` | `min_tokens: int, max_tokens: int` | Drop docs outside token range |
| `deduplicate` | `method: "exact"` | Remove duplicate rows |

---

### 3. Attach Training Results (Decoupled Workflow)

If you train *after* uploading, attach results to the existing version later:

```bash
# Step 1 — restore preprocessed data for training
marco restore v1-raw -o ./my_training_data.tsv

# Step 2 — train your model, then attach the results
marco attach-results v1-raw training_results.txt
# alias:
marco attach v1-raw training_results.txt
```

---

### 4. List Versions

```bash
marco list
```

Prints a table of all versions with short hash, timestamp, user, and tags.

---

### 5. View Lineage Tree

ASCII-art visualisation of the full project history up to the Root Node:

```bash
marco lineage
```

---

### 6. Restore / Checkout Data

Extract the processed dataset back into your workspace for training:

```bash
marco restore v1-processed -o ./training_data.tsv
# alias:
marco checkout v1-processed -o ./training_data.tsv
```

---

### 7. Export / Import Versions

Share dataset versions with teammates as portable `.tar.gz` archives:

```bash
# Export
marco export v1-raw ./exports/

# Import
marco import ./exports/marco_version_e5e0b767.tar.gz
```
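
For readers curious about what such an archive round trip involves, here is a minimal sketch using the standard-library `tarfile` module. The directory layout, the `manifest.json` contents, and both helper functions are assumptions made for the example. Marco's real archive layout may differ.

```python
import tarfile
import tempfile
from pathlib import Path


def export_version(version_dir, dest_dir):
    """Hypothetical helper: pack one version directory into a .tar.gz file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"marco_version_{version_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(version_dir, arcname=version_dir.name)
    return archive


def import_version(archive, versions_dir):
    """Hypothetical helper: unpack an archive into a versions directory."""
    versions_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(versions_dir, filter="data")
        else:
            tar.extractall(versions_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "source" / "e5e0b767"
    source.mkdir(parents=True)
    (source / "manifest.json").write_text('{"verified": true}')

    archive = export_version(source, root / "exports")
    archive_name = archive.name
    import_version(archive, root / "target")
    restored = (root / "target" / "e5e0b767" / "manifest.json").read_text()

print(archive_name, restored)
```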

---

### 8. Delete a Version

Deletes the version files and heals the lineage so descendant version links remain valid:

```bash
marco delete v1-raw
# alias:
marco rm e5e0b767
```
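
One way to picture "healing" is re-parenting: the children of the deleted version are attached to its parent. The sketch below shows this on a plain parent map. The data structure and the `delete_and_heal` helper are hypothetical and only illustrate the idea, not Marco's registry format.

```python
def delete_and_heal(parents, version):
    """Hypothetical helper: remove `version` and attach its children to its parent.

    `parents` maps each version id to its parent id (None for the root).
    """
    if version not in parents:
        raise KeyError(version)
    grandparent = parents[version]
    healed = {}
    for node, parent in parents.items():
        if node == version:
            continue
        healed[node] = grandparent if parent == version else parent
    return healed


lineage = {"v1": None, "v2": "v1", "v3": "v2", "v4": "v2"}
healed = delete_and_heal(lineage, "v2")
print(healed)
```

In this sketch, deleting a root version turns each of its children into a new root, since their parent becomes `None`.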

---

### 9. Diff Versions

Compare two versions with a human-readable metric summary and unified line-by-line diff:

```bash
# Diff raw input (default)
marco diff v1.0 v2.0

# Diff preprocessed output
marco diff v1.0 v2.0 --target preprocessed

# Diff DAG config
marco diff v1.0 v2.0 --target config

# Control context lines (default 3)
marco diff v1.0 v2.0 --context 10

# Save a Markdown metrics report instead of printing to terminal
marco diff v1.0 v2.0 --save ./reports
```

Example output:
```diff
📊 Summary: v1.0 → v2.0
  ► N Documents: 2.00% increase (50 → 51)
  ► N Tokens: 15.30% increase (1000 → 1153)
  ► Vocab Size: 5.00% increase (300 → 315)
─────────────────────────────────────────────────────────────────

--- v1.0 (abc12345)/raw.txt
+++ v2.0 (def67890)/raw.txt
@@ -1,5 +1,6 @@
  positive  Great product, love it!
-negative  Terrible experience.
+negative  Poor experience, not recommended.
+negative  Worst purchase I ever made.
```
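
The two parts of that output can be approximated in a few lines of Python: a percentage change per metric, and a unified diff from the standard-library `difflib` module. This is a sketch under assumptions, not Marco's comparator: the helper names are hypothetical, and the file names passed to the diff are placeholders.

```python
import difflib


def percent_change(old, new):
    if old == 0:
        raise ValueError("percentage change is undefined for a zero baseline")
    return (new - old) / old * 100.0


def summary_line(label, old, new):
    change = percent_change(old, new)
    direction = "increase" if change >= 0 else "decrease"
    return f"{label}: {abs(change):.2f}% {direction} ({old} → {new})"


def unified(old_lines, new_lines, old_name, new_name, context=3):
    # n controls the number of context lines, like the --context option.
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=old_name, tofile=new_name,
        n=context, lineterm="",
    ))


old = ["positive\tGreat product, love it!", "negative\tTerrible experience."]
new = [
    "positive\tGreat product, love it!",
    "negative\tPoor experience, not recommended.",
    "negative\tWorst purchase I ever made.",
]

docs_summary = summary_line("N Documents", 50, 51)
tokens_summary = summary_line("N Tokens", 1000, 1153)
diff = unified(old, new, "v1.0/raw.txt", "v2.0/raw.txt")

print(docs_summary)
print(tokens_summary)
print("\n".join(diff))
```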

---

### 10. Token Analytics (KL Divergence)

Measure how token probability distributions shift between two versions using Kullback-Leibler divergence:

```bash
marco token-analytics v1-raw v2-processed
```

Returns a JSON report with vocabulary overlap, token frequency shifts, and top divergent terms.
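
For intuition, the sketch below computes a KL divergence between the token distributions of two tiny document sets. It assumes whitespace tokenization and add-one smoothing over the shared vocabulary, so tokens missing from one version do not produce a division by zero. Marco's tokenization, smoothing, and report fields may differ, and all names here are illustrative.

```python
import math
from collections import Counter


def token_counts(docs):
    return Counter(tok for doc in docs for tok in doc.split())


def smoothed_distribution(counts, vocab, alpha=1.0):
    # Add-alpha smoothing (an assumption) keeps every probability above zero.
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; both dicts must share the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


v1_docs = ["great product love it", "terrible experience"]
v2_docs = ["great product love it", "poor experience not recommended"]

c1, c2 = token_counts(v1_docs), token_counts(v2_docs)
vocab = sorted(set(c1) | set(c2))
p = smoothed_distribution(c1, vocab)
q = smoothed_distribution(c2, vocab)

divergence = kl_divergence(p, q)
# Per-token contribution to the divergence, used to rank divergent terms.
contributions = {t: p[t] * math.log(p[t] / q[t]) for t in vocab}
top_terms = sorted(vocab, key=lambda t: abs(contributions[t]), reverse=True)[:3]
overlap = len(set(c1) & set(c2)) / len(vocab)

print(round(divergence, 4), top_terms, round(overlap, 2))
```

Note that KL divergence is asymmetric: swapping the two versions generally gives a different value.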

---

### 11. Evaluate Dataset Health

Get a structural health report for a single version, covering the pipeline determinism grade, data volatility index (DVI), and reproducibility confidence score (RCS):

```bash
marco evaluate v1-processed
```

Example output:
```
📊 Evaluating Structural Health of Dataset 'v1-processed'...

  ► Pipeline Determinism... [A Grade]
  ► Computing Data Volatility (vs. Raw)... [12% Volatile]
  ► Running Reproducibility Proofs... [High Confident]

✅ Evaluation Complete! Overall Health: EXCELLENT
📄 Generating full report: .marco/versions/<id>/evaluation_report.json
```

---

### 12. Model Drift Detection

Detect performance degradation when training data changes. Marco extracts baseline metrics from the attached `training_results.txt` using regular expressions, then runs an internal Naive Bayes model against the new data in memory:

```bash
marco drift v1-trained v2-processed
```

Example output:
```
MODEL DRIFT ANALYSIS
════════════════════════════════════════
Trained on:   v1-trained (fe4403)
Evaluated on: v2-processed (a3f9c1)
════════════════════════════════════════
Metric       Baseline    New Data    Drop     Severity
────────────────────────────────────────────────────
Accuracy     0.700       0.612      -12.6%   CRITICAL
F1           0.547       0.485      -11.3%   CRITICAL
════════════════════════════════════════
Overall: RETRAIN
```
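
The "Drop" column in that table is a relative change against the baseline. The sketch below reproduces those percentages and shows one possible regex-based metric extraction. The results file format (`name: value` lines), the severity thresholds, and the `WARNING` and `OK` labels are all assumptions for illustration. Marco's actual patterns and cut-offs are not documented here.

```python
import re

# Assumed format: one "name: value" or "name = value" pair per line.
METRIC_PATTERN = re.compile(r"(?im)^\s*(accuracy|f1)\s*[:=]\s*([0-9]*\.?[0-9]+)")


def parse_metrics(text):
    return {name.lower(): float(value) for name, value in METRIC_PATTERN.findall(text)}


def relative_change(baseline, new):
    return (new - baseline) / baseline * 100.0


def severity(change_pct, warn=-5.0, critical=-10.0):
    # Hypothetical thresholds, chosen only for this example.
    if change_pct <= critical:
        return "CRITICAL"
    if change_pct <= warn:
        return "WARNING"
    return "OK"


baseline = parse_metrics("Accuracy: 0.700\nF1: 0.547\n")
new_scores = {"accuracy": 0.612, "f1": 0.485}

report = {}
for metric, base in baseline.items():
    change = relative_change(base, new_scores[metric])
    report[metric] = (round(change, 1), severity(change))

print(report)
```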

---

### 13. Reproducibility Verification

#### Verify a single version
Re-runs the pipeline from scratch and compares the hash against the stored manifest:

```bash
marco verify v1-raw
# also accepts a version hash
marco verify fe44032c
```

Example output:
```
🔬 Verifying version: fe44032c70164a71...

  Pipeline Step Verification:
    ✅  step_1 normalize_newlines
    ✅  step_2 lowercase

  Stored  output_hash: ed610672ea28e065...
  Recomputed   hash:   ed610672ea28e065...

  ✅ VERIFIED — output hash matches. Version is reproducible.
```
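
The per-step check marks above suggest hashing the data after every step and comparing each hash with a stored one. A minimal sketch of that idea follows. It works on a single string and uses hypothetical helper names, so treat it as an illustration of the technique rather than Marco's verifier.

```python
import hashlib


def sha256_text(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_with_step_hashes(text, steps):
    """Apply each (name, func) step in order and record a hash after every step."""
    hashes = []
    for name, func in steps:
        text = func(text)
        hashes.append((name, sha256_text(text)))
    return text, hashes


def first_divergent_step(stored, recomputed):
    """Return the first step whose hash differs, or None if all match."""
    for (name, old_hash), (_, new_hash) in zip(stored, recomputed):
        if old_hash != new_hash:
            return name
    return None


steps = [
    ("step_1 normalize_newlines", lambda t: t.replace("\r\n", "\n").replace("\r", "\n")),
    ("step_2 lowercase", str.lower),
]
raw = "Great Product\r\nLove It"

_, stored = run_with_step_hashes(raw, steps)
_, recomputed = run_with_step_hashes(raw, steps)
clean_result = first_divergent_step(stored, recomputed)

# Simulate a stored record that no longer matches at the second step.
tampered = [stored[0], (stored[1][0], "0" * 64)]
failed_step = first_divergent_step(tampered, recomputed)

print(clean_result, failed_step)
```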

#### Verify all versions at once
```bash
marco verify-all
```

Exits with code `1` if any version fails verification, so the command can gate a CI/CD job.
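
If your CI job is driven from Python, the exit code can be checked with `subprocess`. The sketch below uses a hypothetical `verification_passed` helper and stand-in commands so it runs without Marco installed. In a real job you would pass `["marco", "verify-all"]` instead.

```python
import subprocess
import sys


def verification_passed(command):
    """Run a command and report whether it exited with code 0."""
    return subprocess.run(command).returncode == 0


# Stand-in commands that exit with 0 and 1, used here in place of Marco.
passed = verification_passed([sys.executable, "-c", "raise SystemExit(0)"])
failed = verification_passed([sys.executable, "-c", "raise SystemExit(1)"])

print(passed, failed)
```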

---

### 14. Web UI (`generate-web`)

Launch a local interactive web UI to visually explore all your tracked versions, their DAG pipelines, preprocessed data, training results, and health statistics.

```bash
marco generate-web
```

Opens `http://localhost:7654` in your browser automatically. The UI provides the following views:

| View | Description |
|------|-------------|
| **DAG** | Interactive pipeline graph — click any node to inspect step metrics and parameters |
| **Code** | Pipeline steps rendered as readable code blocks |
| **Data** | Browse raw input, preprocessed output, training results, and config per version |
| **Sidebar** | Git-style lineage tree grouped by chain with ref labels (`v1-raw`, `v2-updated`, etc.) |
| **Summary Bar** | Key metrics (docs, tokens, vocab) with +/- delta when comparing two versions |

Press **Ctrl+C** to stop the server.

---

## 🔬 Reproducibility Engine — 7-Layer Architecture

| # | Guarantee | How |
|---|-----------|-----|
| 1 | **Step-by-step hash verification** | Hashes output after every DAG step; pinpoints the exact step that introduced non-determinism |
| 2 | **Byte-exact canonical enforcer** | Serializes data to a fixed TSV format (Unix LF, fixed columns, UTF-8) before hashing |
| 3 | **Row-level delta on FAIL** | Computes exactly which rows differ and saves them in `verification_report.json` |
| 4 | **Permanent audit trail** | Every run appended to `.marco/verification_log.jsonl` (tamper-evident) |
| 5 | **Environment fingerprinting** | Stores Python / NumPy / Pandas versions; warns on re-run if environment changed |
| 6 | **Stochastic operation detector** | Warns if pipeline contains `shuffle` / `random_sample` without a `seed` |
| 7 | **Manifest integrity update** | Sets `"verified": true/false` in `manifest.json` after every run |
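
Layers 2 and 6 are easy to picture in code. The sketch below serializes rows to a fixed TSV form before hashing, and flags stochastic steps that lack a seed. The escaping rules, the location of `seed` inside `params`, and all function names are assumptions made for this example, not Marco's actual format.

```python
import hashlib


def canonical_bytes(rows, columns):
    """Serialize rows as TSV with a fixed column order, LF line endings, UTF-8."""
    lines = ["\t".join(columns)]
    for row in rows:
        cells = []
        for col in columns:
            value = str(row[col]).replace("\r\n", "\n").replace("\r", "\n")
            # Escape backslashes, tabs, and newlines so each row stays on one line.
            value = value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")
            cells.append(value)
        lines.append("\t".join(cells))
    return ("\n".join(lines) + "\n").encode("utf-8")


def output_hash(rows, columns):
    return hashlib.sha256(canonical_bytes(rows, columns)).hexdigest()


STOCHASTIC_FUNCS = {"shuffle", "random_sample"}


def unseeded_stochastic_steps(config):
    # Assumption: the seed, when present, is stored under the step's "params".
    return [
        name for name, step in config["dag"].items()
        if step["func"] in STOCHASTIC_FUNCS and "seed" not in step.get("params", {})
    ]


columns = ["label", "text"]
rows_a = [{"label": "positive", "text": "Great\r\nproduct"}]
rows_b = [{"text": "Great\nproduct", "label": "positive"}]  # other key order, LF
same_hash = output_hash(rows_a, columns) == output_hash(rows_b, columns)

config = {"dag": {
    "s1": {"func": "shuffle", "params": {}},
    "s2": {"func": "shuffle", "params": {"seed": 42}},
    "s3": {"func": "lowercase", "params": {}},
}}
warnings = unseeded_stochastic_steps(config)

print(same_hash, warnings)
```

Hashing the canonical bytes rather than the in-memory object means differences in key order or line endings do not change the hash, while any change in actual cell content does.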

---

## 🧠 Architecture Overview

```
marco/
├── cli/
│   ├── main.py          — CLI entry point, all command routing
│   ├── interactive.py   — Interactive pipeline builder (prompts)
│   └── web_serve.py     — HTTP server for the React web UI
├── core/
│   ├── repository.py    — Version CRUD, refs.json tagging, file immutability
│   ├── preprocessor.py  — DAG-based preprocessing engine (topological sort)
│   ├── comparator.py    — Unified diff and Markdown metric reports
│   └── evaluate.py      — Structural health evaluation (DVI, RCS grades)
├── token_analytics/
│   └── distribution.py  — KL divergence & token frequency shift analysis
├── model_drift/         — In-memory NaiveBayes drift detection
├── reproducibility_proof/
│   ├── verify.py        — End-to-end reproducibility verification
│   ├── canonicalizer.py — Byte-exact canonical serialization
│   └── audit_log.py     — Append-only verification log
└── ui/                  — React + Vite web UI source
    └── dist/            — Built UI assets served by generate-web
```

---

Have fun building safer machine learning pipelines! 🚀
