Metadata-Version: 2.4
Name: marco-dvcs
Version: 0.1.17
Summary: A minimal dataset versioning system for text data with a focus on reproducibility.
Home-page: https://github.com/Team-Marco-ACM/marco-package
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: Flask
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Marco Dataset Versioning System

A minimal dataset versioning system for text data with a strong focus on reproducibility and transparency. Treat your text datasets like code — immutable, versioned, reproducible, and explainable.

Marco acts as a lightweight Python library, meaning you can initialize it in *any* machine learning project folder to safely version and preprocess your datasets without altering your original files.

## 🚀 Installation (Linux / macOS)

On modern Linux distributions (such as Arch Linux or Ubuntu 23.04+), the system Python is marked as externally managed (PEP 668), so `pip` refuses to install packages into it by default. Install Marco inside a virtual environment to avoid conflicts with your system packages.

Follow these steps to safely install Marco into your ML project:

1. **Clone this repository** to your local machine:
   ```bash
   git clone https://github.com/your-username/marco.git
   cd marco
   ```

2. **Navigate to the ML project folder** where you want to train your model (e.g. your bag-of-words project):
   ```bash
   cd ~/projects/my-bag-of-words-model
   ```

3. **Create and activate a Python Virtual Environment**:
   ```bash
   # Create a virtual environment named 'venv'
   python3 -m venv venv
   
   # Activate it (You must do this every time you open a new terminal in this folder)
   source venv/bin/activate
   ```
   *(You should now see `(venv)` at the start of your terminal prompt!)*

4. **Install Marco**:
   ```bash
   # Point pip to the directory where you cloned the marco repository
   pip install -e /path/to/marco
   ```
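
If you want to confirm the setup from Python, the following is a minimal sketch. It assumes only the distribution name `marco-dvcs` from the package metadata; the helper names `in_virtualenv` and `installed_version` are hypothetical and not part of Marco.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv and sys.base_prefix at the base install.
    return sys.prefix != sys.base_prefix


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("virtual environment active:", in_virtualenv())
print("marco-dvcs version:", installed_version("marco-dvcs"))
```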

---

## 🛠️ Usage Guide

Once `marco` is installed in your virtual environment, you have access to the full CLI!

### 1. Initialize a Repository
Initialize Marco tracking in your current directory. This creates a `.marco/` data versioning environment specific to that project.
```bash
marco init
```

### 2. Create an Immutable Version 
Upload a text, CSV, or TSV dataset to create an immutable version. Marco computes a SHA-256 hash from the raw data combined with the preprocessing configuration.
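
To illustrate the idea of hashing data and configuration together, here is a minimal sketch. It is not Marco's actual implementation: the function name `version_hash`, the config keys, and the exact byte layout fed to the hash are all assumptions made for the example.

```python
import hashlib
import json


def version_hash(raw_bytes: bytes, config: dict) -> str:
    """Hash raw data together with a preprocessing config (illustrative only)."""
    # Serialize the config deterministically so key order cannot change the hash.
    config_bytes = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256()
    # Length-prefix the data so the boundary between data and config is unambiguous.
    digest.update(len(raw_bytes).to_bytes(8, "big"))
    digest.update(raw_bytes)
    digest.update(config_bytes)
    return digest.hexdigest()


data = b"id\ttext\n1\tHello world\n"
h1 = version_hash(data, {"lowercase": True, "dedupe": False})
h2 = version_hash(data, {"dedupe": False, "lowercase": True})   # same config, different key order
h3 = version_hash(data, {"lowercase": False, "dedupe": False})  # different config
print(h1 == h2, h1 == h3)  # prints: True False
```

Under this scheme, changing either the data or any configuration value yields a different identifier, which is what makes a version traceable to exactly how it was produced.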

**Interactive Mode:**
If you don't supply a configuration file, Marco interactively guides you through building the preprocessing pipeline (lowercasing, tokenization, stopword removal, deduplication).
```bash
marco upload my_dataset.csv -t v1-raw
```
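
The four steps named above can be sketched as a plain Python function. This is an assumption-laden illustration, not Marco's preprocessor: the `preprocess` function, its keyword flags, the whitespace tokenizer, and the tiny `STOPWORDS` set are all hypothetical.

```python
from typing import Iterable, List

# Tiny illustrative stopword list; a real pipeline would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "of"}


def preprocess(
    docs: Iterable[str],
    lowercase: bool = True,
    remove_stopwords: bool = True,
    dedupe: bool = True,
) -> List[List[str]]:
    """Apply lowercasing, whitespace tokenization, stopword removal, and deduplication."""
    seen = set()
    result = []
    for doc in docs:
        text = doc.lower() if lowercase else doc
        tokens = text.split()  # simple whitespace tokenization
        if remove_stopwords:
            tokens = [t for t in tokens if t.lower() not in STOPWORDS]
        key = tuple(tokens)
        if dedupe and key in seen:
            continue  # skip documents identical to one already kept
        seen.add(key)
        result.append(tokens)
    return result


docs = ["The cat sat", "the cat sat", "A dog and a bone"]
print(preprocess(docs))  # prints: [['cat', 'sat'], ['dog', 'bone']]
```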

**Config Mode:**
```bash
marco upload my_dataset.csv -c my_config.json -t v1-processed
```

### 3. List Versions
View all the versions you've created, along with their tags and timestamps.
```bash
marco list
```

### 4. Restore/Checkout Data
Extract the processed dataset from Marco's storage back into your active workspace to use for model training.
```bash
marco restore v1-processed -o ./training_data.tsv
```

### 5. Start the Interactive Dashboard (New!)
Marco includes a stunning, built-in Glassmorphism React UI to visualize your dataset evolution.
```bash
marco generate-web
```
*(This will automatically open your browser to `http://localhost:7654` and load your `.marco` repository).*

**Dashboard Features:**
- **Interactive Lineage Tree**: View the exact Git-style chronological history of your datasets in a visual tree.
- **Auto-Compare Mode**: Clicking any dataset instantly fetches its parent and calculates how the data changed (e.g. "+16% tokens", "-5% documents").
- **Red/Green DAG Tracking**: The graph draws process nodes with color-coded borders (red for token increases, green for token drops) so you can see at a glance how metrics diverge.
- **Alias Tagging**: Your custom `-t` tags (like `v1-raw`) are rendered as badges in the UI so you never lose track of hashes.
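
The comparison figures and border colors described in the list above come down to simple arithmetic. The sketch below shows one way to compute them; the function names and the `"neutral"` fallback for unchanged counts are assumptions, and the dashboard's real logic may differ.

```python
def percent_change(parent_count: int, child_count: int) -> str:
    """Format the relative change from parent to child, e.g. '+16%'."""
    if parent_count == 0:
        return "n/a"  # relative change is undefined for an empty parent
    change = (child_count - parent_count) / parent_count * 100
    return f"{change:+.0f}%"


def border_color(parent_tokens: int, child_tokens: int) -> str:
    """Red for token increases, green for token drops, as described above."""
    if child_tokens > parent_tokens:
        return "red"
    if child_tokens < parent_tokens:
        return "green"
    return "neutral"  # assumption: unchanged counts get no highlight


print(percent_change(1000, 1160), "tokens")    # prints: +16% tokens
print(percent_change(200, 190), "documents")   # prints: -5% documents
print(border_color(1000, 1160))                # prints: red
```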

### 6. Export / Import Versions
Easily share dataset versions with teammates by packing them into `.tar.gz` files.
```bash
# Export version 'v1-raw' to the 'exports' folder
marco export v1-raw ./exports/

# Import an archive sent to you by a coworker
marco import ./exports/marco_version_e5e0b767.tar.gz
```
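
For readers curious how such an archive round trip works, here is a minimal sketch using Python's standard `tarfile` module. The on-disk layout (one directory per version, named by its hash prefix) and the helper names are assumptions for illustration; Marco's archive contents may be organized differently.

```python
import tarfile
import tempfile
from pathlib import Path


def export_version(version_dir: Path, out_dir: Path) -> Path:
    """Pack one version directory into a .tar.gz archive and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"marco_version_{version_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(version_dir, arcname=version_dir.name)
    return archive


def import_version(archive: Path, dest_dir: Path) -> None:
    """Unpack regular files from an archive, refusing paths that escape dest_dir."""
    dest_root = dest_dir.resolve()
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # directories are created on demand below
            target = (dest_root / member.name).resolve()
            if dest_root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())


# Round trip inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    version = root / "store" / "e5e0b767"
    version.mkdir(parents=True)
    (version / "data.tsv").write_text("id\ttext\n1\thello\n")
    archive = export_version(version, root / "exports")
    import_version(archive, root / "imported")
    restored = (root / "imported" / "e5e0b767" / "data.tsv").read_text()

print(archive.name)
```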

### 7. Delete Versions
Delete a dataset version to recover disk space. Marco updates the lineage (the `parents` history) of any descendant versions so that the Git-style history tree stays intact.
```bash
marco delete v1-raw
# or
marco rm e5e0b767
```
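
One way to keep a history tree connected after a deletion is to re-point each child of the deleted version at that version's own parents. The sketch below illustrates this on an in-memory dictionary; the data shape and the `delete_version` helper are hypothetical, and only the `parents` field name is taken from the description above.

```python
def delete_version(versions: dict, target: str) -> None:
    """Remove `target` and splice its parents into each of its children."""
    removed_parents = versions.pop(target)["parents"]
    for meta in versions.values():
        if target not in meta["parents"]:
            continue
        new_parents = []
        for parent in meta["parents"]:
            # Replace the deleted version with its own parents; keep others as-is.
            replacements = removed_parents if parent == target else [parent]
            for candidate in replacements:
                if candidate not in new_parents:  # avoid duplicate edges
                    new_parents.append(candidate)
        meta["parents"] = new_parents


versions = {
    "a1": {"parents": []},
    "b2": {"parents": ["a1"]},
    "c3": {"parents": ["b2"]},
}
delete_version(versions, "b2")
print(versions["c3"]["parents"])  # prints: ['a1']
```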

### 8. KL Divergence & Token Analytics
Marco goes beyond tracking vocabulary size. It uses Kullback-Leibler (KL) divergence to measure how the token probability distribution shifts between two dataset versions. This indicates what kind of change happened, for example whether common words disappeared or rare domain terms came to dominate.

To compute this distribution delta, pass two version hashes or tags to the `token-analytics` command:
```bash
marco token-analytics v1-raw v2-processed
```
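
As a rough illustration of the underlying calculation, the sketch below computes `D_KL(P || Q) = sum over tokens of p * log(p / q)` for two token lists. The add-alpha smoothing, the natural logarithm, and the direction (first argument as `P`) are assumptions chosen for the example; the command's actual settings are not documented here.

```python
import math
from collections import Counter
from typing import Sequence


def kl_divergence(tokens_p: Sequence[str], tokens_q: Sequence[str], alpha: float = 1.0) -> float:
    """KL divergence D(P || Q) in nats between two smoothed token distributions."""
    p_counts, q_counts = Counter(tokens_p), Counter(tokens_q)
    vocab = set(p_counts) | set(q_counts)
    # Add-alpha smoothing keeps every probability non-zero so log(p / q) is defined.
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for token in vocab:
        p = (p_counts[token] + alpha) / p_total
        q = (q_counts[token] + alpha) / q_total
        divergence += p * math.log(p / q)
    return divergence


raw = "the cat sat on the mat the end".split()
processed = "cat sat mat end".split()  # e.g. after stopword removal
print(kl_divergence(raw, raw))        # identical distributions give 0.0
print(kl_divergence(raw, processed))  # a positive value: the distribution shifted
```

KL divergence is asymmetric, so swapping the two versions generally gives a different number.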

---

## 🧠 Architecture Overview

Marco decouples logic from the file system. All core engine operations sit inside `marco/core/`, including:
- `locker.py`: File-based concurrency control using `.lock` files.
- `repository.py`: CRUD operations for dataset versions and `refs.json` tagging.
- `preprocessor.py`: A robust Directed Acyclic Graph (DAG) preprocessing engine.
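
To show what file-based locking with `.lock` files can look like, here is a minimal sketch built on atomic exclusive file creation. It is not the code in `locker.py`: the `file_lock` context manager, its timeout behavior, and the `<path>.lock` naming are assumptions.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def file_lock(path: str, timeout: float = 5.0, poll: float = 0.05):
    """Hold an exclusive lock on `path` by creating `<path>.lock` atomically."""
    lock_path = f"{path}.lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one holder succeeds.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)  # release the lock for the next holder


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "refs.json")
    with file_lock(target) as lock_path:
        held = os.path.exists(lock_path)      # lock file exists while held
    released = not os.path.exists(lock_path)  # and is removed afterwards

print(held, released)  # prints: True True
```

A lock file left behind by a crashed process would block later callers until the timeout, so a production version would likely also need stale-lock detection.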

Have fun building safer machine learning pipelines!
